To: Kent Overstreet, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Johannes Weiner, Matthew Wilcox, Linus Torvalds, Andrew Morton, "Darrick J. Wong", Christoph Hellwig, David Howells
From: David Hildenbrand
Organization: Red Hat
Subject: Re: Struct page proposal
Date: Thu, 23 Sep 2021 11:03:44 +0200
X-Mailing-List: linux-kernel@vger.kernel.org

On 23.09.21 03:21, Kent Overstreet wrote:
> One thing that's come out of the folios discussions with both Matthew and
> Johannes is that we seem to be thinking along similar lines regarding our end
> goals for struct page.
>
> The fundamental reason for struct page is that we need memory to be self
> describing, without any context - we need to be able to go from a generic
> untyped struct page and figure out what it contains: handling physical memory
> failure is the most prominent example, but migration and compaction are more
> common. We need to be able to ask the thing that owns a page of memory "hey,
> stop using this and move your stuff here".
>
> Matthew's helpfully been coming up with a list of page types:
> https://kernelnewbies.org/MemoryTypes
>
> But struct page could be a lot smaller than it is now. I think we can get it
> down to two pointers, which means it'll take up 0.4% of system memory. Both
> Matthew and Johannes have ideas for getting it down even further - the main
> thing to note is that virt_to_page() _should_ be an uncommon operation (most of
> the places we're currently using it are completely unnecessary, look at all the
> places we're using it on the zero page). Johannes is thinking two layer radix
> tree, Matthew was thinking about using maple trees - personally, I think that
> 0.4% of system memory is plenty good enough.
>
>
> Ok, but what do we do with the stuff currently in struct page?
> -------------------------------------------------------------
>
> The main thing to note is that since in normal operation most folios are going
> to be describing many pages, not just one - and we'll be using _less_ memory
> overall if we allocate them separately. That's cool.
>
> Of course, for this to make sense, we'll have to get all the other stuff in
> struct page moved into their own types, but file & anon pages are the big one,
> and that's already being tackled.
>
> Why two ulongs/pointers, instead of just one?
> ---------------------------------------------
>
> Because one of the things we really want and don't have now is a clean division
> between allocator and allocatee state. Allocator meaning either the buddy
> allocator or slab, allocatee state would be the folio or the network pool state
> or whatever actually called kmalloc() or alloc_pages().
>
> Right now slab state sits in the same place in struct page where allocatee state
> does, and the reason this is bad is that slab/slub are a hell of a lot faster
> than the buddy allocator, and Johannes wants to move the boundary between slab
> allocations and buddy allocator allocations up to like 64k. If we fix where slab
> state lives, this will become completely trivial to do.
>
> So if we have this:
>
> struct page {
> 	unsigned long	allocator;
> 	unsigned long	allocatee;
> };
>
> The allocator field would be used for either a pointer to slab/slub's state, if
> it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> allocation - like compound order today, and probably whether or not the
> (compound group of) pages is free.
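
To make the two-word layout above concrete, here is a minimal sketch in
userspace C, not kernel code. The bit layout for the buddy case (bit 0 as
a free flag, the next six bits as the order) and the helper names
buddy_encode(), buddy_order(), buddy_is_free() and page_overhead_percent()
are assumptions made purely for illustration; the proposal does not
specify an encoding, and how a slab pointer would be told apart from a
buddy encoding is left open here.

```c
#include <assert.h>
#include <stddef.h>

/* The two-word layout from the proposal. */
struct page {
	unsigned long allocator;
	unsigned long allocatee;
};

_Static_assert(sizeof(struct page) == 2 * sizeof(unsigned long),
	       "struct page must be exactly two words");

/* Hypothetical buddy encoding: bit 0 = free flag, bits 1..6 = order. */
#define BUDDY_FREE_BIT		0x1UL
#define BUDDY_ORDER_SHIFT	1
#define BUDDY_ORDER_MASK	0x3fUL

/* Pack an allocation order and a free flag into the allocator word. */
static inline unsigned long buddy_encode(unsigned int order, int is_free)
{
	assert(order <= BUDDY_ORDER_MASK);
	return ((unsigned long)order << BUDDY_ORDER_SHIFT) |
	       (is_free ? BUDDY_FREE_BIT : 0UL);
}

/* Recover the allocation order from the allocator word. */
static inline unsigned int buddy_order(unsigned long allocator)
{
	return (unsigned int)((allocator >> BUDDY_ORDER_SHIFT) &
			      BUDDY_ORDER_MASK);
}

/* Report whether the (compound group of) pages is marked free. */
static inline int buddy_is_free(unsigned long allocator)
{
	return (allocator & BUDDY_FREE_BIT) != 0;
}

/* Share of memory spent on struct page for a given base page size. */
static inline double page_overhead_percent(size_t page_size)
{
	return 100.0 * (double)sizeof(struct page) / (double)page_size;
}
```

With 8-byte longs and 4 KiB pages, page_overhead_percent(4096) works out
to 16/4096, roughly 0.39%, which is where the 0.4% figure quoted above
comes from.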
>
> The allocatee field would be used for a type tagged pointer (using the low bits
> of the pointer) to one of:
> - struct folio
> - struct anon_folio, if that becomes a thing
> - struct network_pool_page
> - struct pte_page
> - struct zone_device_page
>
> Then we can further refactor things until all the stuff that's currently crammed
> in struct page lives in types where each struct field means one and precisely
> one thing, and also where we can freely reshuffle and reorganize and add stuff
> to the various types where we couldn't before because it'd make struct page
> bigger.
>
> Other notes & potential issues:
> - page->compound_dtor needs to die
>
> - page->rcu_head moves into the types that actually need it, no issues there
>
> - page->refcount has question marks around it. I think we can also just move it
>   into the types that need it; with RCU derefing the pointer to the folio or
>   whatever and grabbing a ref on folio->refcount can happen under an RCU read
>   lock - there's no real question about whether it's technically possible to
>   get it out of struct page, and I think it would be cleaner overall that way.
>
>   However, depending on how it's used from code paths that go from generic
>   untyped pages, I could see it turning into more of a hassle than it's worth.
>   More investigation is needed.
>
> - page->memcg_data - I don't know whether that one more properly belongs in
>   struct page or in the page subtypes - I'd love it if Johannes could talk
>   about that one.
>
> - page->flags - dealing with this is going to be a huge hassle but also where
>   we'll find some of the biggest gains in overall sanity and readability of the
>   code. Right now, PG_locked is super special and ad hoc and I have run into
>   situations multiple times (and Johannes was in vehement agreement on this
>   one) where I simply could not figure out the behaviour of the current code re:
>   who is responsible for locking pages without instrumenting the code with
>   assertions.
>
>   Meaning anything we do to create and enforce module boundaries between
>   different chunks of code is going to suck, but the end result should be
>   really worthwhile.
>
> Matthew Wilcox and David Howells have been having conversations on IRC about
> what to do about other page bits. It appears we should be able to kill a lot of
> filesystem usage of both PG_private and PG_private_2 - filesystems in general
> hang state off of page->private, soon to be folio->private, and PG_private in
> current use just indicates whether page->private is nonzero - meaning it's
> completely redundant.
>

Don't get me wrong, but before there are answers to some of the very
basic questions raised above (especially everything that lives in
page->flags, which are not only page flags, refcount, ...) this isn't
very tempting to spend more time on, from a reviewer perspective.

-- 
Thanks,

David / dhildenb