Date: Mon, 30 Aug 2021 19:22:25 +0100
From: Matthew Wilcox
To: Johannes Weiner
Cc: "Darrick J. Wong", Linus Torvalds, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Andrew Morton
Subject: Re: [GIT PULL] Memory folios for v5.15
References: <20210826004555.GF12597@magnolia>

On Mon, Aug 30, 2021 at 01:32:55PM -0400, Johannes Weiner wrote:
> A lot of DC hosts nowadays are in a direct pipeline for handling user
> requests, which are highly parallelizable.
>
> They are much smaller, and there are a lot more of them than there are
> VMs in the world. The per-request and per-host margins are thinner,
> and the compute-to-memory ratio is more finely calibrated than when
> you're renting out large VMs that don't neatly divide up the machine.
>
> Right now, we're averaging ~1G of RAM per CPU thread for most of our
> hosts. You don't need a very large system - certainly not in the TB
> ballpark - where struct page takes up the memory budget of entire CPU
> threads. So now we have to spec memory for it, and spend additional
> capex and watts, or we'll end up leaving those CPU threads stranded.

So you're noticing at the level of a 64 thread machine (something like
a dual-socket Xeon Gold 5318H, which would have 2x18x2 = 72 threads).
Things certainly have changed, then.

> > The mistake you're making is coupling "minimum mapping granularity" with
> > "minimum allocation granularity".  We can happily build a system which
> > only allocates memory on 2MB boundaries and yet lets you map that memory
> > to userspace in 4kB granules.
>
> Yeah, but I want to do it without allocating 4k granule descriptors
> statically at boot time for the entirety of available memory.

Even that is possible when bumping the PAGE_SIZE to 16kB.  It needs
a bit of fiddling:

static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
			unsigned long addr, struct page *page, pgprot_t prot)
{
	if (!pte_none(*pte))
		return -EBUSY;
	/* Ok, finally just insert the thing.. */
	get_page(page);
	inc_mm_counter_fast(mm, mm_counter_file(page));
	page_add_file_rmap(page, false);
	set_pte_at(mm, addr, pte, mk_pte(page, prot));
	return 0;
}

mk_pte() assumes that a struct page refers to a single pte.  If we
revamped it to take (page, offset, prot), it could construct the
appropriate pte for the offset within that page.
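Something like this, perhaps (an untested sketch; mk_pte_offset() is a
made-up name, and the arithmetic assumes today's layout where a compound
page covers a contiguous run of 4kB pfns; with a 16kB PAGE_SIZE you
would shift by the 4kB hardware granule instead of PAGE_SHIFT):

static inline pte_t mk_pte_offset(struct page *page, unsigned long offset,
				  pgprot_t prot)
{
	/* Step from the head pfn to the granule containing 'offset'. */
	return pfn_pte(page_to_pfn(page) + (offset >> PAGE_SHIFT), prot);
}

so insert_page_into_pte_locked() would do:

	set_pte_at(mm, addr, pte, mk_pte_offset(page, offset, prot));

with 'offset' selecting which granule of the page this pte should map.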
---

Independent of _that_, the biggest problem we face (I think) in getting
rid of memmap is that it offers the pfn_to_page() lookup.  If we move
to a dynamically allocated descriptor for our arbitrarily-sized memory
objects, we need a tree to store them in.  Given the trees we currently
have, our best bet is probably the radix tree, but I dislike its glass
jaws.  I'm hoping that (again) the maple tree becomes stable soon enough
for us to dynamically allocate memory descriptors and store them in it
(a rough sketch of the lookup side is at the bottom of this mail).  And
that we don't discover a bootstrapping problem between kmalloc() (for
tree nodes) and memmap (to look up the page associated with a node).

But that's all a future problem and if we can't even take a first step
to decouple filesystems from struct page then working towards that
would be wasted effort.

> > > Willy says he has future ideas to make compound pages scale. But we
> > > have years of history saying this is incredibly hard to achieve - and
> > > it certainly wasn't for a lack of constant trying.
> >
> > I genuinely don't understand.  We have five primary users of memory
> > in Linux (once we're in a steady state after boot):
> >
> >  - Anonymous memory
> >  - File-backed memory
> >  - Slab
> >  - Network buffers
> >  - Page tables
> >
> > The relative importance of each one very much depends on your workload.
> > Slab already uses medium order pages and can be made to use larger.
> > Folios should give us large allocations of file-backed memory and
> > eventually anonymous memory.  Network buffers seem to be headed towards
> > larger allocations too.  Page tables will need some more thought, but
> > once we're no longer interleaving file cache pages, anon pages and
> > page tables, they become less of a problem to deal with.
> >
> > Once everybody's allocating order-4 pages, order-4 pages become easy
> > to allocate.  When everybody's allocating order-0 pages, order-4 pages
> > require the right 16 pages to come available, and that's really freaking
> > hard.
>
> Well yes, once (and iff) everybody is doing that. But for the
> foreseeable future we're expecting to stay in a world where the
> *majority* of memory is in larger chunks, while we continue to see 4k
> cache entries, anon pages, and corresponding ptes, yes?

No.  4k page table entries are demanded by the architecture, and there's
little we can do about that.  We can allocate them in larger chunks, but
let's not solve that problem in this email.  I can see a world where
anon memory is managed (by default, opportunistically) in larger chunks
within a year.  Maybe six months if somebody really works hard on it.

> Memory is dominated by larger allocations from the main workloads, but
> we'll continue to have a base system that does logging, package
> upgrades, IPC stuff, has small config files, small libraries, small
> executables. It'll be a while until we can raise the floor on those
> much smaller allocations - if ever.
>
> So we need a system to manage them living side by side.
>
> The slab allocator has proven to be an excellent solution to this
> problem, because the mailing lists are not flooded with OOM reports
> where smaller allocations fragmented the 4k page space. And even large
> temporary slab explosions (inodes, dentries etc.) are usually pushed
> back with fairly reasonable CPU overhead.

You may not see the bug reports, but they exist.  Right now, we have
a service that is echoing 2 to drop_caches every hour on systems which
are lightly loaded, otherwise the dcache swamps the entire machine and
takes hours or days to come back under control.
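Back to the pfn_to_page() problem above: the lookup side of a
tree-backed replacement for memmap could look roughly like this
(untested sketch; I'm using the XArray as a stand-in for whichever
tree wins, and struct mem_desc / mem_desc_tree / pfn_to_desc /
desc_insert are all invented names):

struct mem_desc;	/* dynamically allocated, arbitrarily-sized */

static DEFINE_XARRAY(mem_desc_tree);	/* indexed by pfn */

static struct mem_desc *pfn_to_desc(unsigned long pfn)
{
	/* One multi-index entry spans the whole allocation, so any
	 * pfn inside it returns the same descriptor.  RCU-safe. */
	return xa_load(&mem_desc_tree, pfn);
}

static int desc_insert(struct mem_desc *desc, unsigned long pfn,
		       unsigned long nr)
{
	/* Needs CONFIG_XARRAY_MULTI.  The GFP_KERNEL node allocation
	 * here is exactly where the kmalloc()/memmap bootstrapping
	 * question would bite. */
	return xa_err(xa_store_range(&mem_desc_tree, pfn, pfn + nr - 1,
				     desc, GFP_KERNEL));
}

The maple tree would slot into the same shape; the glass jaws I worry
about are those node allocations under memory pressure.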