From: Jonathan Adams
Date: Wed, 23 Oct 2019 16:11:39 -0700
Subject: Re: [RFC] Memory Tiering
To: Dave Hansen
Cc: Linux-MM, LKML, "Williams, Dan J", "Verma, Vishal L", Wu Fengguang, Huang Ying
List-ID: linux-kernel@vger.kernel.org

On Wed, Oct 16, 2019 at 1:05 PM Dave Hansen wrote:
>
> The memory hierarchy is getting more complicated and the kernel is
> playing an increasing role in managing the different tiers.  A few
> different groups of folks described "migration" optimizations they were
> doing in this area at LSF/MM earlier this year.  One of the questions
> folks asked was why autonuma wasn't being used.
>
> At Intel, the primary new tier that we're looking at is persistent
> memory (PMEM).  We'd like to be able to use "persistent memory"
> *without* using its persistence properties, treating it as slightly
> slower DRAM.  Keith Busch has some patches to use NUMA migration to
> automatically migrate DRAM->PMEM instead of discarding it near the end
> of the reclaim process.  Huang Ying has some patches which use a
> modified autonuma to migrate frequently-used data *back* from PMEM->DRAM.
>
> We've tried to do this all generically so that it is not tied to
> persistent memory and can be applied to any memory types in lots of
> topologies.
>
> We've been running this code in various forms for the past few months,
> comparing it to pure DRAM and hardware-based caching.  The initial
> results are encouraging and we thought others might want to take a look
> at the code or run their own experiments.  We're expecting to post the
> individual patches soon.  But, until then, the code is available here:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git
>
> and is tagged with "tiering-0.2", aka. d8e31e81b1dca9.

Hi Dave,

Thanks for sharing this git link and information on your approach.
This is interesting, and lines up somewhat with the approach Google
has been investigating.  As we discussed at LSF/MM[1] and Linux
Plumbers[2], we're working on an approach which integrates with our
proactive reclaim work, with a similar attitude toward PMEM (use it as
"slightly slower" DRAM, ignoring its persistence).

The prototype we have has a similar basic structure to what you're
doing here and to Yang Shi's patchset from March[3] (separate NUMA
nodes for PMEM), but it relies on a fair amount of kernel changes to
control allocations from the NUMA nodes, and uses an "is_far" NUMA
flag similar to Yang Shi's approach.  We're redesigning to reduce the
scope of kernel changes and to remove the "is_far" special handling;
we haven't settled on a final approach yet, but one basic part we want
to keep from the prototype is proactively pushing PMEM data back to
DRAM when we've noticed it's in use.
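
To make that concrete, here's a rough userspace sketch of the
scan-and-promote idea (illustrative only; this is not our prototype
code, and all names and types here are made up).  The loop
periodically walks pages on the PMEM node, test-and-clears an
accessed bit, and promotes any page that was touched since the last
scan:

#include <stdbool.h>
#include <stdio.h>

struct page {
        bool accessed;  /* stand-in for the hardware accessed bit */
        int nid;        /* node the page currently lives on */
};

/* Stand-in for the kernel's page migration machinery. */
static void promote_page(struct page *p, int dram_nid)
{
        p->nid = dram_nid;
}

static void promotion_scan(struct page *pages, int npages,
                           int pmem_nid, int dram_nid)
{
        for (int i = 0; i < npages; i++) {
                struct page *p = &pages[i];

                if (p->nid != pmem_nid)
                        continue;
                if (p->accessed) {
                        p->accessed = false;  /* test-and-clear */
                        promote_page(p, dram_nid);
                }
        }
}

int main(void)
{
        struct page pages[] = {
                { .accessed = true,  .nid = 1 },  /* hot on PMEM: promoted */
                { .accessed = false, .nid = 1 },  /* cold on PMEM: stays   */
                { .accessed = true,  .nid = 0 },  /* already on DRAM       */
        };

        promotion_scan(pages, 3, /* pmem_nid */ 1, /* dram_nid */ 0);

        for (int i = 0; i < 3; i++)
                printf("page %d is on node %d\n", i, pages[i].nid);
        return 0;
}

The real thing would of course key off the hardware accessed bits and
go through the kernel's migration machinery; the sketch is just the
shape of the policy.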

If we look at a two-socket system:

  A: DRAM & CPU node for socket 0
  B: PMEM node for socket 0
  C: DRAM & CPU node for socket 1
  D: PMEM node for socket 1

then instead of the unidirectional approach your patches go for:

  A is marked as "in reclaim, push pages to" B
  C is marked as "in reclaim, push pages to" D
  B & D have no markings

we would have a bidirectional attachment:

  A is marked "move cold pages to" B
  B is marked "move hot pages to" A
  C is marked "move cold pages to" D
  D is marked "move hot pages to" C
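
Put another way, each node carries at most one "demote" link and one
"promote" link.  A minimal sketch of that link structure (again,
invented names, not actual kernel code):

#include <stdio.h>

#define NID_NONE (-1)

struct tier_node {
        const char *name;
        int demote_target;   /* "move cold pages to" this node */
        int promote_target;  /* "move hot pages to" this node */
};

/* The two-socket example above: A/C are DRAM+CPU, B/D are PMEM. */
static struct tier_node nodes[] = {
        { .name = "A", .demote_target = 1,        .promote_target = NID_NONE },
        { .name = "B", .demote_target = NID_NONE, .promote_target = 0 },
        { .name = "C", .demote_target = 3,        .promote_target = NID_NONE },
        { .name = "D", .demote_target = NID_NONE, .promote_target = 2 },
};

int main(void)
{
        for (int i = 0; i < 4; i++) {
                struct tier_node *n = &nodes[i];

                printf("%s: demote -> %s, promote -> %s\n", n->name,
                       n->demote_target >= 0 ?
                               nodes[n->demote_target].name : "none",
                       n->promote_target >= 0 ?
                               nodes[n->promote_target].name : "none");
        }
        return 0;
}

Reclaim on a node would follow demote_target, and the proactive scan
above would follow promote_target.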

By using autonuma for moving PMEM pages back to DRAM, you avoid
needing the B->A and D->C links, at the cost of migrating the pages
back synchronously at page-fault time (assuming my understanding of
how autonuma works is accurate).

Our approach still lets you have multiple levels of hierarchy for a
given socket (you could imagine an "E" node with the same relation to
"B" as "B" has to "A"), but it doesn't make it easy to represent,
say, an "E" which was equally close to all sockets (which I could
imagine for something like remote memory on GenZ or what-have-you),
since there wouldn't be a single back link; there would need to be
something like your autonuma support to achieve that.

Does that make sense?

Thanks,
- Jonathan

[1] Shakeel's talk; I can't find a link at the moment.  The basic
    kstaled/kreclaimd approach we built upon is discussed in
    https://blog.acolyer.org/2019/05/22/sw-far-memory/ and the linked
    ASPLOS paper.
[2] https://linuxplumbersconf.org/event/4/contributions/561/; slides at
    https://linuxplumbersconf.org/event/4/contributions/561/attachments/363/596/Persistent_Memory_as_Memory.pdf
[3] https://lkml.org/lkml/2019/3/23/10