From: "Huang, Ying"
To: Jonathan Adams
Cc: Dave Hansen, Linux-MM, LKML, "Williams, Dan J", "Verma, Vishal L", Wu Fengguang
Subject: Re: [RFC] Memory Tiering
Date: Fri, 25 Oct 2019 11:40:12 +0800
In-Reply-To: (Jonathan Adams's message of "Wed, 23 Oct 2019 16:11:39 -0700")
Message-ID: <87k18th4rn.fsf@yhuang-dev.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Jonathan Adams writes:

> On Wed, Oct 16, 2019 at 1:05 PM Dave Hansen wrote:
>>
>> The memory hierarchy is getting more complicated and the kernel is
>> playing an increasing role in managing the different tiers.  A few
>> different groups of folks described "migration" optimizations they were
>> doing in this area at LSF/MM earlier this year.  One of the questions
>> folks asked was why autonuma wasn't being used.
>>
>> At Intel, the primary new tier that we're looking at is persistent
>> memory (PMEM).  We'd like to be able to use "persistent memory"
>> *without* using its persistence properties, treating it as slightly
>> slower DRAM.  Keith Busch has some patches to use NUMA migration to
>> automatically migrate DRAM->PMEM instead of discarding it near the end
>> of the reclaim process.  Huang Ying has some patches which use a
>> modified autonuma to migrate frequently-used data *back* from PMEM->DRAM.
>>
>> We've tried to do this all generically so that it is not tied to
>> persistent memory and can be applied to any memory types in lots of
>> topologies.
>>
>> We've been running this code in various forms for the past few months,
>> comparing it to pure DRAM and hardware-based caching.  The initial
>> results are encouraging and we thought others might want to take a look
>> at the code or run their own experiments.  We're expecting to post the
>> individual patches soon.  But, until then, the code is available here:
>>
>>   https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git
>>
>> and is tagged with "tiering-0.2", aka. d8e31e81b1dca9.
>
> Hi Dave,
>
> Thanks for sharing this git link and information on your approach.
> This is interesting, and lines up somewhat with the approach Google has
> been investigating.  As we discussed at LSF/MM[1] and Linux
> Plumbers[2], we're working on an approach which integrates with our
> proactive reclaim work, with a similar attitude to PMEM (use it as
> "slightly slower" DRAM, ignoring its persistence).  The prototype we
> have has a similar basic structure to what you're doing here and Yang
> Shi's patchset from March[3] (separate NUMA nodes for PMEM), but
> relies on a fair amount of kernel changes to control allocations from
> the NUMA nodes, and uses a similar "is_far" NUMA flag to Yang Shi's
> approach.
>
> We're working on redesigning to reduce the scope of kernel changes and
> to remove the "is_far" special handling; we still haven't refined
> down to a final approach, but one basic part we want to keep from the
> prototype is proactively pushing PMEM data back to DRAM when we've
> noticed it's in use.  If we look at a two-socket system:
>
>   A: DRAM & CPU node for socket 0
>   B: PMEM node for socket 0
>   C: DRAM & CPU node for socket 1
>   D: PMEM node for socket 1
>
> instead of the unidirectional approach your patches go for:
>
>   A is marked as "in reclaim, push pages to" B
>   C is marked as "in reclaim, push pages to" D
>   B & D have no markings
>
> we would have a bidirectional attachment:
>
>   A is marked "move cold pages to" B
>   B is marked "move hot pages to" A
>   C is marked "move cold pages to" D
>   D is marked "move hot pages to" C
>
> By using autonuma for moving PMEM pages back to DRAM, you avoid
> needing the B->A & D->C links, at the cost of migrating the pages
> back synchronously at pagefault time (assuming my understanding of how
> autonuma works is accurate).

Yes, it is synchronous.  But it avoids most of the time-consuming work,
such as direct reclaim and page lock acquisition, so the latency will
not become uncontrollable.
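To make the two attachment schemes above concrete, a minimal sketch of
how the per-node links could be represented is below.  The node, type,
and field names are hypothetical illustrations, not the actual code in
the tiering tree:

	/*
	 * Sketch of the two attachment schemes for the two-socket
	 * example above (hypothetical names, illustration only).
	 */
	enum node_id { NODE_A, NODE_B, NODE_C, NODE_D, NR_NODES, NO_NODE = -1 };

	struct tier_link {
		int demote_to;		/* where reclaim pushes cold pages */
		int promote_to;		/* where hot pages are pushed back */
	};

	/* Unidirectional: demotion links only, promotion left to autonuma. */
	static const struct tier_link unidir[NR_NODES] = {
		[NODE_A] = { .demote_to = NODE_B,  .promote_to = NO_NODE },
		[NODE_B] = { .demote_to = NO_NODE, .promote_to = NO_NODE },
		[NODE_C] = { .demote_to = NODE_D,  .promote_to = NO_NODE },
		[NODE_D] = { .demote_to = NO_NODE, .promote_to = NO_NODE },
	};

	/* Bidirectional: explicit back links for proactive promotion. */
	static const struct tier_link bidir[NR_NODES] = {
		[NODE_A] = { .demote_to = NODE_B,  .promote_to = NO_NODE },
		[NODE_B] = { .demote_to = NO_NODE, .promote_to = NODE_A },
		[NODE_C] = { .demote_to = NODE_D,  .promote_to = NO_NODE },
		[NODE_D] = { .demote_to = NO_NODE, .promote_to = NODE_C },
	};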
> Our approach still lets you have multiple levels of hierarchy for a
> given socket (you could imagine an "E" node with the same relation to
> "B" as "B" has to "A"), but doesn't make it easy to represent (say) an
> "E" which was equally close to all sockets (which I could imagine for
> something like remote memory on GenZ or what-have-you), since there
> wouldn't be a single back link; there would need to be something like
> your autonuma support to achieve that.

If hot pages in PMEM are identified by checking the Accessed ("A") bit
in the PTE, there is no information about which CPU is accessing the
pages.  One way to mitigate this is to combine it with the original
AutoNUMA.  For example, a hot page may be migrated

  B -> A: via PMEM hot page promotion
  A -> C: via the original AutoNUMA

(a rough sketch of the Accessed-bit check is appended after the
references below).

> Does that make sense?

Yes.  Definitely.

Best Regards,
Huang, Ying

> Thanks,
> - Jonathan
>
> [1] Shakeel's talk, I can't find a link at the moment.  The basic
>     kstaled/kreclaimd approach we built upon is talked about in
>     https://blog.acolyer.org/2019/05/22/sw-far-memory/ and the linked
>     ASPLOS paper
> [2] https://linuxplumbersconf.org/event/4/contributions/561/; slides at
>     https://linuxplumbersconf.org/event/4/contributions/561/attachments/363/596/Persistent_Memory_as_Memory.pdf
> [3] https://lkml.org/lkml/2019/3/23/10
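P.S.  The rough sketch of the Accessed-bit check mentioned above.  This
is purely illustrative, not the actual promotion patches: it only uses
the generic ptep_test_and_clear_young() accessor, and the function name
pmem_pte_is_hot() is made up for this example.

	#include <linux/mm.h>

	/*
	 * Illustration only: treat a PMEM page as "hot" if its PTE has the
	 * Accessed ("A") bit set.  The bit only says the page was referenced,
	 * not which CPU/node referenced it, which is why the promotion target
	 * is simply the DRAM node of the same socket (B -> A); cross-socket
	 * placement is then left to the original AutoNUMA (A -> C).
	 */
	static bool pmem_pte_is_hot(struct vm_area_struct *vma,
				    unsigned long addr, pte_t *ptep)
	{
		/* Test and clear the Accessed bit; set means a recent access. */
		return ptep_test_and_clear_young(vma, addr, ptep);
	}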