Date: Wed, 27 Mar 2019 21:09:54 +0100
From: Michal Hocko
To: Yang Shi
Cc: Dan Williams, Mel Gorman, Rik van Riel, Johannes Weiner,
	Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu,
	"Du, Fan", "Huang, Ying", Linux MM, Linux Kernel Mailing List
Subject: Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
Message-ID: <20190327193918.GP11927@dhcp22.suse.cz>
References: <1553316275-21985-1-git-send-email-yang.shi@linux.alibaba.com>
	<20190326135837.GP28406@dhcp22.suse.cz>
	<43a1a59d-dc4a-6159-2c78-e1faeb6e0e46@linux.alibaba.com>
	<20190326183731.GV28406@dhcp22.suse.cz>
	<20190327090100.GD11927@dhcp22.suse.cz>

On Wed 27-03-19 11:59:28, Yang Shi wrote:
> On 3/27/19 10:34 AM, Dan Williams wrote:
> > On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko wrote:
> > > On Tue 26-03-19 19:58:56, Yang Shi wrote:
[...]
> > > > It is still NUMA; users can still see all the NUMA nodes.
> > > No, the Linux NUMA implementation makes all NUMA nodes available by
> > > default and provides an API to opt in for finer tuning. What you are
> > > suggesting goes against that semantic, and I am asking why. How is a
> > > pmem NUMA node any different from any other distant node in
> > > principle?
> > Agree. It's just another NUMA node and shouldn't be special-cased.
> > Userspace policy can choose to avoid it, but the typical node-distance
> > preference should otherwise let the kernel fall back to it as
> > additional memory-pressure relief for "near" memory.
>
> In the ideal case, yes, I agree. However, in the real world performance
> is a concern. It is well known that PMEM (not considering NVDIMM-F or
> HBM) has higher latency and lower bandwidth than DRAM. We observed much
> higher latency on PMEM than on DRAM with multiple threads.

One rule of thumb is: do not design user-visible interfaces around the
current technology and its upsides and downsides. That will almost
always backfire. Btw. you keep arguing about performance without giving
any numbers. Can you present something specific?

> In a real production environment we don't know what kind of
> applications will end up on PMEM (DRAM may be full and allocations fall
> back to PMEM) and then suffer unexpected performance degradation. I
> understand that mempolicy can be used to avoid it. But there might be
> hundreds or thousands of applications running on the machine; it does
> not sound feasible to have every single application set a mempolicy to
> avoid it.

We have the cpuset cgroup controller to help here (see the sketch below
the quoted text).

> So, I think we still need a default allocation node mask. The default
> value may include all nodes or just the DRAM nodes. But the user should
> be able to override it globally, not only on a per-process basis.
>
> Due to the performance disparity, our current use cases treat PMEM as
> second-tier memory for demoting cold pages or for binding applications
> that are not sensitive to memory-access latency (this is the reason for
> inventing a new mempolicy), although it is a NUMA node.
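A minimal sketch of that cpuset route, assuming cgroup v1 with the
cpuset controller mounted at /sys/fs/cgroup/cpuset, node 0 being the
DRAM node, and a 64-CPU machine; the group name "dram_only" is made up
for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return -1;
	}
	if (write(fd, val, strlen(val)) < 0) {
		perror(path);
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

int main(void)
{
	char pid[16];

	/* Create the cpuset; an already existing directory is fine. */
	mkdir("/sys/fs/cgroup/cpuset/dram_only", 0755);

	/* CPUs stay unrestricted (0-63 assumed here); memory is
	 * restricted to node 0, the DRAM node in this example. */
	write_str("/sys/fs/cgroup/cpuset/dram_only/cpuset.cpus", "0-63");
	write_str("/sys/fs/cgroup/cpuset/dram_only/cpuset.mems", "0");

	/* Move the current task in; children inherit the cpuset. */
	snprintf(pid, sizeof(pid), "%d", getpid());
	return write_str("/sys/fs/cgroup/cpuset/dram_only/tasks", pid) ? 1 : 0;
}

Every task moved into (or forked inside) that group has its page
allocations confined to node 0, so one setting per group of jobs
replaces a per-binary mempolicy.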
If the performance sucks that badly, then do not use pmem as a NUMA
node, really. There are certainly other ways to export the pmem
storage. Use it as fast swap storage. Or work on a swap caching
mechanism that still allows much faster access than slow swap storage.
But do not abuse the NUMA interface while breaking some of its
long-established semantics.

-- 
Michal Hocko
SUSE Labs
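A minimal sketch of the fast-swap alternative mentioned above, assuming
the pmem region is exposed as the block device /dev/pmem0 and has
already been prepared with mkswap; the device name and priority value
are illustrative, and swapon(2) requires CAP_SYS_ADMIN:

#include <stdio.h>
#include <sys/swap.h>

int main(void)
{
	/* Highest swap priority (SWAP_FLAG_PRIO_MASK == 32767) so the
	 * kernel prefers this device over slower disk-backed swap. */
	int flags = SWAP_FLAG_PREFER |
		    ((32767 << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK);

	if (swapon("/dev/pmem0", flags) != 0) {
		perror("swapon /dev/pmem0");
		return 1;
	}
	return 0;
}

The shell equivalent is "swapon -p 32767 /dev/pmem0"; the high priority
makes the kernel spill cold pages to pmem before touching any slower
swap device that may also be configured.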