From: Dan Williams
Date: Mon, 7 May 2018 20:52:57 -0700
Subject: Re: [External] Re: [RFC PATCH v1 0/6] use mm to manage NVDIMM (pmem) zone
To: Huaisheng HS1 Ye
Cc: Jeff Moyer, Matthew Wilcox, Michal Hocko, linux-nvdimm, Tetsuo Handa,
    NingTing Cheng, Dave Hansen, Linux Kernel Mailing List,
    pasha.tatashin@oracle.com, Linux MM, colyli@suse.de, Johannes Weiner,
    Andrew Morton, Sasha Levin, Mel Gorman, Vlastimil Babka, Mikulas Patocka
List-ID: linux-kernel@vger.kernel.org

On Mon, May 7, 2018 at 7:59 PM, Huaisheng HS1 Ye wrote:
>>
>> Dan Williams writes:
>>
>>> On Mon, May 7, 2018 at 11:46 AM, Matthew Wilcox wrote:
>>>> On Mon, May 07, 2018 at 10:50:21PM +0800, Huaisheng Ye wrote:
>>>>> Traditionally, NVDIMMs are treated by the mm (memory management)
>>>>> subsystem as the DEVICE zone, which is a virtual zone whose start
>>>>> and end pfn are both equal to 0; mm doesn't manage NVDIMM directly
>>>>> the way it manages DRAM. Instead the kernel uses the corresponding
>>>>> drivers, which live under drivers/nvdimm/ and drivers/acpi/nfit/,
>>>>> plus filesystems, to implement NVDIMM memory allocation and freeing
>>>>> on top of the memory hot-plug implementation.
>>>>
>>>> You probably want to let linux-nvdimm know about this patch set.
>>>> Adding to the cc.
>>>
>>> Yes, thanks for that!
>>>
>>>> Also, I only received patch 0 and 4.  What happened to 1-3, 5 and 6?
>>>>
>>>>> With the current kernel, many of mm's classical features like the
>>>>> buddy system, the swap mechanism and the page cache are not
>>>>> supported for NVDIMM. What we are doing is to expand the kernel
>>>>> mm's capacity so that it can handle NVDIMM like DRAM. Furthermore,
>>>>> we make mm treat DRAM and NVDIMM separately, which means mm can put
>>>>> only the critical pages into the NVDIMM
>>
>> Please define "critical pages."
>>
>>>>> zone; for this we created a new zone type, the NVM zone. That is to
>>>>> say, traditional (or normal) pages are stored in the DRAM scope,
>>>>> i.e. the Normal, DMA32 and DMA zones, but the critical pages, which
>>>>> we hope can be recovered after a power failure or system crash, are
>>>>> made persistent by storing them in the NVM zone.
>>
>> [...]
>>
>>> I think adding yet one more mm-zone is the wrong direction. Instead,
>>> what we have been considering is a mechanism to allow a device-dax
>>> instance to be given back to the kernel as a distinct numa node
>>> managed by the VM. It seems it is time to dust off those patches.
>>
>> What's the use case?  The above patch description seems to indicate an
>> intent to recover contents after a power loss.  Without seeing the
>> whole series, I'm not sure how that's accomplished in a safe or
>> meaningful way.
>>
>> Huaisheng, could you provide a bit more background?
>>
>
> Currently in our mind, an ideal use scenario is that we put all page
> caches into zone_nvm. Without any doubt, the page cache is an efficient
> and common cache implementation, but it has the disadvantage that all
> dirty data within it is at risk of being lost to a power failure or
> system crash. If we put all page caches on NVDIMMs, all dirty data will
> be safe.
>
> And most importantly, the page cache is different from dm-cache or
> bcache. The page cache lives in mm, so it performs much better than
> other write caches, which sit at the storage level.

Can you be more specific? I think the only fundamental performance
difference between page cache and a block caching driver is that page
cache pages can be DMA'ed directly to lower-level storage. However, I
believe that problem is solvable, i.e. we can teach dm-cache to perform
the equivalent of in-kernel direct-I/O when transferring data between
the cache and the backing storage when the cache is comprised of
persistent memory.
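
As a rough illustration of that idea (this is not dm-cache code and is
not part of this patch set): if the cache device is persistent memory
that the pmem driver already exposes as struct pages, a caching target
could write a cached block back to the origin device by pointing a bio
directly at the pmem page, so the origin device DMAs straight out of
persistent memory with no DRAM bounce buffer. Going the other
direction, memcpy_flushcache() is the existing primitive for filling
the cache with persistent stores. The helper below and its parameter
names are hypothetical:

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch only: synchronously write one cached PAGE_SIZE block, resident
 * in persistent memory, back to the origin block device.  The bio points
 * at the pmem page itself, so the device transfers directly from
 * persistent memory.
 */
static int writeback_from_pmem_cache(struct block_device *origin_bdev,
				     struct page *cache_page,
				     sector_t origin_sector)
{
	struct bio *bio;
	int ret;

	bio = bio_alloc(GFP_NOIO, 1);
	if (!bio)
		return -ENOMEM;

	bio_set_dev(bio, origin_bdev);
	bio->bi_iter.bi_sector = origin_sector;
	bio->bi_opf = REQ_OP_WRITE;
	bio_add_page(bio, cache_page, PAGE_SIZE, 0);

	ret = submit_bio_wait(bio);	/* synchronous for simplicity */
	bio_put(bio);
	return ret;
}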

>
> At present we have realized the NVM zone, supported on a two-socket
> (NUMA) product based on the Lenovo Purley platform, and we can extend
> an NVM flag into the page cache allocation interface, so that all of
> the system's page caches are stored safely on NVDIMM.
>
> Now we are focusing on how to recover the data in the page cache after
> power-on. That is, the dirty pages would be safe and a lot of the
> cache-training cost would be saved, because many pages were already
> stored in ZONE_NVM before the power failure.

I don't see how ZONE_NVM fits into a persistent page cache solution.
All of the mm structures that maintain the page cache are built to be
volatile. Once you build the infrastructure to persist and restore the
state of the page cache, it is no longer the traditional page cache;
i.e. it will become something much closer to dm-cache or a filesystem.

One nascent idea from Dave Chinner is to teach xfs how to be a block
server for an upper-level filesystem. His aim is sub-volume and
snapshot support, but I wonder if caching could be adapted into that
model? In any event, I think persisting and restoring cache state needs
to be designed before deciding if changes to the mm are needed.
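
To make the point about volatile mm structures concrete, a minimal
lookup sketch (the helper is hypothetical; find_get_page() is the stock
pagemap API): even if a data page were allocated from a persistent
zone, finding it again goes through the inode, the address_space and
the radix-tree nodes, all ordinary volatile kernel allocations that do
not survive a reboot.

#include <linux/pagemap.h>

/*
 * Sketch only: the index that locates a cached page lives in volatile
 * memory regardless of where the page itself was allocated.
 */
static struct page *peek_cached_page(struct address_space *mapping,
				     pgoff_t index)
{
	struct page *page;

	page = find_get_page(mapping, index);	/* walks a volatile radix tree */
	if (!page)
		return NULL;		/* no index entry, nothing to "recover" */

	/* caller must drop the reference with put_page() when done */
	return page;
}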