From: "Huang, Ying"
To: "Aneesh Kumar K.V"
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Wei Xu, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss@gmail.com, Jagdish Gediya
Subject: Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
References: <20220714045351.434957-1-aneesh.kumar@linux.ibm.com>
	<20220714045351.434957-2-aneesh.kumar@linux.ibm.com>
	<87bktq4xs7.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<3659f1bb-a82e-1aad-f297-808a2c17687d@linux.ibm.com>
	<87sfn2u0vy.fsf@linux.ibm.com>
Date: Mon, 18 Jul 2022 14:08:11 +0800
In-Reply-To: <87sfn2u0vy.fsf@linux.ibm.com> (Aneesh Kumar K. V.'s message of
	"Fri, 15 Jul 2022 15:57:13 +0530")
Message-ID: <87y1wr2bsk.fsf@yhuang6-desk2.ccr.corp.intel.com>

"Aneesh Kumar K.V" writes:

> Aneesh Kumar K V writes:
>
> ....
>
>>> You dropped the original sysfs interface patches from the series, but
>>> the kernel internal implementation is still for the original sysfs
>>> interface. For example, memory tier ID is for the original sysfs
>>> interface, not for the new proposed sysfs interface. So I suggest you
>>> implement with the new interface in mind. What do you think about
>>> the following design?
>>>
>>
>> Sorry, I am not able to follow you here. This patchset completely drops
>> exposing memory tiers to userspace via sysfs. Instead, it allows
>> creation of memory tiers with a specific tier ID from within the
>> kernel/device driver. The default tier ID is 200, and dax kmem creates
>> a memory tier with tier ID 100.
>>
>>> - Each NUMA node belongs to a memory type, and each memory type
>>> corresponds to an "abstract distance", so each NUMA node corresponds
>>> to a "distance". For simplicity, we can start with static distances,
>>> for example, DRAM (default): 150, PMEM: 250. The distance of each
>>> NUMA node can be recorded in a global array,
>>>
>>>   int node_distances[MAX_NUMNODES];
>>>
>>> or, just
>>>
>>>   pgdat->distance
>>>
>> I don't follow this. I guess you are trying to have a different design.
>> Would it be much easier if you could write this in the form of a patch?
>>
>>> - Each memory tier corresponds to a range of distances, for example,
>>> 0-100, 100-200, 200-300, >300; we can start with static ranges too.
>>>
>>> - The core API of the memory tier could be
>>>
>>>   struct memory_tier *find_create_memory_tier(int distance);
>>>
>>> it will find the memory tier which covers "distance" in the memory
>>> tier list, or create a new memory tier if not found.
>>>
>> I was expecting this to be internal to dax kmem: how dax kmem maps
>> "abstract distance" to a memory tier. At this point, this patchset is
>> keeping all that for a future patchset.
>>
> This shows how I was expecting "abstract distance" to be integrated.

Thanks! To make the first version as simple as possible, I think we can
just use a static "abstract distance" for dax_kmem, e.g., 250, because
we use it for PMEM only now. We can enhance dax_kmem later.
IMHO, we should make the core framework correct first.

- A device driver should report the capability (or performance level) of
  the hardware to the memory tier core via an abstract distance. This
  can be done via some global data structure (e.g. node_distances[]), at
  least in the first version.

- The memory tier core determines the mapping from abstract distance to
  memory tier via abstract distance ranges, and allocates the struct
  memory_tier when necessary. That is, the memory tier core, not the
  device drivers, determines whether to allocate a new memory tier or
  reuse an existing one for a NUMA node.

- It's better to place the NUMA node in the correct memory tier in the
  first place. We should avoid placing the PMEM node in the default tier
  and then moving it to the correct memory tier. That is, device drivers
  should report the abstract distance before onlining NUMA nodes.

Please check my reply to Wei too for my other suggestions for the first
version.

Best Regards,
Huang, Ying