Message-ID: <3c7c9b07-78b2-4b8d-968e-0c395c8f22b3@arm.com>
Date: Mon, 3 Jun 2024 18:28:51 +0100
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)
To: Jonathan Cameron, Dan Williams
Cc: Dongsheng Yang, Gregory Price, John Groves, axboe@kernel.dk,
    linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-cxl@vger.kernel.org, nvdimm@lists.linux.dev, Mark Rutland
References: <20240508164417.00006c69@Huawei.com>
 <3d547577-e8f2-8765-0f63-07d1700fcefc@easystack.cn>
 <20240509132134.00000ae9@Huawei.com>
 <664cead8eb0b6_add32947d@dwillia2-mobl3.amr.corp.intel.com.notmuch>
 <8f161b2d-eacd-ad35-8959-0f44c8d132b3@easystack.cn>
 <5db870de-ecb3-f127-f31c-b59443b4fbb4@easystack.cn>
 <20240530143813.00006def@Huawei.com>
 <665a9402445ee_166872941d@dwillia2-mobl3.amr.corp.intel.com.notmuch>
 <20240603134819.00001c5f@Huawei.com>
From: James Morse
In-Reply-To: <20240603134819.00001c5f@Huawei.com>

Hi guys,

On 03/06/2024 13:48, Jonathan Cameron wrote:
> On Fri, 31 May 2024 20:22:42 -0700
> Dan Williams wrote:
>> Jonathan Cameron wrote:
>>> On Thu, 30 May 2024 14:59:38 +0800
>>> Dongsheng Yang wrote:
>>>> On Wednesday 2024/5/29 at 11:25 PM, Gregory Price wrote:
>>>>> It's not just a CXL spec issue, though that is part of it. I think the
>>>>> CXL spec would have to expose some form of puncturing flush, and this
>>>>> makes the assumption that such a flush doesn't cause some kind of
>>>>> race/deadlock issue. Certainly this needs to be discussed.
>>>>>
>>>>> However, consider that the upstream processor actually has to generate
>>>>> this flush. This means adding the flush to existing coherence protocols,
>>>>> or at the very least a new instruction to generate the flush explicitly.
>>>>> The latter seems more likely than the former.
>>>>>
>>>>> This flush would need to ensure the data is forced out of the local WPQ
>>>>> AND all WPQs south of the PCIe complex - because what you really want to
>>>>> know is that the data has actually made it back to a place where remote
>>>>> viewers are capable of perceiving the change.
>>>>>
>>>>> So this means:
>>>>> 1) Spec revision with puncturing flush
>>>>> 2) Buy-in from CPU vendors to generate such a flush
>>>>> 3) A new instruction added to the architecture.
>>>>>
>>>>> Call me in a decade or so.
>>>>>
>>>>> But really, I think it likely we see hardware-coherence well before this.
>>>>> For this reason, I have become skeptical of all but a few memory-sharing
>>>>> use cases that depend on software-controlled cache-coherency.
>>>>
>>>> Hi Gregory,
>>>>
>>>> From my understanding, we actually have the same idea here. What I am
>>>> saying is that we need the spec to consider this issue, meaning we need
>>>> to describe how the entire software-coherency mechanism operates, which
>>>> includes the necessary hardware support. Additionally, I agree that if
>>>> software-coherency also requires hardware support, it seems that
>>>> hardware-coherency is the better path.
>>>>>
>>>>> There are some (FAMFS, for example). The coherence state of these
>>>>> systems tends to be less volatile (e.g. mappings are read-only), or
>>>>> they have inherent design limitations (cacheline-sized message passing
>>>>> via write-ahead logging only).
>>>>
>>>> Can you explain more about this? I understand that if the reader in the
>>>> writer-reader model is using a read-only mapping, the interaction will
>>>> be much simpler. However, after the writer writes data, if we don't have
>>>> a mechanism to flush and invalidate, puncturing all caches, how can the
>>>> read-only reader access the new data?
>>>
>>> There is a mechanism for doing coarse-grained flushing that is known to
>>> work on some architectures. Look at cpu_cache_invalidate_memregion().
>>> On intel/x86 it's wbinvd_on_all_cpus().
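
For anyone reaching for it: the in-tree usage pattern is roughly the
below. A sketch only - the wrapper name is made up here; see
include/linux/memregion.h and the CXL core for the real usage:

#include <linux/errno.h>
#include <linux/ioport.h>
#include <linux/memregion.h>

static int invalidate_cxl_caches(void)
{
	/* Not every architecture can implement this (x86: wbinvd) */
	if (!cpu_cache_has_invalidate_memregion())
		return -ENXIO;

	/* Write back + invalidate CPU caches for all IORES_DESC_CXL ranges */
	return cpu_cache_invalidate_memregion(IORES_DESC_CXL);
}
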
>>
>> There is no guarantee on x86 that after cpu_cache_invalidate_memregion()
>> a remote shared-memory consumer can be assured to see the writes from
>> that event.
>
> I was wondering about that after I wrote this... I guess it guarantees
> we won't get a late-landing write, or is that not even true?
>
> So if we remove memory, then add fresh memory again quickly enough,
> can we get a left-over write showing up? I guess that doesn't matter, as
> the kernel will chase it with a memset(0) anyway and that will be ordered
> with respect to writes to the same address.
>
> However, we won't be able to elide that zeroing even if we know the
> device did it, which makes some operations the device might support
> rather pointless :(
>
>>> On arm64 it's a PSCI firmware call, CLEAN_INV_MEMREGION (there is a
>>> public alpha specification for PSCI 1.3 with that defined, but we
>>> don't yet have kernel code.)

I have an RFC for that - but I haven't had time to update and re-test it.
If you need this, and have a platform where it can be implemented, please
get in touch with the people that look after the specs to move it along
from alpha.

>> That punches visibility through CXL shared memory devices?
>
> It's a draft spec, and Mark + James in +CC can hopefully confirm.
> It does say:
> "Cleans and invalidates all caches, including system caches".
> which I'd read as meaning it should, but good to confirm.

It's intended to remove any cached entries - including lines in what the
Arm ARM calls "invisible" system caches, which typically only platform
firmware can touch. The next access should have to go all the way to the
media. (I don't know enough about CXL to say what a remote shared memory
consumer observes.)

Without it, all we have are the by-VA operations, which are painfully
slow for large regions, and insufficient for system caches.

As with all these firmware interfaces - it's for the platform implementer
to wire up whatever is necessary to remove cached content for the
specified range. Just because there is an (alpha!) spec doesn't mean it
can be supported efficiently by a particular platform.
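
To give a feel for the shape of the kernel side, something like the
below. Treat it as a sketch: the function ID is a placeholder, the
argument layout is only what the alpha spec suggests, and none of this
is in mainline:

#include <linux/arm-smccc.h>
#include <linux/errno.h>
#include <linux/types.h>
#include <uapi/linux/psci.h>

/* Placeholder - use whatever ID the final PSCI 1.3 spec assigns */
#define PSCI_FN64_CLEAN_INV_MEMREGION	0xc4000012

static int clean_inv_memregion(phys_addr_t base, size_t size)
{
	struct arm_smccc_res res;

	/*
	 * Ask firmware to clean+invalidate [base, base + size), including
	 * the "invisible" system caches that by-VA operations can't reach.
	 */
	arm_smccc_1_1_invoke(PSCI_FN64_CLEAN_INV_MEMREGION, base, size, 0,
			     &res);

	return res.a0 == PSCI_RET_SUCCESS ? 0 : -EOPNOTSUPP;
}
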
>>> These are very big hammers and so unsuited for anything fine-grained.

You forgot really ugly, too!

>>> In the extreme end of possible implementations they briefly stop all
>>> CPUs and clean and invalidate all caches of all types. So not suited
>>> to anything fine-grained, but may be acceptable for a rare setup event,
>>> particularly if the main job of the writing host is to fill that memory
>>> for lots of other hosts to use.
>>>
>>> At least the ARM one takes a range so allows for a less painful
>>> implementation.

That is to allow some ranges to fail (e.g. you can do this to the CXL
windows, but not the regular DRAM).

On the less painful implementation: Arm's interconnect has a gadget that
does "address-based flush", which could be used here. I'd hope platforms
with that don't need to interrupt all CPUs - but it depends on what else
needs to be done.

>>> I'm assuming we'll see new architecture over time,
>>> but this is a different (and potentially easier) problem space
>>> to what you need.
>>
>> cpu_cache_invalidate_memregion() is only about making sure the local CPU
>> sees new contents after a DPA:HPA remap event. I hope CPUs are able to
>> get away from that responsibility long term, when/if future memory
>> expanders just issue back-invalidate automatically when the HDM decoder
>> configuration changes.
>
> I would love that to be the way things go, but I fear the overheads of
> doing that on the protocol mean people will want the option of the
> painful approach.

Thanks,

James