From: Oliver
Date: Tue, 6 Mar 2018 11:49:02 +1100
Subject: Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
To: Logan Gunthorpe
Cc: Keith Busch, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
    linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
    linux-nvdimm@lists.01.org, linux-block@vger.kernel.org, Jens Axboe,
    Benjamin Herrenschmidt, Alex Williamson, Jérôme Glisse,
    Jason Gunthorpe, Bjorn Helgaas, Max Gurtovoy, Christoph Hellwig
On Tue, Mar 6, 2018 at 4:10 AM, Logan Gunthorpe wrote:
>
> On 05/03/18 09:00 AM, Keith Busch wrote:
>>
>> On Mon, Mar 05, 2018 at 12:33:29PM +1100, Oliver wrote:
>>>
>>> On Thu, Mar 1, 2018 at 10:40 AM, Logan Gunthorpe wrote:
>>>>
>>>> @@ -429,10 +429,7 @@ static void __nvme_submit_cmd(struct nvme_queue *nvmeq,
>>>>  {
>>>>         u16 tail = nvmeq->sq_tail;
>>>
>>>> -       if (nvmeq->sq_cmds_io)
>>>> -               memcpy_toio(&nvmeq->sq_cmds_io[tail], cmd, sizeof(*cmd));
>>>> -       else
>>>> -               memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>>> +       memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));
>>>
>>> Hmm, how safe is replacing memcpy_toio() with a regular memcpy()? On PPC
>>> the _toio() variant enforces alignment, does the copy with 4-byte
>>> stores, and has a full barrier after the copy. In comparison, our
>>> regular memcpy() does none of those things and may use unaligned and
>>> vector load/stores. For normal (cacheable) memory that is perfectly
>>> fine, but they can cause alignment faults when targeted at MMIO
>>> (cache-inhibited) memory.
>>>
>>> I think in this particular case it might be ok since we know SQEs are
>>> aligned to 64-byte boundaries and the copy is too small to use our
>>> vectorised memcpy(). I'll assume we don't need explicit ordering
>>> between writes of SQEs since the existing code doesn't seem to care
>>> unless the doorbell is being rung, so you're probably fine there too.
>>> That said, I still think this is a little bit sketchy and at the very
>>> least you should add a comment explaining what's going on when the CMB
>>> is being used. If someone more familiar with the NVMe driver could
>>> chime in I would appreciate it.
>>
>> I may not be understanding the concern, but I'll give it a shot.
>>
>> You're right, the start of any SQE is always 64-byte aligned, so that
>> should satisfy the alignment requirements.
>>
>> The order when writing multiple/successive SQEs in a submission queue
>> does matter, and this is currently serialized through the q_lock.
>>
>> The order in which the bytes of a single SQE are written doesn't really
>> matter, as long as the entire SQE is written into the CMB prior to
>> writing that SQ's doorbell register.
>>
>> The doorbell register is written immediately after copying a command
>> entry into the submission queue (ignoring "shadow buffer" features),
>> so the ratio of doorbell writes to submitted commands is 1:1.
>>
>> If CMB SQE and DB ordering is not enforced by the memcpy, then we do
>> need a barrier after the SQE's memcpy and before the doorbell's writel.
>
> Thanks for the information Keith.
>
> Adding to this: regular memcpy generally also enforces alignment, as
> unaligned access to regular memory is typically bad in some way on most
> arches. The generic memcpy_toio also does not have any barrier, as it is
> just a call to memcpy. Arm64 also does not appear to have a barrier in
> its implementation, and in the short survey I did I could not find any
> implementation with a barrier. I also did not find a ppc implementation
> in the tree, but it would be weird for it to add a barrier when other
> arches do not appear to need it.

It's in arch/powerpc/kernel/io.c as _memcpy_toio() and it has two full
barriers! Awesome!

Our io.h indicates that our iomem accessors are designed to provide
x86-ish strong ordering of accesses to MMIO space. The git log indicates
arch/powerpc/kernel/io.c has barely been touched in the last decade, so
odds are most of that code was written in the elder days when people were
less aware of ordering issues. It might just be overly conservative by
today's standards, but maybe not (see below).
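To make the missing-barrier case concrete, the pattern Keith described
above would look roughly like this (just a sketch, not the actual
nvme-pci code; the q_db doorbell pointer and tail handling are assumed
here for illustration):

	/* Copy the whole SQE into the CMB with a plain memcpy() ... */
	memcpy(&nvmeq->sq_cmds[tail], cmd, sizeof(*cmd));

	/* ... then make sure the SQE stores are visible to the device
	 * before the MMIO doorbell write. memcpy_toio() gave us this
	 * for free (on ppc at least); a plain memcpy() does not. */
	wmb();
	writel(nvmeq->sq_tail, nvmeq->q_db);

Whether the explicit wmb() is actually needed comes down to how strongly
a given arch's writel() orders itself against prior uncached stores,
which is more or less the question being discussed here.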
> We've been operating on the assumption that memory mapped by
> devm_memremap_pages() can be treated as regular memory. This is
> emphasized by the fact that it does not return an __iomem pointer.

I think the lack of an __iomem annotation is a bit of a red herring. The
comment header for devm_memremap_pages() outlines the conditions under
which using it is valid, the important one being:

> * 4/ res is expected to be a host memory range that could feasibly be
> *    treated as a "System RAM" range, i.e. not a device mmio range, but
> *    this is not enforced.

Granted, the CMB is an MMIO buffer rather than a register range, but from
my perspective it's still a device MMIO range rather than something we
could feasibly treat as system RAM, for two big reasons:

a) System RAM is cacheable and coherent while the CMB is neither, and

b) on PPC, MMIO accesses are in a separate ordering domain to cacheable
   memory accesses.

The latter isn't as terrifying as it sounds, since a full barrier will
order everything with everything, but it still causes problems. For
performance reasons we want to minimise the number of cases where we need
to use a sync instruction (full barrier) in favour of a lightweight sync
(lwsync), which only orders cacheable accesses. To implement that, our
MMIO accessors set a per-cpu flag that arch_spin_unlock() checks to see
whether any MMIO accesses have occurred that would require a full
barrier. If we haven't done any MMIO operations, then it will only do a
lwsync.

My concern here is that a driver might do something like this:

	void *buffer = devm_memremap_pages(...);

	spin_lock(&lock);
	memcpy(buffer, source_data, size); // buffer is in MMIO space, so the writes are uncached
	spin_unlock(&lock);                // no MMIO access detected, so it only does a lwsync

As far as I know, the spinlock API guarantees that accesses occurring
inside the critical section are ordered relative to what happens outside
the critical section. In this case we would be violating that guarantee,
and I don't see a clean way to keep the fix for it contained to
arch/powerpc/. Fundamentally, we need to know when something is accessing
MMIO space.

(I'm not going to suggest ditching the lwsync trick. mpe is not going to
take that patch without a really good reason.)

> If this assumption does not hold for an arch then we cannot support P2P
> DMA without an overhaul of many kernel interfaces or creating other
> backend interfaces into the drivers which take different data types
> (i.e. we'd have to bypass the entire block layer when trying to write
> data in p2pmem to an nvme device). This is very undesirable.

I know. I'm not trying to ruin your day, but the separation between io
and normal memory is there for a reason. Maybe we can relax that
separation, but we need to be careful about how we do it.

Oliver