Date: Thu, 17 May 2018 13:41:03 -0700
From: Matthew Wilcox
To: Sinan Kaya
Cc: linux-mm@kvack.org, timur@codeaurora.org, linux-arm-msm@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org, open list
Subject: Re: [PATCH] mm/dmapool: localize page allocations
Message-ID: <20180517204103.GJ26718@bombadil.infradead.org>
References: <1526578581-7658-1-git-send-email-okaya@codeaurora.org>
    <20180517181815.GC26718@bombadil.infradead.org>
    <9844a638-bc4e-46bd-133e-0c82a3e9d6ea@codeaurora.org>
    <20180517194612.GG26718@bombadil.infradead.org>

On Thu, May 17, 2018 at 04:05:45PM -0400, Sinan Kaya wrote:
> On 5/17/2018 3:46 PM, Matthew Wilcox wrote:
> >> Remember that the CPU core that is running this driver is most probably
> >> on the same NUMA node as the device itself.
> > Umm ... says who?  If my process is running on NUMA node 5 and I submit
> > an I/O, it should be allocating from a pool on node 5, not from a pool
> > on whichever node the device is attached to.
>
> OK, let's do an exercise.  Maybe I'm missing something in the big picture.

Sure.

> If a user process is running on node 5, it submits some work to the
> hardware via the block layer, which is eventually invoked by a syscall.
>
> Whatever buffer the process is using, it gets copied into kernel space as
> it crosses the userspace/kernel boundary.
>
> The block layer packages a block request with the kernel pointers and
> makes a request to the NVMe driver for consumption.
>
> Last time I checked, the dma_alloc_coherent() API uses the locality
> information from the device, not from the CPU, for the allocation.

Yes, it does.  I wonder why that is; it doesn't actually make any sense.
It'd be far more sensible to allocate it on memory local to the user than
memory local to the device.

> While the metadata for dma_pool points to the currently running CPU core,
> the DMA buffer itself is created using the device node today, without my
> patch.

Umm ... dma_alloc_coherent memory is for metadata about the transfer, not
for the memory used for the transaction itself.

> I would think that you actually want to run the process on the same NUMA
> node as the CPU and the device for performance reasons.  Otherwise,
> performance expectations should be low.

That's foolish.  Consider a database appliance with four sockets, each
with its own memory and I/O devices attached.  You can't tell the user to
shard the database into four pieces and have each socket only work on the
quarter of the database available to it.  They may as well buy four
smaller machines.  The point of buying a large NUMA machine is to use all
of it.

Let's try a different example.  I have a four-socket system with one NVMe
device that has lots of hardware queues.  Each CPU has its own queue
assigned to it.  If I allocate all the PRP metadata on the socket the NVMe
device is attached to, I'm sending a lot of coherency traffic in the
direction of that socket, in addition to the actual data.  If the PRP
lists are allocated randomly on the various sockets, the traffic heads all
over the fabric.  If the PRP lists are allocated on the local socket, the
only time those lists move off this node is when the device requests them.
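
To make the two placement policies being argued about concrete, here is a
minimal sketch (an illustration only, not taken from the thread and not the
patch under discussion; the struct and function names are invented) of a
pool-page allocator that can put its bookkeeping either on the device's NUMA
node or on the node of the CPU submitting the I/O:

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>
#include <linux/topology.h>
#include <linux/types.h>

/* Hypothetical bookkeeping for one DMA-able page; the real dmapool
 * internals (struct dma_page in mm/dmapool.c) differ. */
struct pool_page {
        void            *vaddr;         /* kernel address of the DMA buffer */
        dma_addr_t      dma;            /* bus address handed to the device */
};

static struct pool_page *pool_page_alloc(struct device *dev, size_t size,
                                         bool cpu_local)
{
        /* Device-local: dev_to_node(dev), the policy dma_alloc_coherent()
         * itself follows for the buffer.  CPU-local: numa_node_id(), the
         * policy argued for above for per-I/O metadata such as PRP lists. */
        int nid = cpu_local ? numa_node_id() : dev_to_node(dev);
        struct pool_page *page;

        page = kmalloc_node(sizeof(*page), GFP_KERNEL, nid);
        if (!page)
                return NULL;

        /* The coherent buffer below is still placed near the device
         * regardless of nid; only the bookkeeping struct moved. */
        page->vaddr = dma_alloc_coherent(dev, size, &page->dma, GFP_KERNEL);
        if (!page->vaddr) {
                kfree(page);
                return NULL;
        }
        return page;
}

Note that the nid choice only moves the small bookkeeping allocation; the
coherent DMA buffer itself still lands next to the device, which is exactly
the behaviour the exchange above questions for metadata-heavy workloads.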