Date: Fri, 24 Apr 2015 11:58:39 -0500 (CDT)
From: Christoph Lameter
X-X-Sender: cl@gentwo.org
To: Jerome Glisse
Cc: Benjamin Herrenschmidt, paulmck@linux.vnet.ibm.com,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, jglisse@redhat.com,
    mgorman@suse.de, aarcange@redhat.com, riel@redhat.com,
    airlied@redhat.com, aneesh.kumar@linux.vnet.ibm.com,
    Cameron Buschardt, Mark Hairgrove, Geoffrey Gerfin, John McKenna,
    akpm@linux-foundation.org
Subject: Re: Interacting with coherent memory on external devices
In-Reply-To: <20150424164325.GD3840@gmail.com>
References: <1429664686.27410.84.camel@kernel.crashing.org>
            <20150422163135.GA4062@gmail.com>
            <1429756456.4915.22.camel@kernel.crashing.org>
            <20150423161105.GB2399@gmail.com>
            <20150424150829.GA3840@gmail.com>
            <20150424164325.GD3840@gmail.com>
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 24 Apr 2015, Jerome Glisse wrote:

> > What exactly is the more advanced version's benefit? What are the
> > features that the other platforms do not provide?
>
> Transparent access to device memory from the CPU: you can map any of the
> GPU memory on the CPU side and get full cache coherency, including
> proper atomic memory operations. CAPI is not some mumbo jumbo marketing
> name; there is real hardware behind it.

Got the hardware here, but I am getting pretty sobered given what I have
heard here. The IBM mumbo jumbo marketing comes down to "not much" now.
> On x86 you have to take into account the PCI BAR size, and you also
> have to take into account that PCIe transactions are really bad when it
> comes to sharing memory with the CPU. CAPI really improves things here.

Ok, that would be interesting for the general device driver case. Can you
show a real performance benefit of CAPI transactions vs. PCIe
transactions?

> So on x86, even if you could map all the GPU memory, it would still be
> a bad solution, and things like atomic memory operations might not even
> work properly.

That is solvable and doable in many other ways if needed. Actually I'd
prefer a Xeon Phi in that case, because then we also have the same
instruction set. Having locks work right with different instruction sets
and different coherency schemes... Ewww.

> > Then you have the problem of fast memory access, and you are
> > proposing to complicate that access path on the GPU.
>
> No, I am proposing a solution where people with such workloads can
> leverage the GPU. Yes, it will not be as fast as hand-tuning and
> rewriting their application for the GPU, but it will still be faster by
> a significant factor than using only the CPU.

Well, general-purpose processors are also gaining more floating point
capability, which increases the pressure on accelerators to become more
specialized.

> Moreover, I am saying that this can happen without touching a single
> line of code of many, many applications, because many of them rely on
> libraries, and those are the only ones that would need to know about
> the GPU.

Yeah. We have heard this numerous times in parallel computing, and it
never really worked out.

> Finally, I am saying that having a unified address space between the
> GPU and the CPU is a primordial prerequisite for this to happen in a
> transparent fashion, and thus the DAX solution is nonsense and does not
> provide transparent address space sharing.
> The DAX solution is not even something new; this is how today's stack
> works. There is no need for DAX: userspace just mmaps the device driver
> file, and that is how it accesses the GPU-accessible memory (which in
> most cases is just system memory mapped through the device file into
> the user application).

Right, this is how things work today, and you could improve on that. Stay
with that scheme. Why would it not work to map things the same way in
both environments, if both the accelerator and the host processor can
access each other's memory?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/