Date: Fri, 24 Apr 2015 11:03:52 -0500 (CDT)
From: Christoph Lameter
To: Jerome Glisse
cc: Benjamin Herrenschmidt, paulmck@linux.vnet.ibm.com,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, jglisse@redhat.com,
    mgorman@suse.de, aarcange@redhat.com, riel@redhat.com,
    airlied@redhat.com, aneesh.kumar@linux.vnet.ibm.com,
    Cameron Buschardt, Mark Hairgrove, Geoffrey Gerfin, John McKenna,
    akpm@linux-foundation.org
Subject: Re: Interacting with coherent memory on external devices
In-Reply-To: <20150424150829.GA3840@gmail.com>

On Fri, 24 Apr 2015, Jerome Glisse wrote:

> On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote:
> > On Thu, 23 Apr 2015, Jerome Glisse wrote:
> >
> > > No, this has not been solved properly. Today's solution is to do an
> > > explicit copy, again and again, and when complex data structures are
> > > involved (list, tree, ...) this is extremely tedious and hard to debug.
> > > So today's solutions often restrict themselves to easy things like
> > > matrix multiplication. But if you provide a unified address space then
> > > you make things a lot easier for a lot more use cases.
> > > That's a fact, and again OpenCL 2.0, which is an industry standard,
> > > is proof that a unified address space is one of the most important
> > > features requested by users of GPGPU. You might not care but the rest
> > > of the world does.
> >
> > You could use page tables on the kernel side to transfer data on demand
> > from the GPU. And you can use a device driver to establish mappings to the
> > GPU's memory.
> >
> > There is no copy needed with these approaches.
>
> So you are telling me to do get_user_page()? If so, are you aware that this
> pins memory? So what happens when the GPU wants to access a range of 32GB of
> memory? I pin everything?

Use either a device driver to create PTEs pointing to the data or do
something similar to what DAX does. Pinning can be avoided if you use
mmu_notifiers. Those will give you a callback before the OS removes the
data and thus you can operate without pinning.

> Overall the throughput of the GPU will stay close to its theoretical maximum
> if you have enough other threads that can progress, and this is very common.

GPUs operate on groups of threads, not single ones. If you stall then there
will be a stall of a whole group of them. We are dealing with accelerators
here that are different for performance reasons. They are not to be
treated like regular processors, nor does their memory operate like host
memory.

> But IBM here wants to go further and provide a more advanced solution,
> so their needs are specific to their platform and we cannot know if AMD,
> ARM or Intel will want to go down the same road; they do not seem to be
> interested. Does it mean we should not support IBM? I think it would be
> wrong.

What exactly is the more advanced version's benefit? What are the features
that the other platforms do not provide?

> > This sounds more like a case for a general purpose processor. If it is a
> > special device then it will typically also have special memory to allow
> > fast searches.
>
> No, this kind of thing can be fast on a GPU. With a GPU you easily have 500x
> more cores than CPU cores, so you can slice the dataset even more and have
> each of the GPU cores perform the search. Note that I am not only thinking
> of a stupid memcmp here; it can be something more complex, like searching
> for a pattern that allows variation and that requires a whole program to
> decide if a chunk falls under the variation rules or not.

Then you have the problem of fast memory access and you are proposing to
complicate that access path on the GPU.