Date: Tue, 28 Apr 2015 13:20:39 -0400
From: Jerome Glisse
To: Christoph Lameter
Cc: "Paul E. McKenney", Benjamin Herrenschmidt, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, jglisse@redhat.com, mgorman@suse.de, aarcange@redhat.com,
    riel@redhat.com, airlied@redhat.com, aneesh.kumar@linux.vnet.ibm.com,
    Cameron Buschardt, Mark Hairgrove, Geoffrey Gerfin, John McKenna,
    akpm@linux-foundation.org
Subject: Re: Interacting with coherent memory on external devices

On Tue, Apr 28, 2015 at 09:18:55AM -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Jerome Glisse wrote:
>
> > > is the mechanism that DAX relies on in the VM.
> >
> > Which would require far more changes than you seem to think. First, using
> > MIXED|PFNMAP means we lose any kind of memory accounting, and forget about
> > memcg too. Second, it means we would need to set those flags on all vmas,
> > which kind of points out that something must be wrong here. You would also
> > need vm_ops for all those vmas (including for anonymous private vmas,
> > which sounds like it will break quite a few places that test for that).
> > Then you have to think about vmas that already have vm_ops: you would need
> > to override them to handle the device-memory case and forward the other
> > cases to the existing vm_ops. Extra layering, extra complexity.
>
> These vmas would only be used for those sections of memory that use
> memory in the coprocessor. Special memory accounting etc. can be done at
> the device driver layer. Multiple processes would be able to use different
> GPU contexts (or devices), which provides proper isolation.
>
> memcg is about accounting for regular memory, and this is not regular
> memory. It looks like one would need a lot of special casing in
> the VM if one wanted to handle e.g. GPU memory as regular memory under
> Linux.

Well, I showed that this does not need many changes; refer to:

  http://lwn.net/Articles/597289/

More specifically:

  http://thread.gmane.org/gmane.linux.kernel.mm/116584

The idea here is that even if device memory is a special kind of memory, we
still want to account it properly against the process, i.e. an anonymous page
that is in device memory is still accounted as a regular anonymous page for
memcg (the same applies to file-backed pages). With that, existing memcg
keeps working as intended and process memory use is properly accounted. This
does not prevent the device driver from performing its own accounting of
device memory and from allowing or blocking migration for a given process.
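To make that concrete, here is a minimal sketch (the function name
hmm_migrate_anon_page() is made up for illustration; this is not the actual
HMM patch) of a device-resident page going through the same memcg charge
path as any regular anonymous page, using the try/commit charge API from
Johannes's rework:

#include <linux/memcontrol.h>
#include <linux/mm.h>

/*
 * Sketch only: charge a page that will live in device memory exactly
 * like a regular anonymous page, so memcg accounting stays intact.
 */
static int hmm_migrate_anon_page(struct mm_struct *mm,
                                 struct page *device_page, gfp_t gfp)
{
        struct mem_cgroup *memcg;

        /* Same try-charge path a regular anonymous page takes. */
        if (mem_cgroup_try_charge(device_page, mm, gfp, &memcg))
                return -ENOMEM;

        /* ... copy data and install the special device pte here ... */

        mem_cgroup_commit_charge(device_page, memcg, false);
        return 0;
}

The device driver remains free to layer its own device-memory accounting on
top of this; the point is only that the memcg side sees nothing unusual.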
At this point we do not think it is meaningful to move such accounting to a
common layer. The bottom line is: we want to keep existing memcg accounting
intact, and we want to reflect remote memory as regular memory. Note that the
memcg changes would be even smaller now that Johannes has cleaned up and
simplified memcg. I have not rebased that part of HMM yet.

> > I think at this point there is nothing more to discuss here. It is pretty
> > clear to me that any solution using block device/MIXEDMAP would be far
> > more complex and far more intrusive. I do not mind being proven wrong,
> > but I will certainly not waste my time trying to implement such a
> > solution.
>
> The device driver method is the current solution used by the GPUs and
> that would be the natural starting point for development. And they do not
> currently add code to the core vm. I think we first need to figure out if
> we cannot do what you want through that method.

We do need a different solution; I have been working on this for the last
two years for a reason.

Requirement: _no_ special allocator in userspace, so that all kinds of
memory (anonymous, shared, file-backed) can be used and migrated to device
memory in a transparent fashion for the application. No special allocator
implies no special vma, and thus no special vm_ops. So we need to hook into
a few places inside the mm code, with minor changes, to deal with the
special CPU pte entry of migrated memory (on page fault, fork, and
writeback). For all those places it is just a matter of adding:

        if (new_special_pte)
                new_helper_function();

(a concrete sketch of this pattern follows below my signature).

The other solution would have been to introduce yet another vm_ops that
supersedes the existing vm_ops. This works for the page fault path, but it
requires more changes for fork, and major changes for writeback. Hence why
the first solution was favored.

I explored many different paths before going down the road I am on, and all
you are doing is hand-waving ideas without even considering any of the
objections I formulated. I explained why your idea cannot work, or why it
would require more extensive and more complex changes than the solution we
are proposing.

Cheers,
Jérôme
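P.S.: To illustrate the hook pattern above, a minimal sketch follows. The
helpers is_hmm_entry() and hmm_handle_fault() are made-up stand-ins, not the
actual HMM patch; the structure mirrors how do_swap_page() already
special-cases non-swap entries such as migration entries:

#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/swapops.h>

/* Hypothetical stand-ins for the real HMM helpers: */
bool is_hmm_entry(swp_entry_t entry);
int hmm_handle_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                     unsigned long address, pte_t *ptep, pmd_t *pmd,
                     unsigned int flags);

/*
 * Sketch of the page-fault hook: a not-present pte whose swap entry is
 * a special HMM entry marks memory that currently lives on the device.
 */
static int sketch_do_swap_page(struct mm_struct *mm,
                               struct vm_area_struct *vma,
                               unsigned long address, pte_t *ptep,
                               pmd_t *pmd, unsigned int flags,
                               pte_t orig_pte)
{
        swp_entry_t entry;

        if (pte_present(orig_pte))
                return 0;       /* present ptes are handled elsewhere */

        entry = pte_to_swp_entry(orig_pte);
        if (unlikely(non_swap_entry(entry))) {
                /* The one-line hook described above: */
                if (is_hmm_entry(entry))
                        return hmm_handle_fault(mm, vma, address,
                                                ptep, pmd, flags);
                /* ... existing migration/hwpoison handling ... */
        }
        /* ... existing swap-in path ... */
        return 0;
}

The same one-line pattern would repeat, with different helpers, at the fork
and writeback sites.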