Date: Mon, 27 Apr 2015 11:47:29 -0400
From: Jerome Glisse
To: Christoph Lameter
Cc: "Paul E. McKenney", Benjamin Herrenschmidt,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, jglisse@redhat.com,
    mgorman@suse.de, aarcange@redhat.com, riel@redhat.com,
    airlied@redhat.com, aneesh.kumar@linux.vnet.ibm.com,
    Cameron Buschardt, Mark Hairgrove, Geoffrey Gerfin, John McKenna,
    akpm@linux-foundation.org
Subject: Re: Interacting with coherent memory on external devices
Message-ID: <20150427154728.GA26980@gmail.com>
References: <20150424150829.GA3840@gmail.com>
    <20150424164325.GD3840@gmail.com>
    <20150424171957.GE3840@gmail.com>
    <20150424192859.GF3840@gmail.com>
    <20150425114633.GI5561@linux.vnet.ibm.com>

On Mon, Apr 27, 2015 at 10:08:29AM -0500, Christoph Lameter wrote:
> On Sat, 25 Apr 2015, Paul E. McKenney wrote:
>
> > Would you have a URL or other pointer to this code?
>
> linux/mm/migrate.c
>
> > > > Without modifying a single line of mm code, the only way to do this is to
> > > > either unmap from the cpu page table the range being migrated or to mprotect
> > > > it in some way. In both case the cpu access will trigger some kind of fault.
> > >
> > > Yes that is how Linux migration works. If you can fix that then how about
> > > improving page migration in Linux between NUMA nodes first?
> >
> > In principle, that also would be a good thing. But why do that first?
>
> Because it would benefit a lot of functionality that today relies on page
> migration to have a faster more reliable way of moving pages around.

I do not think that in the CAPI case there is any way to improve on the
current low-level page migration. I am talking about:
  - write protect & TLB flush
  - copy
  - update page table & TLB flush

The upper level that holds the migration logic would, however, need some
changes. As Paul said, it needs some kind of new metric and also a new way
to gather statistics from the device instead of from the CPU. I think the
device can provide better information than the current logic, where pages
are unmapped and the kernel looks at which CPU faults on a page first. It
also needs a way to feed hints provided by userspace, through the device
driver, into the NUMA decision process.

So I do not think that anything in this work would benefit any workload
other than the one Paul is interested in. Still, I am sure Paul wants to
build on top of the existing infrastructure.

>
> > > > This is not the behavior we want. What we want is same address space while
> > > > being able to migrate system memory to device memory (who make that decision
> > > > should not be part of that discussion) while still gracefully handling any
> > > > CPU access.
> > >
> > > Well then there could be a situation where you have concurrent write
> > > access. How do you reconcile that then? Somehow you need to stall one or
> > > the other until the transaction is complete.
> >
> > Or have store buffers on one or both sides.
>
> Well if those store buffers end up with divergent contents then you have
> the problem of not being able to decide which version should survive. But
> from Jerome's response I deduce that this is avoided by only allow
> read-only access during migration. That is actually similar to what page
> migration does.

Yes, as said above, there is no change to the logic there; we do not want
divergent content at all.
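
The three low-level steps listed above (write protect & TLB flush, copy,
update page table & TLB flush) can be sketched in userspace C roughly as
follows. This is only a toy model under stated assumptions, not kernel code:
struct toy_pte, toy_tlb_flush() and the toy_migrate_*() helpers are
hypothetical names made up for illustration, the "TLB flush" is just a
counter, and a failed toy_write() stands in for the write fault that a real
CPU access would take.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

/* Hypothetical toy page table entry; this is NOT the kernel's pte_t. */
struct toy_pte {
    unsigned char *frame; /* backing storage currently mapped */
    bool writable;        /* false while a migration is in flight */
};

/* Number of simulated TLB flushes performed so far. */
static unsigned int toy_tlb_flushes;

/* Stand-in for a real TLB flush: only counts invocations. */
void toy_tlb_flush(void)
{
    toy_tlb_flushes++;
}

/* Write through the mapping; returning false models a write fault. */
bool toy_write(struct toy_pte *pte, size_t off, unsigned char val)
{
    if (!pte->writable || off >= TOY_PAGE_SIZE)
        return false;
    pte->frame[off] = val;
    return true;
}

/* Reads stay allowed during migration, so no content can diverge. */
bool toy_read(const struct toy_pte *pte, size_t off, unsigned char *out)
{
    if (off >= TOY_PAGE_SIZE)
        return false;
    *out = pte->frame[off];
    return true;
}

/* Step 1: write protect the mapping and flush stale writable entries. */
void toy_migrate_begin(struct toy_pte *pte)
{
    pte->writable = false;
    toy_tlb_flush();
}

/* Step 2: copy the page content to the destination frame. */
void toy_migrate_copy(const struct toy_pte *pte, unsigned char *dst)
{
    assert(dst != NULL && dst != pte->frame);
    memcpy(dst, pte->frame, TOY_PAGE_SIZE);
}

/* Step 3: point the mapping at the new frame, then flush again. */
void toy_migrate_finish(struct toy_pte *pte, unsigned char *dst)
{
    pte->frame = dst;
    pte->writable = true;
    toy_tlb_flush();
}
```

A caller would run toy_migrate_begin(), toy_migrate_copy() and
toy_migrate_finish() in that order. Between the first and last call,
toy_write() fails while toy_read() still succeeds, which is the read-only
window discussed in the quoted text.
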
The thing is, autonuma is a better fit for Paul because his platform is more
advanced, so he can allocate struct page for the device memory. In my case
that would be pointless, as the memory is not CPU accessible. This is why the
HMM patchset does not build on top of autonuma and the current page
migration, but still uses the same kind of logic.

>
> > > > This means if CPU access it we want to migrate memory back to system memory.
> > > > To achieve this there is no way around adding couple of if inside the mm
> > > > page fault code path. Now do you want each driver to add its own if branch
> > > > or do you want a common infrastructure to do just that ?
> > >
> > > If you can improve the page migration in general then we certainly would
> > > love that. Having faultless migration is certain a good thing for a lot of
> > > functionality that depends on page migration.
> >
> > We do have to start somewhere, though. If we insist on perfection for
> > all situations before we agree to make a change, we won't be making very
> > many changes, now will we?
>
> Improvements to the general code would be preferred instead of
> having specialized solutions for a particular hardware alone. If the
> general code can then handle the special coprocessor situation then we
> avoid a lot of code development.

I think Paul's only big change would be the memory ZONE changes: having a
way to add the device memory as struct page while blocking kernel
allocations from using this memory. Besides that, I think the autonuma
changes he would need would really be specific to his use case, but would
still reuse all of the low-level logic.

>
> > As I understand it, the trick (if you can call it that) is having the
> > device have the same memory-mapping capabilities as the CPUs.
>
> Well yes that works with read-only mappings. Maybe we can special case
> that in the page migration code? We do not need migration entries if
> access is read-only actually.

Duplicating read-only memory on the device is really an optimization that is
not critical to the whole. The common use case remains the migration of
read & write memory to device memory when that memory is mostly/only accessed
by the device.

Cheers,
Jérôme