Date: Tue, 25 Oct 2016 11:32:56 -0400
From: Jerome Glisse
To: "Aneesh Kumar K.V"
Cc: Anshuman Khandual, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    mhocko@suse.com, js1304@gmail.com, vbabka@suse.cz, mgorman@suse.de,
    minchan@kernel.org, akpm@linux-foundation.org, bsingharora@gmail.com
Subject: Re: [RFC 0/8] Define coherent device memory node

On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse writes:
> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:

[...]

> > You can take a look at hmm-v13 if you want to see how I do non-LRU
> > page migration. While I put most of the migration code inside
> > hmm_migrate.c, it could easily be moved to migrate.c without the
> > hmm_ prefix.
> >
> > There are two missing pieces in the existing migrate code. The first
> > is to put memory allocation for the destination under the control of
> > whoever calls the migrate code. The second is to allow offloading the
> > copy operation to the device (i.e. not using the CPU to copy data).
> >
> > I believe the same requirements also make sense for the platform you
> > are targeting, so the same code can be used.
> >
> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
> >
> > I haven't posted this patchset yet because we are doing some
> > modifications to the device driver API to accommodate some new
> > features. But the ZONE_DEVICE changes and the overall migration code
> > will stay more or less the same (I have patches that move it to
> > migrate.c and share more code with the existing migrate code).
> >
> > If you think I missed anything about LRU and the page cache, please
> > point it out to me; when I audited the code for that I didn't see
> > any roadblock in the few filesystems I was looking at (ext4, XFS and
> > the core page cache code).
>
> The other restriction around ZONE_DEVICE is that it is not a managed
> zone. That prevents any direct allocation from the coherent device by
> an application, i.e. we would like to force allocation from the
> coherent device using an interface like mbind(MPOL_BIND). Is that
> possible with ZONE_DEVICE?

To achieve this we rely on the device fault code path: when the device
takes a page fault, HMM will use the existing memory, if any, for the
fault address; but if the CPU page table is empty (and it is not a
file-backed vma, because of read-back), then the device can directly
allocate device memory and HMM will update the CPU page table to point
to the newly allocated device memory.
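To make that concrete, here is a rough sketch of the policy that fault
path implements. Every name below (dev_mirror, dev_fault_handle,
dev_mem_alloc_page, hmm_update_cpu_page_table, ...) is made up for
illustration; it is not the actual hmm-v13 API.

/*
 * Hypothetical device fault handler: reuse system memory when the
 * CPU page table is already populated, otherwise allocate device
 * memory directly and let HMM point the CPU page table at it.
 */
static int dev_fault_handle(struct dev_mirror *mirror,
			    struct vm_area_struct *vma,
			    unsigned long addr)
{
	struct page *page;

	/* File-backed vmas stay in system memory (page cache, read-back). */
	if (vma->vm_file)
		return dev_map_system_memory(mirror, addr);

	/* CPU already has a page mapped here: mirror it, do not migrate. */
	page = dev_lookup_cpu_page(mirror, addr);
	if (page)
		return dev_map_system_page(mirror, addr, page);

	/* Empty CPU page table: allocate device memory directly. */
	page = dev_mem_alloc_page(mirror->dev);
	if (!page)
		return dev_map_system_memory(mirror, addr);

	/* HMM updates the CPU page table to point at the device page. */
	return hmm_update_cpu_page_table(mirror, addr, page);
}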
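And on the two missing pieces in the existing migrate code quoted
above, the shape I have in mind is roughly the following (again a
hypothetical sketch, not the code that is in hmm-v13):

/*
 * Hypothetical callbacks so the caller of the migrate code controls
 * destination allocation and can offload the copy to the device.
 */
struct migrate_dev_ops {
	/* Caller (the device driver) picks where destination pages come from. */
	struct page *(*alloc_dst)(struct page *src, unsigned long addr,
				  void *private);
	/* Copy may be done by a device DMA engine instead of the CPU. */
	int (*copy_page)(struct page *dst, struct page *src, void *private);
};

int migrate_range_to_device(struct mm_struct *mm,
			    unsigned long start, unsigned long end,
			    const struct migrate_dev_ops *ops,
			    void *private);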
So in fact I am not using an existing kernel API to achieve this
allocation policy; the whole policy of where and what to allocate is
the device driver's responsibility, and the device driver leverages
its existing userspace API to get proper hints/directions from the
application.

Device memory is really a special case in my view. It only makes sense
to use it if the memory is actively accessed by the device, and the
only way the device accesses memory is when it is programmed to do so
through the device driver API. There is no such thing as GPU threads
in the kernel, and there is no way to spawn or move a worker thread to
the GPU. These are specialized devices and they require special
per-device APIs.

So in my view, using an existing kernel API such as mbind() is
counterproductive. You might have buggy software that mbinds its
memory to the device and never uses the device, which leads to device
memory being wasted on a process that never uses the device.

So my opinion is that you should not try to use existing kernel APIs
to get policy information from userspace, but let the device driver
gather such policy through its own private API (a sketch of what such
a driver-private hint could look like is below).
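For illustration, a driver-private hint could be as simple as the
following made-up uapi; none of these names come from any existing
driver.

#include <linux/ioctl.h>
#include <linux/types.h>

/*
 * Hypothetical per-device uapi: the application hints that a range
 * should live in device memory because the device is about to use it.
 */
struct dev_mem_hint {
	__u64 addr;		/* start of the range */
	__u64 size;		/* length of the range */
	__u32 flags;		/* e.g. DEV_MEM_HINT_PREFER_DEVICE */
	__u32 pad;
};

#define DEV_MEM_HINT_PREFER_DEVICE	(1u << 0)

#define DEV_IOCTL_MEM_HINT	_IOW('x', 0x01, struct dev_mem_hint)

Because the hint goes through the device driver, it can be tied to
actual work submission, so device memory only ever backs ranges for
processes that really use the device.

Cheers,
Jérôme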