From: "Aneesh Kumar K.V"
To: Jerome Glisse, Anshuman Khandual
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@suse.com,
    js1304@gmail.com, vbabka@suse.cz, mgorman@suse.de, minchan@kernel.org,
    akpm@linux-foundation.org, bsingharora@gmail.com
Subject: Re: [RFC 0/8] Define coherent device memory node
In-Reply-To: <20161024170902.GA5521@gmail.com>
References: <1477283517-2504-1-git-send-email-khandual@linux.vnet.ibm.com>
 <20161024170902.GA5521@gmail.com>
Date: Tue, 25 Oct 2016 10:29:38 +0530
Message-Id: <877f8xaurp.fsf@linux.vnet.ibm.com>

Jerome Glisse writes:

> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
>> [...]
>
>> Core kernel memory features like reclamation, eviction etc. might
>> need to be restricted or modified on the coherent device memory node as
>> they can be performance limiting. The RFC does not propose anything on
>> this yet but it can be looked into later on. For now it just disables
>> Auto NUMA for any VMA which has coherent device memory.
>>
>> Seamless integration of coherent device memory with system memory
>> will enable various other features, some of which are listed below.
>>
>> a. Seamless migrations between system RAM and the coherent memory
>> b. Asynchronous and high throughput migrations
>> c. Ability to allocate huge order pages from these memory regions
>> d. Restricting allocations, to a large extent, to the tasks using the
>> device for workload acceleration
>>
>> Before concluding, we will look into the reasons why the existing
>> solutions don't work. There are two basic requirements which have to be
>> satisfied before coherent device memory can be integrated with the core
>> kernel seamlessly.
>>
>> a. The PFN must have a struct page
>> b. The struct page must be able to sit on the standard LRU lists
>>
>> The above two basic requirements rule out the existing device memory
>> representation approaches listed below, hence the need to create a new
>> framework.
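Just to spell out what requirement (b) collides with: in the current
struct page layout the LRU linkage shares a union with the ZONE_DEVICE
dev_pagemap pointer (this is point (2)d further down). Abridged sketch of
the relevant fields, approximating include/linux/mm_types.h rather than
quoting it verbatim:

struct page {
	/* ... */
	union {
		struct list_head lru;		/* pageout list, e.g. active/inactive list */
		struct dev_pagemap *pgmap;	/* ZONE_DEVICE pages are never on an
						 * LRU; this points at the hosting
						 * device page map instead */
		/* ... other users of this union ... */
	};
	/* ... */
};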
>
> I do not believe the LRU list is a hard requirement. Yes, when faulting in
> a page inside the page cache it is assumed the page needs to be added to an
> LRU list, but I think this can easily be worked around.
>
> In HMM I am using ZONE_DEVICE, and because the memory is not accessible
> from the CPU (not everyone is blessed with a decent system bus like CAPI,
> CCIX, Gen-Z, ...), in my case a file-backed page must always be spawned
> first from a regular page, and only once it has been read from disk can I
> migrate it to a GPU page.
>
> So if you accept this intermediary step you can easily use ZONE_DEVICE for
> device memory. This way there is no LRU and no complex dance to keep the
> memory out of reach of the regular memory allocator.
>
> I think we would have much to gain if we pooled our efforts on a single
> common solution for device memory. In my case the device memory is not
> accessible by the CPU (because of PCIe restrictions); in your case it is.
> Thus the only difference is that in my case it cannot be mapped inside the
> CPU page table while in yours it can.
>
>>
>> (1) Traditional ioremap
>>
>> a. Memory is mapped into kernel (linear and virtual) and user space
>> b. These PFNs do not have struct pages associated with them
>> c. These special PFNs are marked with special flags inside the PTE
>> d. Cannot participate in core VM functions much because of this
>> e. Cannot do easy user space migrations
>>
>> (2) Zone ZONE_DEVICE
>>
>> a. Memory is mapped into kernel and user space
>> b. PFNs do have struct pages associated with them
>> c. These struct pages are allocated inside the zone's own memory range
>> d. Unfortunately the struct page union containing the LRU linkage has
>> been used for the struct dev_pagemap pointer
>> e. Hence these pages cannot be part of any LRU (like the page cache)
>> f. Hence file cached mappings cannot reside on these PFNs
>> g. Cannot do easy migrations
>>
>> I had also explored a non-LRU representation of this coherent device
>> memory where the integration with system RAM in the core VM is limited
>> only to the following functions. Not being on the LRU is definitely going
>> to reduce the scope of tight integration with system RAM.
>>
>> (1) Migration support between system RAM and coherent memory
>> (2) Migration support between various coherent memory nodes
>> (3) Isolation of the coherent memory
>> (4) Mapping the coherent memory into user space through the driver's
>> struct vm_operations
>> (5) HW poisoning of the coherent memory
>>
>> Allocating the entire memory of the coherent device node right
>> after hot plug into ZONE_MOVABLE (where the memory is already inside the
>> buddy system) will still expose a time window where other user space
>> allocations can come into the coherent device memory node and defeat the
>> intended isolation. So traditional hot plug is not the solution. Hence I
>> started looking into a CMA based non-LRU solution, but then hit the
>> following roadblocks.
>>
>> (1) CMA does not support hot plugging of a new memory node
>> a. CMA areas need to be marked during boot before the buddy allocator
>> is initialized
>> b. cma_alloc()/cma_release() can then happen on the marked area
>> c. We should be able to mark CMA areas just after memory hot plug
>> d. cma_alloc()/cma_release() can then happen later, after the hot plug
>> e. This is not currently supported
>>
>> (2) Mapped non-LRU migration of pages
>> a. Recent work from Minchan Kim makes non-LRU pages migratable
>> b. But it still does not support migration of mapped non-LRU pages
>> c. With a non-LRU CMA reservation, again there are some additional
>> challenges
>>
>> With hot pluggable CMA and non-LRU mapped migration support there
>> may be an alternate approach to representing coherent device memory.
>> Please do review this RFC proposal and let me know your comments or
>> suggestions. Thank you.
>
> You can take a look at hmm-v13 if you want to see how I do non-LRU page
> migration. While I put most of the migration code inside hmm_migrate.c, it
> could easily be moved to migrate.c without the hmm_ prefix.
>
> There are 2 missing pieces in the existing migrate code. The first is to
> put the memory allocation for the destination under the control of whoever
> calls the migrate code. The second is to allow offloading the copy
> operation to the device (i.e. not use the CPU to copy data).
>
> I believe the same requirements also make sense for the platform you are
> targeting, thus the same code can be used.
>
> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>
> I haven't posted this patchset yet because we are doing some modifications
> to the device driver API to accommodate some new features. But the
> ZONE_DEVICE changes and the overall migration code will stay the same more
> or less (I have patches that move it to migrate.c and share more code with
> the existing migrate code).
>
> If you think I missed anything about the LRU and page cache please point
> it out to me, because when I audited the code for that I didn't see any
> road block with the few filesystems I was looking at (ext4, xfs and the
> core page cache code).
>

The other restriction around ZONE_DEVICE is that it is not a managed zone,
which prevents any direct allocation from the coherent device memory by an
application. That is, we would like to be able to force allocation from the
coherent device memory using an interface like mbind(MPOL_BIND, ...). Is
that possible with ZONE_DEVICE?

-aneesh
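P.S. To make the question concrete, below is roughly the userspace usage we
have in mind. This is only an illustrative sketch: it assumes the coherent
device memory has been onlined as NUMA node 1 (a made-up node number) and
that the mbind() wrapper from libnuma's <numaif.h> is available (link with
-lnuma).

#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64MB working buffer for the accelerator */
	void *buf;
	unsigned long nodemask;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Bind the range to node 1 only (assumed here to be the coherent
	 * device memory node), so that faults must be satisfied from that
	 * node rather than from regular system RAM. */
	nodemask = 1UL << 1;
	if (mbind(buf, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, MPOL_MF_STRICT)) {
		perror("mbind");
		return 1;
	}

	memset(buf, 0, len);		/* touch the pages to fault them in */
	return 0;
}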