Date: Tue, 18 Jul 2017 22:25:38 -0400
From: Jerome Glisse <jglisse@redhat.com>
To: Bob Liu
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, John Hubbard,
    David Nellans, Dan Williams, Balbir Singh, Michal Hocko
Subject: Re: [PATCH 0/6] Cache coherent device memory (CDM) with HMM v5
Message-ID: <20170719022537.GA6911@redhat.com>
References: <20170713211532.970-1-jglisse@redhat.com>
    <2d534afc-28c5-4c81-c452-7e4c013ab4d0@huawei.com>
    <20170718153816.GA3135@redhat.com>

On Wed, Jul 19, 2017 at 09:46:10AM +0800, Bob Liu wrote:
> On 2017/7/18 23:38, Jerome Glisse wrote:
> > On Tue, Jul 18, 2017 at 11:26:51AM +0800, Bob Liu wrote:
> >> On 2017/7/14 5:15, Jérôme Glisse wrote:
> >>> Sorry, I made a horrible mistake with the names in v4; I completely
> >>> misunderstood the suggestion. So here I repost with proper naming.
> >>> This is the only change since v3.
> >>> Again, sorry about the noise with v4.
> >>>
> >>> Changes since v4:
> >>>   - s/DEVICE_HOST/DEVICE_PUBLIC
> >>>
> >>> Git tree:
> >>> https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-cdm-v5
> >>>
> >>> Cache coherent device memory applies to architectures with a system
> >>> bus like CAPI or CCIX. A device connected to such a system bus can
> >>> expose its memory to the system and allow cache coherent access to
> >>> it from the CPU.
> >>>
> >>> Even if, for all intents and purposes, device memory behaves like
> >>> regular memory, we still want to manage it in isolation from regular
> >>> memory, for several reasons. First and foremost, this memory is less
> >>> reliable than regular memory: if the device hangs because of invalid
> >>> commands, we can lose access to device memory. Second, CPU access to
> >>> this memory is expected to be slower than to regular memory. Third,
> >>> placing random memory on the device means that some of the bus
> >>> bandwidth would not be available to the device but would instead be
> >>> used by CPU accesses.
> >>>
> >>> This is why we want to manage such memory in isolation from regular
> >>> memory. The kernel should not try to use this memory, even as a last
> >>> resort when running out of memory, at least for now.
> >>>
> >>
> >> I think setting a very large node distance for "Cache Coherent Device
> >> Memory" may be an easier way to address these concerns.
> >
> > Such an approach was discussed at length in the past; see the links
> > below. Outcome of the discussion:
> >   - CPU-less nodes are bad
> >   - device memory can be unreliable (device hang), with no way for the
> >     application to understand that
>
> Device memory can also be more reliable if using high quality and
> expensive memory.

Even ECC memory does not compensate for a device hang. When your GPU
locks up, you might need to re-init the GPU from scratch, after which
the content of the device memory is unreliable.
During init the device memory might not get a proper clock or proper
refresh cycles and is thus susceptible to corruption.

> >   - application and driver NUMA madvise/mbind/mempolicy ... can
> >     conflict with each other, and there is no way the kernel can
> >     figure out which should apply
> >   - NUMA as it is now would not work, as we need further isolation
> >     than what a large node distance would provide
>
> Agree, that's where we need to spend time.
>
> One drawback of HMM-CDM I'm worried about is one more extra copy.
> In the cache coherent case, the CPU can write data to device memory
> directly, then start the FPGA/GPU/other accelerator.

There is not necessarily an extra copy. The device driver can
pre-allocate a virtual address range of a process with device memory.
A device page fault can directly allocate device memory. Once
allocated, CPU accesses will use the device memory.

There are plans to allow other allocations (CPU page fault, file
cache, ...) to also use device memory directly. We just don't know yet
what kind of userspace API will fit best for that, so at first it
might be hidden behind device-driver-specific ioctls.

Jérôme
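The zero-copy flow described above can be sketched from userspace. This is
a purely hypothetical illustration: the device node "/dev/accel0" and the
ACCEL_IOC_LAUNCH ioctl are made-up stand-ins for whatever interface a given
driver would expose; with HMM-CDM the mapping could just as well be plain
anonymous memory that the driver migrates to device memory on fault.

```
/* Hypothetical sketch: the CPU writes straight into cache coherent
 * device memory, then kicks the accelerator -- no staging copy.
 * "/dev/accel0" and ACCEL_IOC_LAUNCH are invented for illustration. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define ACCEL_IOC_LAUNCH _IOW('A', 0, uint64_t)   /* made-up ioctl */

int run_job(const void *input, size_t len)
{
        int fd = open("/dev/accel0", O_RDWR);     /* hypothetical node */
        if (fd < 0)
                return -1;

        /* With CDM the mapping is backed by device memory but is cache
         * coherent, so plain CPU stores land in it directly. */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) {
                close(fd);
                return -1;
        }

        memcpy(buf, input, len);          /* CPU writes device memory */
        ioctl(fd, ACCEL_IOC_LAUNCH, 0);   /* start the accelerator */

        munmap(buf, len);
        close(fd);
        return 0;
}
```

The same sketch with a migration-on-fault driver would drop the device-file
mmap entirely: the process touches anonymous memory, the driver migrates
those pages to device memory, and subsequent CPU accesses stay coherent.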