Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide memory pin
From: David Hildenbrand
Organization: Red Hat GmbH
Date: Mon, 8 Feb 2021 11:37:24 +0100
To: Song Bao Hua (Barry Song), Matthew Wilcox
Cc: Wangzhou (B), linux-kernel@vger.kernel.org,
    iommu@lists.linux-foundation.org, linux-mm@kvack.org,
    linux-arm-kernel@lists.infradead.org, linux-api@vger.kernel.org,
    Andrew Morton, Alexander Viro, gregkh@linuxfoundation.org,
    jgg@ziepe.ca, kevin.tian@intel.com, jean-philippe@linaro.org,
    eric.auger@redhat.com, Liguozhu (Kenneth), zhangfei.gao@linaro.org,
    chensihang (A)
References: <1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com>
            <1612685884-19514-2-git-send-email-wangzhou1@hisilicon.com>
            <20210207213409.GL308988@casper.infradead.org>
            <20210208013056.GM308988@casper.infradead.org>

On 08.02.21 11:13, Song Bao Hua (Barry Song) wrote:
>
>> -----Original Message-----
>> From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org]
>> On Behalf Of David Hildenbrand
>> Sent: Monday, February 8, 2021 9:22 PM
>> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide
>> memory pin
>>
>> On 08.02.21 03:27, Song Bao Hua (Barry Song) wrote:
>>>
>>>> -----Original Message-----
>>>> From: owner-linux-mm@kvack.org [mailto:owner-linux-mm@kvack.org]
>>>> On Behalf Of Matthew Wilcox
>>>> Sent: Monday, February 8, 2021 2:31 PM
>>>> Subject: Re: [RFC PATCH v3 1/2] mempinfd: Add new syscall to provide
>>>> memory pin
>>>>
>>>> On Sun, Feb 07, 2021 at 10:24:28PM +0000, Song Bao Hua (Barry Song) wrote:
>>>>>>> In high-performance I/O cases, accelerators may want to perform
>>>>>>> I/O on memory without taking I/O page faults, which can increase
>>>>>>> latency dramatically. Current memory-related APIs cannot meet this
>>>>>>> requirement; e.g., mlock can only prevent memory from being
>>>>>>> swapped out to a backing device, while page migration can still
>>>>>>> trigger I/O page faults.
>>>>>>
>>>>>> Well ... we have two requirements. The application wants to not take
>>>>>> page faults. The system wants to move the application to a different
>>>>>> NUMA node in order to optimise overall performance. Why should the
>>>>>> application's desires take precedence over the kernel's desires? And
>>>>>> why should it be done this way rather than by the sysadmin using
>>>>>> numactl to lock the application to a particular node?
>>>>>
>>>>> The NUMA balancer is just one of many causes of page migration. Even
>>>>> a simple alloc_pages() call can cause memory migration within a
>>>>> single NUMA node or on a UMA system.
>>>>>
>>>>> The other causes of page migration include, but are not limited to:
>>>>> * memory movement due to CMA
>>>>> * memory movement due to huge page creation
>>>>>
>>>>> We can hardly ask users to disable compaction, CMA, and huge pages
>>>>> across the whole system.
>>>>
>>>> You're dodging the question. Should the CMA allocation fail because
>>>> another application is using SVA?
>>>>
>>>> I would say no.
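To make the premise being debated here concrete: mlock(2) guarantees
residency, not immobility. The minimal sketch below is an illustration
added for clarity, not code from the patch series. It mlocks a page and
then migrates it anyway via move_pages(2); NUMA balancing or compaction
can move the page just as easily behind the application's back, and that
movement is what re-triggers I/O page faults for an SVA device. It
assumes a Linux host with libnuma installed and a NUMA node 0; build
with "gcc demo.c -lnuma".

#include <numaif.h>	/* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	void *page;

	if (posix_memalign(&page, psz, psz))
		return 1;
	memset(page, 0x5a, psz);	/* fault the page in */

	if (mlock(page, psz))		/* resident from now on ... */
		perror("mlock");

	int node = 0, status = -1;

	/* ... yet the kernel still happily migrates it elsewhere. */
	if (move_pages(0 /* self */, 1, &page, &node, &status, MPOL_MF_MOVE))
		perror("move_pages");
	else
		printf("mlocked page is now on node %d\n", status);

	return 0;
}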
>>>
>>> I would say no as well.
>>>
>>> While the IOMMU is enabled, CMA has almost only one user, the IOMMU
>>> driver, since other drivers depend on the IOMMU to use non-contiguous
>>> memory even though they still call dma_alloc_coherent().
>>>
>>> In the IOMMU driver, dma_alloc_coherent() is called during
>>> initialization and there are no new allocations afterwards, so it
>>> wouldn't have a runtime impact on SVA performance. Even if there were
>>> new allocations, CMA would fall back to the general alloc_pages(),
>>> and IOMMU drivers mostly allocate only small amounts of memory for
>>> command queues.
>>>
>>> So I would say that general compound pages and huge pages, especially
>>> transparent huge pages, are bigger concerns than CMA for internal
>>> page migration within one NUMA node.
>>>
>>> Unlike CMA, the general alloc_pages() can get memory by moving pages
>>> other than the pinned ones.
>>>
>>> And there is no guarantee that we can always bind the memory of SVA
>>> applications to a single NUMA node, so NUMA balancing is still a
>>> concern.
>>>
>>> But I agree that we need a way to let CMA succeed while userspace
>>> pages are pinned. Since pinning has become pervasive in many drivers,
>>> I assume there is a way to handle this. Otherwise, APIs like
>>> V4L2_MEMORY_USERPTR [1] could make CMA fail, as there is no guarantee
>>> that userspace will allocate unmovable memory, and no guarantee that
>>> the fallback path, alloc_pages(), can satisfy large allocations.
>>>
>>
>> Long term pinnings cannot go onto CMA-reserved memory, and there is
>> similar work to also fix ZONE_MOVABLE in that regard.
>>
>> https://lkml.kernel.org/r/20210125194751.1275316-1-pasha.tatashin@soleen.com
>>
>> That is one of the reasons I detest long term pinning of pages where
>> it can be avoided. Take VFIO and RDMA as examples: these currently
>> can't work without it.
>>
>> What I read here is: "DMA performance will be affected severely."
>> That does not sound like a compelling argument to me for long term
>> pinnings. Please find another way to achieve the same goal without
>> long term pinnings controlled by user space - e.g., by controlling
>> when migration actually happens.
>>
>> For example, CMA/alloc_contig_range()/memory unplug are corner cases
>> that happen rarely, so you shouldn't have to worry about them messing
>> with your DMA performance.
>
> I agree that CMA/alloc_contig_range()/memory unplug are corner cases;
> the major cases are THP and NUMA balancing. We could disable both
> entirely, but it seems unreasonable to do that only because one
> process in the system is using SVA.

Can't you use huge pages in your application that uses SVA and prevent
THP/NUMA balancing from kicking in?

-- 
Thanks,

David / dhildenb
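A minimal sketch of the suggestion above, for reference: back the SVA
buffer with explicit hugetlb pages rather than relying on THP. hugetlb
memory sits outside the normal LRU, is not handled by THP, and is
skipped by automatic NUMA balancing, so the migrations discussed in
this thread generally do not apply to it - without any long term
pinning. This illustration assumes the admin has reserved huge pages up
front (e.g. "echo 64 > /proc/sys/vm/nr_hugepages") and a default huge
page size of 2 MiB.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 16UL << 21;	/* 32 MiB, a multiple of 2 MiB */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");	/* pool empty or unsupported */
		return 1;
	}

	memset(buf, 0, len);	/* fault all huge pages in up front */

	/* ... hand 'buf' to the SVA-capable accelerator here ... */
	printf("hugetlb-backed buffer at %p\n", buf);

	munmap(buf, len);
	return 0;
}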