Date: Wed, 30 Jan 2019 17:23:02 +0800
From: Peter Xu
To: Mike Rapoport
Cc: Andrea Arcangeli, lsf-pc@lists.linux-foundation.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, Blake Caldwell,
    Mike Rapoport, Mike Kravetz, Michal Hocko, Mel Gorman,
    Vlastimil Babka, David Rientjes, Andrei Vagin, Pavel Emelyanov
Subject: Re: [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE)
Message-ID: <20190130092302.GA25119@xz-x1>
References: <20190129234058.GH31695@redhat.com> <20190130081336.GC17937@rapoport-lnx>
In-Reply-To: <20190130081336.GC17937@rapoport-lnx>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jan 30, 2019 at 10:13:36AM +0200, Mike Rapoport wrote:
> Hi,
>
> (changed the subject and added CRIU folks)
>
> On Tue, Jan 29, 2019 at 06:40:58PM -0500, Andrea Arcangeli wrote:
> > Hello,
> >
> > --
> >
> > In addition to the above "NUMA remote THP vs NUMA local non-THP
> > tradeoff" topic, there are other developments in "userfaultfd" land that
> > are approaching merge readiness and that would be possible to provide a
> > short overview about:
> >
> > - Peter Xu made significant progress in finalizing the userfaultfd-WP
> >   support over the last few months. That feature was planned from the
> >   start and it will allow userland to do some new things that weren't
> >   possible to achieve before. In addition to synchronously blocking
> >   write faults to be resolved by a userland manager, it also has the
> >   ability to obsolete the softdirty feature, because it can provide
> >   the same information, but with O(1) complexity (as opposed to the
> >   current softdirty O(N) complexity) similarly to what the Page
> >   Modification Logging (PML) does in hardware for EPT write accesses.
>
> We (CRIU) have some concerns about obsoleting soft-dirty in favor of
> uffd-wp. If there are other soft-dirty users these concerns would be
> relevant to them as well.
>
> With soft-dirty we collect the information about the changed memory every
> pre-dump iteration in the following manner:
> * freeze the tasks
> * find entries in /proc/pid/pagemap with SOFT_DIRTY set
> * unfreeze the tasks
> * dump the modified pages to disk/remote host
>
> While we do need to traverse the /proc/pid/pagemap to identify dirty pages,
> in between the pre-dump iterations and during the actual memory dump the
> tasks are running freely.
>
> If we are to switch to uffd-wp, every write by the snapshotted/migrated
> task will incur latency of uffd-wp processing by the monitor.
>
> We'd need to see how this affects overall slowdown of the workload under
> migration before moving forward with obsoleting soft-dirty.
>
> > - Blake Caldwell maintained the UFFDIO_REMAP support to atomically
> >   remove memory from a mapping with userfaultfd (which can't be done
> >   with a copy as in UFFDIO_COPY and it requires a slow TLB flush to be
> >   safe) as an alternative to host swapping (which of course also
> >   requires a TLB flush for similar reasons). Notably UFFDIO_REMAP was
> >   rightfully nacked early on and quickly replaced by UFFDIO_COPY which
> >   is more optimal to add memory to a mapping in small chunks, but we
> >   can't remove memory with UFFDIO_COPY and UFFDIO_REMAP should be as
> >   efficient as it gets when it comes to removing memory from a
> >   mapping.
>
> If we are to discuss userfaultfd, I'd like also to bring the subject of COW
> mappings.
> The pages populated with UFFDIO_COPY cannot be COW-shared between related
> processes which unnecessarily increases memory footprint of a migrated
> process tree.
> I've posted a patch [1] a (real) while ago, but nobody reacted and I've put
> this aside.
> Maybe it's time to discuss it again :)

Hi, Mike,

It's interesting to learn about this work... I really don't have much
context on this, so sorry if I'm going to ask a silly question, but
reading this made me think of KSM.
I think KSM does not suit this case: UFFDIO_COPY_COW would carry hinting
information, while KSM only scans over the pages of the processes, which
seems to be O(N*N) even assuming there are just two processes.

However, would it make any sense to provide a general interface that
scans for identical pages between any two processes within a specific
range and merges them if found (rather than an interface specific to
userfaultfd)? It might then even be used by KSM admins (just as an
example) when the admin knows that memory range (addr1, len) of process
A very probably has much of the same content as memory range
(addr2, len) of process B.

Thanks,

-- 
Peter Xu
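
The range-pair idea above can be illustrated with a small userspace sketch. This is only an assumption-laden illustration of the comparison step, not a kernel interface and not the patch discussed in the thread: the function name `find_mergeable_pages`, the fixed 4096-byte page size, and the use of in-memory byte buffers in place of real process memory are all hypothetical. It shows why a hint of the form (addr1, len) and (addr2, len) keeps the work linear in the number of pages: each page is compared only against the page at the same offset in the other range.

```python
from typing import List

# Assumed page size for the illustration; real code would query the system.
PAGE_SIZE = 4096


def find_mergeable_pages(range_a: bytes, range_b: bytes,
                         page_size: int = PAGE_SIZE) -> List[int]:
    """Return indexes of pages that are byte-identical in both ranges.

    Hypothetical helper: the two buffers stand in for memory range
    (addr1, len) of process A and (addr2, len) of process B.
    """
    if len(range_a) != len(range_b):
        raise ValueError("both ranges must have the same length")
    if len(range_a) % page_size:
        raise ValueError("range length must be a multiple of the page size")

    # memoryview slices avoid copying each page before comparing it.
    view_a, view_b = memoryview(range_a), memoryview(range_b)
    matches = []
    for index in range(len(range_a) // page_size):
        start = index * page_size
        # One comparison per page offset: O(N) pages, no cross-product scan.
        if view_a[start:start + page_size] == view_b[start:start + page_size]:
            matches.append(index)
    return matches


if __name__ == "__main__":
    a = bytes(PAGE_SIZE) + b"\x01" * PAGE_SIZE + bytes(PAGE_SIZE)
    b = bytes(PAGE_SIZE) + b"\x02" * PAGE_SIZE + bytes(PAGE_SIZE)
    print(find_mergeable_pages(a, b))  # pages 0 and 2 are identical
```

A real implementation would additionally have to merge the matching pages copy-on-write and handle pages changing during the scan, which this sketch does not attempt.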