Date: Wed, 30 Jan 2019 09:43:04 -0500
From: Andrea Arcangeli
To: Mike Rapoport
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Peter Xu, Blake Caldwell, Mike Rapoport, Mike Kravetz, Michal Hocko, Mel Gorman, Vlastimil Babka, David Rientjes, Andrei Vagin, Pavel Emelyanov
Subject: Re: [LSF/MM TOPIC]: userfaultfd (was: [LSF/MM TOPIC] NUMA remote THP vs NUMA local non-THP under MADV_HUGEPAGE)
Message-ID: <20190130144304.GA19021@redhat.com>
References: <20190129234058.GH31695@redhat.com> <20190130081336.GC17937@rapoport-lnx>
In-Reply-To: <20190130081336.GC17937@rapoport-lnx>
X-Mailing-List: linux-kernel@vger.kernel.org

Hello Mike,

On Wed, Jan 30, 2019 at 10:13:36AM +0200, Mike Rapoport wrote:
> We (CRIU) have some concerns about obsoleting soft-dirty in favor of
> uffd-wp. If there are other soft-dirty users these concerns would be
> relevant to them as well.
>
> With soft-dirty we collect the information about the changed memory every
> pre-dump iteration in the following manner:
> * freeze the tasks
> * find entries in /proc/pid/pagemap with SOFT_DIRTY set
> * unfreeze the tasks
> * dump the modified pages to disk/remote host
>
> While we do need to traverse the /proc/pid/pagemap to identify dirty pages,
> in between the pre-dump iterations and during the actual memory dump the
> tasks are running freely.
>
> If we are to switch to uffd-wp, every write by the snapshotted/migrated
> task will incur latency of uffd-wp processing by the monitor.

That's a valid concern indeed.

I didn't go into the details of which additional feature is needed on
top of what is already present in Peter's current patchset, but you're
correct that for a softdirty equivalent to perform well, we'll also
need to add an async event model.

The async event model would be set during UFFD registration. It'd work
like async signals: you just queue up uffd events in the kernel by
allocating each of them as a slab object (not on the kernel stack of
the faulting process). Only if the monitor doesn't read() them fast
enough will the write protect fault eventually block and release the
mmap_sem, but the page fault would always be resolved by the kernel
even in that case.
For the monitor there'll be just a stream of uffd_msg structures to
read, in multiples of the uffd_msg structure size, with a single
syscall per wakeup of the monitor. Conceptually it'd work the same as
how PML works for EPT.

The main downside will be an allocation per fault (soft dirty doesn't
need to do such an allocation), but no userland round-trip latency
will be added to the wrprotect fault that needs to be logged.

We need the synchronous/blocking uffd-wp for other things that aren't
related to soft dirty and that can't be achieved with an async model
like softdirty. Adding an async model later would be a self-contained
feature inside uffd.

So the idea would be to ignore any comparison with softdirty until
uffd-wp is finalized, and then evaluate the possibility of adding an
async model, which would be a simple thing to add compared to the
uffd-wp feature itself.

The theoretical expectation would be that softdirty would perform
better for small processes (but for those the overall logging overhead
is small anyway), but when it gets to the hundreds-of-gigabytes or
terabytes range, async uffd-wp should perform much better.

Thanks,
Andrea
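
To make the monitor side of the proposed async model concrete, here is a rough sketch of how one batched read() could be turned into a list of dirty pages. It is only an illustration: the async mode is a proposal in this mail, and the sketch assumes its events would arrive in the existing page-fault layout of struct uffd_msg from <linux/userfaultfd.h>. The name dirty_pages and the 4 KiB PAGE_SIZE are made up for the example.

```python
import struct

# struct uffd_msg as laid out for a page-fault event: event, three
# reserved fields, pagefault.flags, pagefault.address, and the 8-byte
# "feat" union. The struct is packed, so no padding is expected.
UFFD_MSG = struct.Struct("=BBHIQQQ")   # 32 bytes, native byte order
UFFD_EVENT_PAGEFAULT = 0x12
UFFD_PAGEFAULT_FLAG_WP = 1 << 1
PAGE_SIZE = 4096                       # assumption for this sketch


def dirty_pages(buf):
    """Return sorted page addresses that took a write-protect fault.

    buf is whatever a single read() on the uffd returned; a trailing
    partial message (which should not normally occur) is ignored.
    """
    usable = len(buf) - len(buf) % UFFD_MSG.size
    pages = set()
    for event, _r1, _r2, _r3, flags, address, _feat in UFFD_MSG.iter_unpack(buf[:usable]):
        if event == UFFD_EVENT_PAGEFAULT and flags & UFFD_PAGEFAULT_FLAG_WP:
            # Round down to the page boundary and de-duplicate.
            pages.add(address & ~(PAGE_SIZE - 1))
    return sorted(pages)
```

A monitor would call something like dirty_pages(os.read(uffd, 64 * UFFD_MSG.size)) once per wakeup, so many logged faults cost one syscall. Opening and registering the uffd is outside the scope of this sketch.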