Date: Thu, 1 Jun 2017 15:45:22 +0200
From: Andrea Arcangeli
To: Michal Hocko
Cc: Mike Rapoport, Vlastimil Babka, "Kirill A. Shutemov", Andrew Morton,
    Arnd Bergmann, "Kirill A. Shutemov", Pavel Emelyanov, linux-mm, lkml,
    Linux API
Subject: Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
Message-ID: <20170601134522.GE302@redhat.com>
In-Reply-To: <20170601080909.GD32677@dhcp22.suse.cz>
References: <20170524103947.GC3063@rapoport-lnx>
    <20170524111800.GD14733@dhcp22.suse.cz>
    <20170524142735.GF3063@rapoport-lnx>
    <20170530074408.GA7969@dhcp22.suse.cz>
    <20170530101921.GA25738@rapoport-lnx>
    <20170530103930.GB7969@dhcp22.suse.cz>
    <20170530140456.GA8412@redhat.com>
    <20170530143941.GK7969@dhcp22.suse.cz>
    <20170601065302.GA30495@rapoport-lnx>
    <20170601080909.GD32677@dhcp22.suse.cz>

On Thu, Jun 01, 2017 at 10:09:09AM +0200, Michal Hocko wrote:
> That is a bit surprising. I didn't think that the userfault syscall
> (ioctl) can be faster than a regular #PF but considering that
> __mcopy_atomic bypasses the page fault path and it can be optimized for
> the anon case suggests that we can save some cycles for each page and so
> the cumulative savings can be visible.

__mcopy_atomic works not just for anonymous memory: hugetlbfs and shmem
are covered too, and there are branches to handle those.

If you were to run more than one precopy pass, UFFDIO_COPY would become
slower than the userland access starting from the second pass.

In light of this, if CRIU can only do a single pass of precopy, CRIU
is probably better off using UFFDIO_COPY than using prctl or madvise
to temporarily turn off THP.

With QEMU, by contrast, we set MADV_HUGEPAGE during precopy on the
destination to maximize THP utilization for all those 2M naturally
aligned guest regions that aren't re-dirtied on the source. So we're
better off without UFFDIO_COPY in precopy, even during the first
pass, to avoid entering the kernel for subpages that are written on
the destination into an already instantiated THP. At least until we
teach QEMU to map 2M at once if possible (UFFDIO_COPY would then also
require an enhancement, because currently it won't map THP on the fly).

Thanks,
Andrea
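
To make the "2M naturally aligned regions" point concrete, here is a
minimal sketch of how userland could find the THP-eligible part of a
range and hint it with MADV_HUGEPAGE. It is not QEMU's code. The names
thp_aligned_span() and advise_hugepage() are hypothetical, and the 2 MiB
THP size is an assumption that holds on x86-64 with 4 KiB base pages.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: 2 MiB THP size (x86-64 with 4 KiB base pages). */
#define THP_SIZE ((uintptr_t)2 << 20)

struct thp_span {
    uintptr_t start;  /* first 2 MiB-aligned address at or after the range start */
    size_t    count;  /* number of whole 2 MiB regions inside the range */
};

/* Hypothetical helper: find the naturally aligned 2 MiB regions that lie
 * entirely inside [addr, addr + len). */
static struct thp_span thp_aligned_span(uintptr_t addr, size_t len)
{
    struct thp_span span = { 0, 0 };
    uintptr_t end = addr + len;
    uintptr_t first = (addr + THP_SIZE - 1) & ~(THP_SIZE - 1); /* round up */
    uintptr_t last = end & ~(THP_SIZE - 1);                    /* round down */

    assert(end >= addr); /* the range must not wrap around */
    span.start = first;
    if (last > first)
        span.count = (last - first) / THP_SIZE;
    return span;
}

/* Hypothetical helper: request THP for the aligned part of a mapping.
 * Returns 0 on success, -1 if no whole 2 MiB region fits or if
 * madvise() fails (for example on a kernel built without THP). */
static int advise_hugepage(void *addr, size_t len)
{
    struct thp_span span = thp_aligned_span((uintptr_t)addr, len);

    if (span.count == 0)
        return -1;
#ifdef MADV_HUGEPAGE
    return madvise((void *)span.start, span.count * THP_SIZE, MADV_HUGEPAGE);
#else
    return -1;
#endif
}
```

Calling madvise() on the whole mapping would also work, since only the
aligned 2M regions can be backed by THP anyway. The helper just makes
explicit which part of a range can benefit from the hint.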