Date: Thu, 6 Sep 2018 17:08:42 +0300
From: "Kirill A. Shutemov"
To: Peter Xu
Cc: Zi Yan, linux-kernel@vger.kernel.org, Andrea Arcangeli, Andrew Morton,
	Michal Hocko, Huang Ying, Dan Williams, Naoya Horiguchi,
	Jérôme Glisse, "Aneesh Kumar K.V", Konstantin Khlebnikov,
	Souptick Joarder, linux-mm@kvack.org
Subject: Re: [PATCH] mm: hugepage: mark splitted page dirty when needed
Message-ID: <20180906140842.jzf7tluzocb5nv3f@kshutemo-mobl1>
References: <20180904075510.22338-1-peterx@redhat.com>
 <20180904080115.o2zj4mlo7yzjdqfl@kshutemo-mobl1>
 <20180905073037.GA23021@xz-x1>
 <20180905125522.x2puwfn5sr2zo3go@kshutemo-mobl1>
 <20180906113933.GG16937@xz-x1>
In-Reply-To: <20180906113933.GG16937@xz-x1>
User-Agent: NeoMutt/20180716

On Thu, Sep 06, 2018 at 07:39:33PM +0800, Peter Xu wrote:
> On Wed, Sep 05, 2018 at 03:55:22PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Sep 05, 2018 at 03:30:37PM +0800, Peter Xu wrote:
> > > On Tue, Sep 04, 2018 at 10:00:28AM -0400, Zi Yan wrote:
> > > > On 4 Sep 2018, at 4:01, Kirill A. Shutemov wrote:
> > > >
> > > > > On Tue, Sep 04, 2018 at 03:55:10PM +0800, Peter Xu wrote:
> > > > >> When splitting a huge page, we should set all small pages as
> > > > >> dirty if the original huge page has the dirty bit set before.
> > > > >> Otherwise we'll lose the original dirty bit.
> > > > >
> > > > > We don't lose it. It got transferred to the struct page flag:
> > > > >
> > > > > 	if (pmd_dirty(old_pmd))
> > > > > 		SetPageDirty(page);
> > > > >
> > > > Plus, when split_huge_page_to_list() splits a THP, its subroutine
> > > > __split_huge_page() propagates the dirty bit in the head page flag
> > > > to all subpages in __split_huge_page_tail().
> > >
> > > Hi, Kirill, Zi,
> > >
> > > Thanks for your responses!
> > >
> > > Though in my test the huge page seems to be split not by
> > > split_huge_page_to_list() but by explicit calls to
> > > change_protection(). The stack looks like this (again, this is a
> > > customized kernel, and I added an explicit dump_stack() there):
> > >
> > > kernel: dump_stack+0x5c/0x7b
> > > kernel: __split_huge_pmd+0x192/0xdc0
> > > kernel: ? update_load_avg+0x8b/0x550
> > > kernel: ? update_load_avg+0x8b/0x550
> > > kernel: ? account_entity_enqueue+0xc5/0xf0
> > > kernel: ? enqueue_entity+0x112/0x650
> > > kernel: change_protection+0x3a2/0xab0
> > > kernel: mwriteprotect_range+0xdd/0x110
> > > kernel: userfaultfd_ioctl+0x50b/0x1210
> > > kernel: ? do_futex+0x2cf/0xb20
> > > kernel: ? tty_write+0x1d2/0x2f0
> > > kernel: ? do_vfs_ioctl+0x9f/0x610
> > > kernel: do_vfs_ioctl+0x9f/0x610
> > > kernel: ? __x64_sys_futex+0x88/0x180
> > > kernel: ksys_ioctl+0x70/0x80
> > > kernel: __x64_sys_ioctl+0x16/0x20
> > > kernel: do_syscall_64+0x55/0x150
> > > kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >
> > > At the very time the userspace is sending an UFFDIO_WRITEPROTECT
> > > ioctl to kernel space, which is handled by mwriteprotect_range().
> > > In case you'd like to refer to the kernel, it's basically this one
> > > from Andrea's tree (with very trivial changes):
> > >
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git userfault
> > >
> > > So... do we have two paths to split the huge pages separately?
> >
> > We have two entities that can be split: page table entries and the
> > underlying compound page.
> >
> > split_huge_pmd() (and variants of it) splits the PMD entry into a PTE
> > page table. It doesn't touch the underlying compound page. The page
> > can still be mapped elsewhere as huge.
> >
> > split_huge_page() (and variants of it) splits the compound page into
> > a number of 4k pages (or whatever PAGE_SIZE is). The operation
> > requires splitting all PMDs that map the page, but not the other way
> > around.
> >
> > > Another (possibly very naive) question is: could any of you hint me
> > > how the page dirty bit is finally applied to the PTEs? These two
> > > dirty flags confused me for a few days already (the SetPageDirty()
> > > one which sets the page dirty flag, and the pte_mkdirty() which sets
> > > that onto the real PTEs).
> >
> > The dirty bit from page table entries transfers to the struct page
> > flag and is used for decision making in the reclaim path.
>
> Thanks for explaining. It's much clearer to me.
>
> Though for the issue I have encountered, I am still confused about why
> that dirty bit can be ignored for the split PTEs. Indeed we have:
>
> 	if (pmd_dirty(old_pmd))
> 		SetPageDirty(page);
>
> However to me this only transfers (as you explained above) the dirty
> bit (AFAIU it's possibly set by the hardware when the page is written)
> to the page struct of the compound page. It does not really apply it
> to every small page of the split huge page. As you also explained,
> __split_huge_pmd() only splits the PMD entry but keeps the compound
> huge page there, so IMHO it should also apply the dirty bits from the
> huge page to all the small page entries, no?

The bit on the compound page represents all small subpages: PageDirty()
on any subpage will return true if the compound page is dirty.
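
If it helps, here is a toy model of that indirection (plain C, not the
kernel implementation; the real PageDirty() goes through the page-flags
policy macros, but for a THP the effect is the same):

	#include <assert.h>
	#include <stdbool.h>
	#include <stddef.h>

	/* Stand-in for struct page: one dirty bit, and tail pages
	 * point back at their head page. */
	struct page {
		bool dirty;		/* models PG_dirty on the head */
		struct page *head;	/* NULL if this is the head */
	};

	static struct page *compound_head(struct page *p)
	{
		return p->head ? p->head : p;
	}

	/* Models PageDirty(): the flag is read from the head page, so
	 * one bit covers every subpage of the compound page. */
	static bool page_dirty(struct page *p)
	{
		return compound_head(p)->dirty;
	}

	int main(void)
	{
		struct page thp[512] = { { 0 } };

		for (size_t i = 1; i < 512; i++)
			thp[i].head = &thp[0];

		/* SetPageDirty(page) done at PMD-split time... */
		thp[0].dirty = true;
		/* ...is visible through any of the 512 subpages. */
		assert(page_dirty(&thp[511]));
		return 0;
	}

So nothing is lost by setting the flag only once on the compound page.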

> These dirty bits are really important to my scenario since AFAIU the
> change_protection() call is using these dirty bits to decide whether
> it should append the WRITE bit - it finally corresponds to the lines
> in change_pte_range():
>
> 	/* Avoid taking write faults for known dirty pages */
> 	if (dirty_accountable && pte_dirty(ptent) &&
> 			(pte_soft_dirty(ptent) ||
> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
> 		ptent = pte_mkwrite(ptent);
> 	}
>
> So when mprotect() is applied to that range (my case is
> UFFDIO_WRITEPROTECT, which is similar), although we pass in the new
> protection with VM_WRITE, it'll still be masked out since the dirty
> bit is not set, and then the userspace program (in my case, the QEMU
> thread that handles write-protect faults) can never fix up the
> write-protected page fault.

I don't follow here. The code you are quoting above is an opportunistic
optimization and should not be mission-critical. The dirty and writable
bits can go away as soon as you drop the page table lock for the page.
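
To restate the check you quoted with comments (simplified stand-in
types and a made-up helper name, not the kernel code itself):

	#include <stdbool.h>

	/* Simplified stand-in for the PTE bits involved. */
	struct pte { bool dirty, soft_dirty, write; };

	/*
	 * Grant the write bit early only when the page is already known
	 * dirty, so that no write-protect fault is taken later just to
	 * set the dirty bit.  Skipping this is always safe: the first
	 * write takes a fault and the fault handler marks the page
	 * dirty and writable instead.  The soft-dirty clause only
	 * restricts the shortcut where it could interfere with
	 * soft-dirty tracking.
	 */
	static void maybe_set_write_early(struct pte *ptent,
					  bool dirty_accountable,
					  bool vma_softdirty)
	{
		if (dirty_accountable && ptent->dirty &&
		    (ptent->soft_dirty || !vma_softdirty))
			ptent->write = true;	/* pte_mkwrite(ptent) */
	}

Nothing guarantees those bits persist: reclaim or writeback can clean
the page the moment the page table lock is dropped, so correctness has
to come from the fault path, not from this shortcut.

-- 
Kirill A. Shutemov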