Subject: Re: [PATCH] hugetlbfs: Take read_lock on i_mmap for PMD sharing
To: Mike Kravetz, Matthew Wilcox, Andrew Morton, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Peter Zijlstra, Ingo Molnar, Will Deacon
References: <20191107190628.22667-1-longman@redhat.com>
 <20191107195441.GF11823@bombadil.infradead.org>
 <20191108020456.sulyjskhq3s5zcaa@linux-p48b>
 <5059733e-95aa-2c9e-6f5d-4f45f6a130b3@oracle.com>
From: Waiman Long
Organization: Red Hat
Date: Tue, 12 Nov 2019 12:27:40 -0500
In-Reply-To: <5059733e-95aa-2c9e-6f5d-4f45f6a130b3@oracle.com>

On 11/8/19 8:47 PM, Mike Kravetz wrote:
> On 11/8/19 11:10 AM, Mike Kravetz wrote:
>> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>>
>>>> Note that huge_pmd_share now increments the page count with the semaphore
>>>> held just in read mode.
>>>> It is OK to do increments in parallel without
>>>> synchronization. However, we don't want anyone else changing the count
>>>> while that check in huge_pmd_unshare is happening. Hence, the need for
>>>> taking the semaphore in write mode.
>>> This would be a nice addition to the changelog methinks.
>> Last night I remembered there is one place where we currently take
>> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare. That
>> is in try_to_unmap_one. Yes, there is a potential race here today.
> Actually there is no race there today. Callers to huge_pmd_unshare
> hold the page table lock. So, this synchronizes those unshare calls
> from page migration and page poisoning.
>
>> But that race is somewhat contained as you need two threads doing some
>> combination of page migration and page poisoning to race. This change
>> now allows migration or poisoning to race with page fault. I would
>> really prefer if we do not open up the race window in this manner.
> But, we do open a race window by changing huge_pmd_share to take the
> i_mmap_rwsem in read mode as in the original patch.
>
> Here is the additional code needed to take the semaphore in write mode
> for the huge_pmd_unshare calls via try_to_unmap_one. We would need to
> combine this with Longman's patch. Please take a look and provide feedback.
> Some of the changes are subtle, especially the exception for MAP_PRIVATE
> mappings, but I tried to add sufficient comments.
>
> From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
> From: Mike Kravetz
> Date: Fri, 8 Nov 2019 17:25:37 -0800
> Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
>  call to huge_pmd_unshare in try_to_unmap_one.
>
> Signed-off-by: Mike Kravetz
> ---
>  mm/hugetlb.c        |  9 ++++++++-
>  mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
>  mm/migrate.c        | 27 +++++++++++++++++++++++++--
>  3 files changed, 60 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f78891f92765..73d9136549a5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>   * indicated by page_count > 1, unmap is achieved by clearing pud and
>   * decrementing the ref count. If count == 1, the pte page is not shared.
>   *
> - * called with page table lock held.
> + * Must be called while holding page table lock.
> + * In general, the caller should also hold the i_mmap_rwsem in write mode.
> + * This is to prevent races with page faults calling huge_pmd_share which
> + * will not be holding the page table lock, but will be holding i_mmap_rwsem
> + * in read mode. It is possible to call without holding i_mmap_rwsem in
> + * write mode if the caller KNOWS the page table is associated with a private
> + * mapping. This is because private mappings can not share PMDs and can
> + * not race with huge_pmd_share calls during page faults.

So the page table lock here is huge_pte_lock(), right? In
huge_pmd_share(), the pte lock has to be taken before one can share it.
So would you mind explaining where exactly the race is?
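For reference, the logic I am looking at is roughly the following (a
simplified sketch of huge_pmd_unshare(), paraphrased from mm/hugetlb.c;
exact details may differ between kernel versions):

/*
 * Simplified sketch of huge_pmd_unshare(), not the exact upstream
 * source. The caller must hold the page table lock -- that is,
 * huge_pte_lock() -- for ptep when calling this.
 */
int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
{
	pgd_t *pgd = pgd_offset(mm, *addr);
	p4d_t *p4d = p4d_offset(pgd, *addr);
	pud_t *pud = pud_offset(p4d, *addr);

	BUG_ON(page_count(virt_to_page(ptep)) == 0);
	if (page_count(virt_to_page(ptep)) == 1)
		return 0;		/* PMD page table page is not shared */

	pud_clear(pud);			/* unhook the shared PMD page */
	put_page(virt_to_page(ptep));	/* drop this mm's reference */
	mm_dec_nr_pmds(mm);
	*addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE;
	return 1;			/* successfully unshared */
}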
Thanks,
Longman

>   *
>   * returns: 1 successfully unmapped a shared pte page
>   *	    0 the underlying pte page is not shared, or it is the last user
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3151c87dff73..8f52b22cf71b 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1030,7 +1030,33 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
>  	if (kill)
>  		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
>
> -	unmap_success = try_to_unmap(hpage, ttu);
> +	if (!PageHuge(hpage)) {
> +		unmap_success = try_to_unmap(hpage, ttu);
> +	} else {
> +		mapping = page_mapping(hpage);
> +		if (mapping) {
> +			/*
> +			 * For hugetlb pages, try_to_unmap could potentially
> +			 * call huge_pmd_unshare. Because of this, take
> +			 * semaphore in write mode here and set TTU_RMAP_LOCKED
> +			 * to indicate we have taken the lock at this higher
> +			 * level.
> +			 */
> +			i_mmap_lock_write(mapping);
> +			unmap_success = try_to_unmap(hpage,
> +							ttu|TTU_RMAP_LOCKED);
> +			i_mmap_unlock_write(mapping);
> +		} else {
> +			/*
> +			 * !mapping implies a MAP_PRIVATE huge page mapping.
> +			 * Since PMDs will never be shared in a private
> +			 * mapping, it is safe to let huge_pmd_unshare be
> +			 * called with the semaphore in read mode.
> +			 */
> +			unmap_success = try_to_unmap(hpage, ttu);
> +		}
> +	}
> +
>  	if (!unmap_success)
>  		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
>  		       pfn, page_mapcount(hpage));
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 4fe45d1428c8..9cae5a4f1e48 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1333,8 +1333,31 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
>  		goto put_anon;
>
>  	if (page_mapped(hpage)) {
> -		try_to_unmap(hpage,
> -			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
> +		struct address_space *mapping = page_mapping(hpage);
> +
> +		if (mapping) {
> +			/*
> +			 * try_to_unmap could potentially call huge_pmd_unshare.
> +			 * Because of this, take semaphore in write mode here
> +			 * and set TTU_RMAP_LOCKED to indicate we have taken
> +			 * the lock at this higher level.
> +			 */
> +			i_mmap_lock_write(mapping);
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS|TTU_RMAP_LOCKED);
> +			i_mmap_unlock_write(mapping);
> +		} else {
> +			/*
> +			 * !mapping implies a MAP_PRIVATE huge page mapping.
> +			 * Since PMDs will never be shared in a private
> +			 * mapping, it is safe to let huge_pmd_unshare be
> +			 * called with the semaphore in read mode.
> +			 */
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS);
> +		}
>  		page_was_mapped = 1;
>  	}
>