Received: by 2002:a25:824b:0:0:0:0:0 with SMTP id d11csp4515421ybn; Sat, 28 Sep 2019 02:07:29 -0700 (PDT) X-Google-Smtp-Source: APXvYqxKHNGOS6P1phau2DwsQ5qN9Klkdf+9bX4Y+vgdkeisPwfQKM5CuRDhHLVB61ZtTuBKgxcQ X-Received: by 2002:aa7:dc57:: with SMTP id g23mr9102973edu.38.1569661649212; Sat, 28 Sep 2019 02:07:29 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1569661649; cv=none; d=google.com; s=arc-20160816; b=Lh2ycnnUs2xeuIBi17cc8Jr20ndPDodJxB+uFeplYTNWLS+YOtGpisoi9vKi5obOx3 /5cmXPS1CtDpzG7BYdR7o2hk7J6WkSsQlO5Xq9jPEpdUzm4EDNvNN7NlAIA+Rzjd6CqN UseiGygDy24UUQ7ZW6PVfF3eSFYky5SntVJFK/KHekbxQT65pjAFujg4olZ8v8+gZ/bA QTHic1oFtHwHzgbBCvYm+10Z5UWTZMyzGoWCqUSonyeFxfxgGD/S5uR4wx5JBVOv/Q3p 6IaGOzF/yhq6Oje5QxwOiUuppdfo5G400p0KcB50paiAP1QBGiEdXa/n5u0LAmr2GfRW IwAw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:organization:autocrypt:openpgp:from:references:cc:to :subject; bh=x94+zUuK2HxOfQUYqyu/L97y0q3oJ9Ux/s2P8u06dOc=; b=bghP2+W26BtTgPgIEwsx4eEJc9omKJm2FIvr2seZBKEVJ2O8MWF7+Tt2zvNacvk+cI Rb9QAGAP3+5jmpa4UtTtocPqSCvFMK7O5tOWOUWLmwJoeJoO7Kcm3iLUJBR3ma9cDFrO nPVCTdY4NameAyIqO0oSt8UYWSSDS03zHZlr0H/n7OnujmXbWSSKNsGUf7W34q+XdhG8 y/P2c6tjeA/4oy7wnyEJQyn/Q7ZU5sfSSctSUgpAhnw2HqenYA0DmHtLD6eO4wLtZGHI rpdLrmNH0OrGQLb9zO8TCY0yzDzLwrWFuwdPYTx6Nw5EGdELc/P3jhiUQ7AmGlWkYc/U dKkg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id i9si3875839ejy.106.2019.09.28.02.07.03; Sat, 28 Sep 2019 02:07:29 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726581AbfI1JHA (ORCPT + 99 others); Sat, 28 Sep 2019 05:07:00 -0400 Received: from mx1.redhat.com ([209.132.183.28]:58230 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725857AbfI1JHA (ORCPT ); Sat, 28 Sep 2019 05:07:00 -0400 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 5BE4B3099F56; Sat, 28 Sep 2019 09:06:59 +0000 (UTC) Received: from [10.36.116.71] (ovpn-116-71.ams2.redhat.com [10.36.116.71]) by smtp.corp.redhat.com (Postfix) with ESMTP id 693B05D9C3; Sat, 28 Sep 2019 09:06:57 +0000 (UTC) Subject: Re: [PATCH] mm/page_alloc: fix a crash in free_pages_prepare() To: Alexander Duyck , Andrew Morton Cc: Qian Cai , Heiko Carstens , gor@linux.ibm.com, borntraeger@de.ibm.com, linux-s390@vger.kernel.org, linux-mm , LKML References: <1569613623-16820-1-git-send-email-cai@lca.pw> <20190927140222.6f7d0a41b9e734053ee911b9@linux-foundation.org> <1569619686.5576.242.camel@lca.pw> <20190927145945.846a3f3405d3af066827d3f5@linux-foundation.org> From: David Hildenbrand Openpgp: preference=signencrypt Autocrypt: addr=david@redhat.com; prefer-encrypt=mutual; keydata= xsFNBFXLn5EBEAC+zYvAFJxCBY9Tr1xZgcESmxVNI/0ffzE/ZQOiHJl6mGkmA1R7/uUpiCjJ dBrn+lhhOYjjNefFQou6478faXE6o2AhmebqT4KiQoUQFV4R7y1KMEKoSyy8hQaK1umALTdL QZLQMzNE74ap+GDK0wnacPQFpcG1AE9RMq3aeErY5tujekBS32jfC/7AnH7I0v1v1TbbK3Gp XNeiN4QroO+5qaSr0ID2sz5jtBLRb15RMre27E1ImpaIv2Jw8NJgW0k/D1RyKCwaTsgRdwuK Kx/Y91XuSBdz0uOyU/S8kM1+ag0wvsGlpBVxRR/xw/E8M7TEwuCZQArqqTCmkG6HGcXFT0V9 PXFNNgV5jXMQRwU0O/ztJIQqsE5LsUomE//bLwzj9IVsaQpKDqW6TAPjcdBDPLHvriq7kGjt WhVhdl0qEYB8lkBEU7V2Yb+SYhmhpDrti9Fq1EsmhiHSkxJcGREoMK/63r9WLZYI3+4W2rAc UucZa4OT27U5ZISjNg3Ev0rxU5UH2/pT4wJCfxwocmqaRr6UYmrtZmND89X0KigoFD/XSeVv jwBRNjPAubK9/k5NoRrYqztM9W6sJqrH8+UWZ1Idd/DdmogJh0gNC0+N42Za9yBRURfIdKSb B3JfpUqcWwE7vUaYrHG1nw54pLUoPG6sAA7Mehl3nd4pZUALHwARAQABzSREYXZpZCBIaWxk ZW5icmFuZCA8ZGF2aWRAcmVkaGF0LmNvbT7CwX4EEwECACgFAljj9eoCGwMFCQlmAYAGCwkI BwMCBhUIAgkKCwQWAgMBAh4BAheAAAoJEE3eEPcA/4Na5IIP/3T/FIQMxIfNzZshIq687qgG 8UbspuE/YSUDdv7r5szYTK6KPTlqN8NAcSfheywbuYD9A4ZeSBWD3/NAVUdrCaRP2IvFyELj xoMvfJccbq45BxzgEspg/bVahNbyuBpLBVjVWwRtFCUEXkyazksSv8pdTMAs9IucChvFmmq3 jJ2vlaz9lYt/lxN246fIVceckPMiUveimngvXZw21VOAhfQ+/sofXF8JCFv2mFcBDoa7eYob s0FLpmqFaeNRHAlzMWgSsP80qx5nWWEvRLdKWi533N2vC/EyunN3HcBwVrXH4hxRBMco3jvM m8VKLKao9wKj82qSivUnkPIwsAGNPdFoPbgghCQiBjBe6A75Z2xHFrzo7t1jg7nQfIyNC7ez MZBJ59sqA9EDMEJPlLNIeJmqslXPjmMFnE7Mby/+335WJYDulsRybN+W5rLT5aMvhC6x6POK z55fMNKrMASCzBJum2Fwjf/VnuGRYkhKCqqZ8gJ3OvmR50tInDV2jZ1DQgc3i550T5JDpToh dPBxZocIhzg+MBSRDXcJmHOx/7nQm3iQ6iLuwmXsRC6f5FbFefk9EjuTKcLMvBsEx+2DEx0E UnmJ4hVg7u1PQ+2Oy+Lh/opK/BDiqlQ8Pz2jiXv5xkECvr/3Sv59hlOCZMOaiLTTjtOIU7Tq 7ut6OL64oAq+zsFNBFXLn5EBEADn1959INH2cwYJv0tsxf5MUCghCj/CA/lc/LMthqQ773ga uB9mN+F1rE9cyyXb6jyOGn+GUjMbnq1o121Vm0+neKHUCBtHyseBfDXHA6m4B3mUTWo13nid 0e4AM71r0DS8+KYh6zvweLX/LL5kQS9GQeT+QNroXcC1NzWbitts6TZ+IrPOwT1hfB4WNC+X 2n4AzDqp3+ILiVST2DT4VBc11Gz6jijpC/KI5Al8ZDhRwG47LUiuQmt3yqrmN63V9wzaPhC+ xbwIsNZlLUvuRnmBPkTJwwrFRZvwu5GPHNndBjVpAfaSTOfppyKBTccu2AXJXWAE1Xjh6GOC 8mlFjZwLxWFqdPHR1n2aPVgoiTLk34LR/bXO+e0GpzFXT7enwyvFFFyAS0Nk1q/7EChPcbRb hJqEBpRNZemxmg55zC3GLvgLKd5A09MOM2BrMea+l0FUR+PuTenh2YmnmLRTro6eZ/qYwWkC u8FFIw4pT0OUDMyLgi+GI1aMpVogTZJ70FgV0pUAlpmrzk/bLbRkF3TwgucpyPtcpmQtTkWS gDS50QG9DR/1As3LLLcNkwJBZzBG6PWbvcOyrwMQUF1nl4SSPV0LLH63+BrrHasfJzxKXzqg rW28CTAE2x8qi7e/6M/+XXhrsMYG+uaViM7n2je3qKe7ofum3s4vq7oFCPsOgwARAQABwsFl BBgBAgAPBQJVy5+RAhsMBQkJZgGAAAoJEE3eEPcA/4NagOsP/jPoIBb/iXVbM+fmSHOjEshl KMwEl/m5iLj3iHnHPVLBUWrXPdS7iQijJA/VLxjnFknhaS60hkUNWexDMxVVP/6lbOrs4bDZ NEWDMktAeqJaFtxackPszlcpRVkAs6Msn9tu8hlvB517pyUgvuD7ZS9gGOMmYwFQDyytpepo YApVV00P0u3AaE0Cj/o71STqGJKZxcVhPaZ+LR+UCBZOyKfEyq+ZN311VpOJZ1IvTExf+S/5 lqnciDtbO3I4Wq0ArLX1gs1q1XlXLaVaA3yVqeC8E7kOchDNinD3hJS4OX0e1gdsx/e6COvy qNg5aL5n0Kl4fcVqM0LdIhsubVs4eiNCa5XMSYpXmVi3HAuFyg9dN+x8thSwI836FoMASwOl C7tHsTjnSGufB+D7F7ZBT61BffNBBIm1KdMxcxqLUVXpBQHHlGkbwI+3Ye+nE6HmZH7IwLwV W+Ajl7oYF+jeKaH4DZFtgLYGLtZ1LDwKPjX7VAsa4Yx7S5+EBAaZGxK510MjIx6SGrZWBrrV TEvdV00F2MnQoeXKzD7O4WFbL55hhyGgfWTHwZ457iN9SgYi1JLPqWkZB0JRXIEtjd4JEQcx +8Umfre0Xt4713VxMygW0PnQt5aSQdMD58jHFxTk092mU+yIHj5LeYgvwSgZN4airXk5yRXl SE+xAvmumFBY Organization: Red Hat GmbH Message-ID: <5ee9164f-71c5-4082-a80d-8fbc5dc50750@redhat.com> Date: Sat, 28 Sep 2019 11:06:56 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.8.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.41]); Sat, 28 Sep 2019 09:06:59 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 28.09.19 00:17, Alexander Duyck wrote: > On Fri, Sep 27, 2019 at 2:59 PM Andrew Morton wrote: >> >> On Fri, 27 Sep 2019 17:28:06 -0400 Qian Cai wrote: >> >>>> >>>> So I think you've moved the arch_free_page() to be after the final >>>> thing which can access page contents, yes? If so, we should have a >>>> comment in free_pages_prepare() to attmept to prevent this problem from >>>> reoccurring as the code evolves? >>> >>> Right, something like this above arch_free_page() there? >>> >>> /* >>> * It needs to be just above kernel_map_pages(), as s390 could mark those >>> * pages unused and then trigger a fault when accessing. >>> */ >> >> I did this. >> >> --- a/mm/page_alloc.c~mm-page_alloc-fix-a-crash-in-free_pages_prepare-fix >> +++ a/mm/page_alloc.c >> @@ -1179,7 +1179,13 @@ static __always_inline bool free_pages_p >> kernel_init_free_pages(page, 1 << order); >> >> kernel_poison_pages(page, 1 << order, 0); >> + /* >> + * arch_free_page() can make the page's contents inaccessible. s390 >> + * does this. So nothing which can access the page's contents should >> + * happen after this. >> + */ >> arch_free_page(page, order); >> + >> if (debug_pagealloc_enabled()) >> kernel_map_pages(page, 1 << order, 0); >> > > So the question I would have is what is the state of the page after it > has been marked unused and then pulled back in? I'm assuming it will > be all 0s. I think this comment relates to the s390x implementation, so I'll try to explain that. After arch_free_page() the page might have been zapped in the hypervisor, but that might happen deferred. The guest ends up triggering the ESSA instruction in arch_free_page(). That instruction sets some extended-page-table-related ("pgste") bits in the hypervisor tables for the guest ("gmap") and fills a buffer with these entries. The page is marked _PGSTE_GPS_USAGE_UNUSED. Once the buffer is full, the ESSA instruction intercepts to the hypervisor, where the hypervisor can go over all recorded entries and zap them *in case* the extended-page-table-related bits still indicate that the page is unused by the guest (_PGSTE_GPS_USAGE_UNUSED) or is marked to be logically zero (_PGSTE_GPS_ZERO). Zapping a page only happens if it's a pte_swap(pte) entry and effectively triggers a ptep_zap_unused() -> ptep_zap_swap_entry() -> free_swap_and_cache(). So I think it will be backed with the zero page when pulled back in. arch_alloc_page() will similarly trigger the ESSA instruction but only set the extended-page-table-related bits, so the entry is no longer _PGSTE_GPS_USAGE_UNUSED. This is basically to make sure a page won't get zapped in the hypervisor while it is already getting used by the guest again. The implementation on the KVM side resides in arch/s390/kvm/priv.c:handle_essa() but more importantly in arch/s390/kvm/priv.c:__do_essa() ("pure interpretation path skipping hardware interpretation completely"). Now, one interesting thing resides in arch/s390/kvm/priv.c:pgste_perform_essa(): /* If we are discarding a page, set it to logical zero */ if (res) pgstev |= _PGSTE_GPS_ZERO; So whenever we do an arch_free_page() in the guest, the page will immediately also be set in the hypervisor to _PGSTE_GPS_ZERO. However, I think setting the page logically to zero is just an "extended HW state" and will not actually result in the page reading zeroes before we actually zap it. I might be wrong and I only see one place where _PGSTE_GPS_ZERO actually gets cleared again, especially when setting a page stable (which looks bogus but as the documentation is confidential I have no idea what's happening there). Long story short: I think there is *no guarantee* that a) After arch_free_page(), the page is actually zeroed-out b) After arch_free_page() the page has been zapped in the hypervisor or will get zapped. c) After arch_alloc_page(), the page was actually zeroed-out. I might be wrong, depending on how _PGSTE_GPS_ZERO is actually used. However, *if* the page was zapped in the hypervisor (free_swap_and_cache()), I think it will get populated using the zero-page. Also, please not that s390x requires an arch_alloc_page() after an arch_free_page(). You cannot simply go ahead and reuse the page after arch_alloc_page(). > > I know with the work I am still doing on marking pages as unused this > ends up being an additional item that we will need to pay attention > to, however in our case we will just be initializing the page as zero > if we end up evicting it from the guest. Please note that if you are using MADV_FREE instead of MADV_DONTNEED in the hypervisor, you might end up with the same guarantees that s390x' implementation gives you. Could be, that the page was not zapped/zeroed out on the next access. Depends on if the hypervisor was feeling like zapping entries marked using MADV_FREE. -- Thanks, David / dhildenb