Subject: Re: [PATCH v4 2/2] kvm: Use huge pages for DAX-backed files
From: Liran Alon
Date: Thu, 12 Dec 2019 20:32:17 +0200
To: Dan Williams
Cc: Barret Rhoden, Paolo Bonzini, David Hildenbrand, Dave Jiang, Alexander Duyck, linux-nvdimm, X86 ML, KVM list, Linux Kernel Mailing List, "Zeng, Jason"
Message-Id: <5F859A55-C964-4362-9A25-3F4BA72E7326@oracle.com>
References: <20191211213207.215936-1-brho@google.com> <20191211213207.215936-3-brho@google.com> <376DB19A-4EF1-42BF-A73C-741558E397D4@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org

> On 12 Dec 2019, at 19:59, Dan Williams wrote:
>
> On Thu, Dec 12, 2019 at 9:39 AM Liran Alon wrote:
>>
>>> On 12 Dec 2019, at 18:54, Dan Williams wrote:
>>>
>>> On Thu, Dec 12, 2019 at 4:34 AM Liran Alon wrote:
>>>>
>>>>> On 11 Dec 2019, at 23:32, Barret Rhoden wrote:
>>>>>
>>>>> This change allows KVM to map DAX-backed files made of huge pages with
>>>>> huge mappings in the EPT/TDP.
>>>>>
>>>>> DAX pages are not PageTransCompound. The existing check is trying to
>>>>> determine if the mapping for the pfn is a huge mapping or not. For
>>>>> non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
>>>>> For DAX, we can check the page table itself.
>>>>
>>>> For hugetlbfs pages, tdp_page_fault() -> mapping_level() -> host_mapping_level() -> kvm_host_page_size() -> vma_kernel_pagesize()
>>>> will return the page-size of the hugetlbfs without the need to parse the page-tables.
>>>> See vma->vm_ops->pagesize() callback implementation at hugetlb_vm_ops->pagesize()==hugetlb_vm_op_pagesize().
>>>>
>>>> Only for pages that were originally mapped as small-pages and later merged to larger pages by THP, there is a need to check for PageTransCompound(). Again, instead of parsing page-tables.
>>>>
>>>> Therefore, it seems more logical to me that:
>>>> (a) If DAX-backed files are mapped as large-pages to userspace, it should be reflected in vma->vm_ops->pagesize() of that mapping, causing kvm_host_page_size() to return the right size without the need to parse the page-tables.
>>>
>>> A given dax-mapped vma may have mixed page sizes so ->pagesize()
>>> can't be used reliably to enumerate the mapping size.
>>
>> Naive question: Why not split the VMA in this case to multiple VMAs with different results for ->pagesize()?
>
> Filesystems traditionally have not populated ->pagesize() in their
> vm_operations, there was no compelling reason to go add it and the
> complexity seems prohibitive.

I understand. Though this is technical debt that breaks ->pagesize() semantics, which might cause a complex bug some day...

>
>> What you are describing sounds like DAX is breaking this callback’s semantics in an unpredictable manner.
>
> It's not unpredictable. vma_kernel_pagesize() returns PAGE_SIZE.

Of course. :) I meant it may be unexpected by the caller.

> Huge
> pages in the page cache have a similar issue.

Ok. I didn’t know that. Thanks for the explanation.

>
>>>> (b) If DAX-backed files small-pages can be later merged to large-pages by THP, then the “struct page” of these pages should be modified as usual to make PageTransCompound() return true for them. I’m not highly familiar with this mechanism, but I would expect THP to be able to merge DAX-backed files small-pages to large-pages in case DAX provides “struct page” for the DAX pages.
>>>
>>> DAX pages do not participate in THP and do not have the
>>> PageTransCompound accounting. The only mechanism that records the
>>> mapping size for dax is the page tables themselves.
>>
>> What is the rationale behind this? Given that DAX pages can be described with “struct page” (i.e. ZONE_DEVICE), what prevents THP from manipulating page-tables to merge multiple DAX PFNs to a larger page?
>
> THP accounting is a function of the page allocator. ZONE_DEVICE pages
> are excluded from the page allocator. ZONE_DEVICE is just enough
> infrastructure to support pfn_to_page(), page_address(), and
> get_user_pages(). Other page allocator services beyond that are not
> present.

Ok.
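
The three cases discussed in this thread (hugetlbfs, THP, DAX) can be summarized in a small userspace C model. This is only a sketch under stated assumptions: every `model_*` name and `MODEL_*` constant is hypothetical and is not a kernel API, the 4 KiB and 2 MiB sizes are assumed x86-64 values, and the real KVM code paths are more involved. It shows why a DAX mapping looks like a base-page mapping to anything that only asks the VMA or the struct page, and why the page-table level is the remaining source of truth.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of these names are kernel APIs. */
#define MODEL_PAGE_SIZE (4096UL)            /* assumed base page size */
#define MODEL_PMD_SIZE  (2UL * 1024 * 1024) /* assumed huge page size */

enum model_backing { MODEL_ANON_THP, MODEL_HUGETLBFS, MODEL_DAX };

struct model_vma {
    enum model_backing backing;
    /* Stands in for vm_ops->pagesize; NULL when the backing does not set it. */
    unsigned long (*pagesize)(const struct model_vma *vma);
};

struct model_page {
    bool trans_compound;       /* stands in for PageTransCompound() */
    unsigned long pt_map_size; /* size of the page-table entry mapping the pfn */
};

/* hugetlbfs-style callback: the VMA itself knows its page size. */
static unsigned long model_hugetlb_pagesize(const struct model_vma *vma)
{
    (void)vma;
    return MODEL_PMD_SIZE;
}

/* As described in the thread: without a callback, report the base page size. */
static unsigned long model_vma_kernel_pagesize(const struct model_vma *vma)
{
    if (vma->pagesize)
        return vma->pagesize(vma);
    return MODEL_PAGE_SIZE;
}

/* Pick the host mapping size using the source of truth for each backing. */
static unsigned long model_host_mapping_size(const struct model_vma *vma,
                                             const struct model_page *page)
{
    unsigned long size = model_vma_kernel_pagesize(vma);

    if (size > MODEL_PAGE_SIZE)
        return size;              /* hugetlbfs: the VMA callback reports it */
    if (vma->backing == MODEL_DAX)
        return page->pt_map_size; /* DAX: only the page tables record it */
    if (page->trans_compound)
        return MODEL_PMD_SIZE;    /* THP: recorded in struct page state */
    return MODEL_PAGE_SIZE;
}
```

In this model, a DAX VMA with no `pagesize` callback reports `MODEL_PAGE_SIZE` from `model_vma_kernel_pagesize()` even when the pfn is mapped by a 2 MiB entry, so `model_host_mapping_size()` has to consult `pt_map_size` to find the huge mapping.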