Received: by 2002:a05:6a10:6006:0:0:0:0 with SMTP id w6csp1221895pxa; Fri, 28 Aug 2020 07:07:55 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwKV9QI4qdxJ9ror5yLVeBL1ihQKNJUCnLrYxjkJmGiCgi/a5nafF0lD1Da0kDAYqKCMDzl X-Received: by 2002:a17:907:262e:: with SMTP id aq14mr2143218ejc.79.1598623675556; Fri, 28 Aug 2020 07:07:55 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1598623675; cv=none; d=google.com; s=arc-20160816; b=vUv+oStjV9EJoQY+tVPe2AF/5aWWs2+8WWd4Nx55LzCQgsehXpkfVYCRHVXwMhmg/3 VmCw0e0AA4aLPZBhv/CwTPSgYjGPZmrYWk7FHgSIE+rE5Cw3dXAU2pdX6YDrOnxbKWmz BZgXwovHX+5iBEoGBSNs31rSzvf8Uk1wT7QkWGBYHEpv29p9kTGE8Dl3Ma7mp3lgPUQo 899/nssKOZHZIBwgjZ8RtXSY3z/Sjq2NvFST4lYNyNOL0MDfqbrMgQuhO5LyFTV4t329 fXlO1XSpjE9o12sKg+jgsdbHrcZPTALBfrDQGcPn0HUSqozwd2/x0WQuUxlQcukFrSpD WByg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:message-id:date:subject:cc:to:from :dkim-signature; bh=ERoOd0Q/7NlQ0VKtxMcS5uxz9ImhNBo0Vidm9kzUOYo=; b=dHn+RVRfnYi+/o030sz//NzliatmhSRr8NbJo3ELjTwsDv23ZK8uQ6xj/8tEASXdRo FVH5L5BbDGFVJsnD3DDf9ErYOEHlA4HjoP6fmgLRgJpUI7fpvMmuXQX06sBlhdGvS7ze f7rPrs1shuQYCGmIKwrR1wWBEysiMTdWYzdPJEMX70vAZG6IxqdvxwOVUfn3CXO1Ybte abexkMh0xa0wAcLYmqqz5cdhrRiYpXUkSk0mlYsCkEgfJNNXGdeALsU8ZYcgfgy70Kwk +lVaOJVsg2FhN+DFl15FJXzUAQnXAGeMR8Wy29yRdBWj4y9BbXNCHjFBi3mODrpCRQMd 6DwA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@ibm.com header.s=pp1 header.b=InVcVV9y; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=ibm.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id p24si722888ejf.495.2020.08.28.07.07.32; Fri, 28 Aug 2020 07:07:55 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@ibm.com header.s=pp1 header.b=InVcVV9y; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=ibm.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726141AbgH1OEP (ORCPT + 99 others); Fri, 28 Aug 2020 10:04:15 -0400 Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:34312 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727108AbgH1OEE (ORCPT ); Fri, 28 Aug 2020 10:04:04 -0400 Received: from pps.filterd (m0098396.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 07SE3ZrU029648; Fri, 28 Aug 2020 10:03:43 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ibm.com; h=from : to : cc : subject : date : message-id; s=pp1; bh=ERoOd0Q/7NlQ0VKtxMcS5uxz9ImhNBo0Vidm9kzUOYo=; b=InVcVV9yMS6plnwQJg8J6G7DpMxcIkh84j2ZVOt1YvVl0c7xcoGNXFSz3TuvYqeunij0 vKY0noKsXZpf10GS5rCeaka48hCedfqt7WuLyhMTOGEubPF98du7Mt680ZS0lmUadf4f QK65zhg8B+Z/eSmtfavk3huobMUdMi1xM6YS9FUPog2VSjXp9offRbJkFIFaM82ZnNIi S92QxFzsD5Azp0pdh00M5Okt6W7c/PoqH11P6vYGfEZCvAW4vOciIAluaND89GVY1/Vh rmMJL35pZ64ltJL/ndvHLhly9mNnCi7Z5MV7MArK8xK8294edCI2eqMrgX9jenHLE2PW JA== Received: from ppma02fra.de.ibm.com (47.49.7a9f.ip4.static.sl-reverse.com [159.122.73.71]) by mx0a-001b2d01.pphosted.com with ESMTP id 3371cabu5u-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 28 Aug 2020 10:03:42 -0400 Received: from pps.filterd (ppma02fra.de.ibm.com [127.0.0.1]) by ppma02fra.de.ibm.com (8.16.0.42/8.16.0.42) with SMTP id 07SE2YTY003363; Fri, 28 Aug 2020 14:03:40 GMT Received: from b06avi18626390.portsmouth.uk.ibm.com (b06avi18626390.portsmouth.uk.ibm.com [9.149.26.192]) by ppma02fra.de.ibm.com with ESMTP id 336fvdrkfw-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 28 Aug 2020 14:03:40 +0000 Received: from d06av24.portsmouth.uk.ibm.com (mk.ibm.com [9.149.105.60]) by b06avi18626390.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 07SE26SV49742244 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 28 Aug 2020 14:02:06 GMT Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 768C742047; Fri, 28 Aug 2020 14:03:37 +0000 (GMT) Received: from d06av24.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 12BCA42045; Fri, 28 Aug 2020 14:03:37 +0000 (GMT) Received: from tuxmaker.boeblingen.de.ibm.com (unknown [9.152.85.9]) by d06av24.portsmouth.uk.ibm.com (Postfix) with ESMTP; Fri, 28 Aug 2020 14:03:37 +0000 (GMT) From: Gerald Schaefer To: Linus Torvalds , Andrew Morton Cc: linux-mm , LKML , Vasily Gorbik , Alexander Gordeev , linux-s390@vger.kernel.org, Heiko Carstens , Claudio Imbrenda , Christian Borntraeger Subject: [RFC PATCH 0/2] mm/gup: fix gup_fast with dynamic page table folding Date: Fri, 28 Aug 2020 16:03:12 +0200 Message-Id: <20200828140314.8556-1-gerald.schaefer@linux.ibm.com> X-Mailer: git-send-email 2.17.1 X-TM-AS-GCONF: 00 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.235,18.0.687 definitions=2020-08-28_08:2020-08-28,2020-08-28 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 clxscore=1011 suspectscore=0 spamscore=0 bulkscore=0 mlxscore=0 phishscore=0 malwarescore=0 mlxlogscore=999 priorityscore=1501 adultscore=0 impostorscore=0 lowpriorityscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2006250000 definitions=main-2008280108 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") introduced a subtle but severe bug on s390 with gup_fast, due to dynamic page table folding. The question "What would it require for the generic code to work for s390" has already been discussed here https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 and ended with a promising approach here https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1 which in the end unfortunately didn't quite work completely. We tried to mimic static level folding by changing pgd_offset to always dynamically calculate top level page table offset, and do nothing in folded pXd_offset. What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do not reflect this dynamic behaviour, and still act like static 5-level page tables. Here is an example of what happens with gup_fast on s390, for a task with 3-levels paging, crossing a 2 GB pud boundary: // addr = 0x1007ffff000, end = 0x10080001000 static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; // pud_offset returns &p4d itself (a pointer to a value on stack) pudp = pud_offset(&p4d, addr); do { // on second iteratation reading "random" stack value pud_t pud = READ_ONCE(*pudp); // next = 0x10080000000 next = pud_addr_end(addr, end); ... } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack return 1; } pud_addr_end = 0x10080000000 is correct, but the previous pgd/p4d_addr_end should also have returned that limit, instead of the 5-level static pgd/p4d limits with PUD_SIZE/MASK != PGDIR_SIZE/MASK. Then the "end" parameter for gup_pud_range would also have been 0x10080000000, and we would not iterate further in gup_pud_range, but rather go back and (correctly) do it in gup_pgd_range. So, for the second iteration in gup_pud_range, we will increase pudp, which pointed to a stack value and not the real pud table. This new pudp will then point to whatever lies behind the p4d stack value. In general, this happens to be the previously read pgd, but it probably could also be something different, depending on compiler decisions. Most unfortunately, if it happens to be the pgd value, which is the same as the p4d / pud due to folding, it is a valid and present entry. So after the increment, we would still point to the same pud entry. The addr however has been increased in the second iteration, so that we now have different pmd/pte_index values, which will result in very wrong behaviour for the remaining gup_pmd/pte_range calls. We will effectively operate on an address minus 2 GB, due to missing pudp increase. In the "good case", if nothing is mapped there, we will fall back to the slow gup path. But if something is mapped there, and valid for gup_fast, we will end up (silently) getting references on the wrong pages and also add the wrong pages to the **pages result array. This can cause data corruption. We came up with two possible fix-ups, both will introduce some gup-specific helper functions, which will have no effect on other archs than s390. Patch 1 is the solution that has already been discussed in https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 It will additionally pass along pXd pointers in gup_pXd_range, and also introduce pXd_offset_orig for s390, which takes an additional pXd entry value parameter. This allows correctly returning / incrementing pointers while still preseving the READ_ONCE logic for gup_fast. Patch 2 would instead introduce new gup_pXd_addr_end helpers, which take an additional pXd entry value parameter, that can be used on s390 to determine the correct page table level and return corresponding end / boundary. With that, the pointer iteration will always happen in gup_pgd_range. Comments / preferences are welcome. As a last resort, we might also revert back to the s390-specific gup_fast code, if none of the options are agreeable. Regards, Gerald