Date: Mon, 18 Oct 2021 18:13:35 +0100
From: Catalin Marinas
To: Linus Torvalds
Cc: Al Viro, Andreas Gruenbacher, Christoph Hellwig, "Darrick J. Wong",
	Jan Kara, Matthew Wilcox, cluster-devel, linux-fsdevel,
	Linux Kernel Mailing List, ocfs2-devel@oss.oracle.com,
	Josef Bacik, Will Deacon
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 12, 2021 at 10:58:46AM -0700, Linus Torvalds wrote:
> On Tue, Oct 12, 2021 at 10:27 AM Catalin Marinas wrote:
> > Apart from fault_in_pages_*(), there's also fault_in_user_writeable()
> > called from the futex code which uses the GUP mechanism as the write
> > would be destructive. It looks like it could potentially trigger the
> > same infinite loop on -EFAULT.
>
> Hmm.
> I think the reason we do fault_in_user_writeable() using GUP is that
>
>  (a) we can avoid the page fault overhead
>
>  (b) we don't have any good "atomic_inc_user()" interface or similar
>      that could do a write with a zero increment or something like that.
>
> We do have that "arch_futex_atomic_op_inuser()" thing, of course. It's
> all kinds of crazy, but we *could* do
>
>	arch_futex_atomic_op_inuser(FUTEX_OP_ADD, 0, &dummy, uaddr);
>
> instead of doing the fault_in_user_writeable().
>
> That might be a good idea anyway. I dunno.

I gave this a quick try for futex (though MTE is not affected at the
moment):

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/sub-page-faults

However, I still have doubts about fault_in_pages_*() probing every 16
bytes, especially if one decides to change these routines to be
GUP-based.

> > A more invasive change would be to return a different error for such
> > faults like -EACCES and treat them differently in the caller.
>
> That's _really_ hard for things like "copy_to_user()", that isn't a
> single operation, and is supposed to return the bytes left.
>
> Adding another error return would be nasty.
>
> We've had hacks like "squirrel away the actual error code in the task
> structure", but that tends to be unmaintainable because we have
> interrupts (and NMI's) doing their own possibly nested atomics, so
> even disabling preemption won't actually fix some of the nesting
> issues.

I think we can do something similar to __get_user_error() on arm64. We
can keep __copy_to_user_inatomic() etc. returning the number of bytes
left but change the exception handling path in those routines to set an
error code or boolean through a pointer passed in at uaccess call time.
The caller would do something along these lines:

	bool page_fault;

	left = copy_to_user_inatomic(dst, src, size, &page_fault);
	if (left && page_fault)
		goto repeat_fault_in;

copy_to_user_nofault() could also change its return value from -EFAULT
to something else, based on whether page_fault was set or not.

Most architectures would use a generic copy_to_user_inatomic() wrapper
where page_fault == true for any fault. Arm64 needs some adjustment to
the uaccess fault handling to pass the fault code down to the exception
path. This way, at least on arm64, I don't think an interrupt or NMI
would be problematic.

> All of these things make me think that the proper fix ends up being to
> make sure that our "fault_in_xyz()" functions simply should always
> handle all faults.
>
> Another option may be to teach the GUP code to actually check
> architecture-specific sub-page ranges.

Teaching GUP about this is likely to be expensive. A put_user() probe on
arm64 uses an STTR instruction that runs with user privileges on the
user address and with the user's tag checking mode. The GUP code for
MTE, OTOH, would need to explicitly read the tag from memory and compare
it with the user pointer's tag (which is normally cleared in the GUP
code by untagged_addr()).

To me it makes more sense for the fault_in_*() functions to deal only
with the permissions the kernel controls, i.e. the pte. Sub-page
permissions like MTE or CHERI are controlled directly by the user, so
the kernel cannot fix them up anyway. Rather than overloading
fault_in_*() with additional checks, I think we should expand the
in-atomic uaccess API to cover the type of fault.

-- 
Catalin