Received: by 2002:a05:6a10:9848:0:0:0:0 with SMTP id x8csp4357205pxf; Tue, 30 Mar 2021 06:05:34 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwvINUe0SuM/BIGYutH7wFLCXGFKHMM0XFYX3o8fF8zj3KS4wxVrGF2Wb/JoYWW4HzYh8Ab X-Received: by 2002:aa7:da97:: with SMTP id q23mr33064701eds.180.1617109534746; Tue, 30 Mar 2021 06:05:34 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1617109534; cv=none; d=google.com; s=arc-20160816; b=DoMjl3OkGcuc42+xDBr3r3zWIgYP3XxiKdE2KlhIdqfrlUGw3KuLCIWGpUAGGGaY3j sk0YI/kEwwGXT6Jr4cqTQXliTy2OqxSqJVto3mMCcM8oZQ/AGoB/mpnOAe+VMzNzSwub 8eosZNZB3jGp4j3fioD8N728QK0RXQD1HSw6aLXbqe4cv2mV5aiOi+galDMLI8PWbOyO gsQgLngQtYtym/m283C+jleQXax74AYDOS8mWTlxtcn0rRvtZEqX0cLeM1f4mPPYinuA T/OXqJVzZ5vF+7AD8eojttzEnjdgjoE3ogsFCm0STuUxaub8FMLigGNwXKSodtwx+io+ axYg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:date:cc:to:from:subject :message-id:dkim-signature; bh=X2EGDZ9BOxE5WHmYu74k0fWpD97KnsQjPUvAY9jam48=; b=Ggw7EW125JxvIgzaZBl8Yxp2A/4aRT9SYC/1PzwH1bG+Tc3XECesdy9fBhIicvs+qf T+jx46BTTbScyRcGC8dtZdZ7rSHl34T6wEt8vmO+75GXcqpTEWjvZhtpvOa9BPA6ZZuS iCocLWLO2I0caCH9TfJaBqHFF99ry6OA+ShiT6rAlwJCxb1g+hAbE4IoPkhy9YzltjAn 86eDseP+kzONoVX85geVqFoe7t04Zneln8WF4VGS5BvodO1KbIF7XOuofIccsmqiCh7K eGGM17f1f384Lnvzr3JON8j6x6szCFuwUEmSZoj1iofIdqHi09gOVNctZzBHIzWNe735 pGzw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@mengyan1223.wang header.s=mail header.b=X9+iU8o1; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=mengyan1223.wang Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id qh20si14212517ejb.749.2021.03.30.06.05.12; Tue, 30 Mar 2021 06:05:34 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@mengyan1223.wang header.s=mail header.b=X9+iU8o1; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=mengyan1223.wang Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231848AbhC3NDi (ORCPT + 99 others); Tue, 30 Mar 2021 09:03:38 -0400 Received: from mengyan1223.wang ([89.208.246.23]:60036 "EHLO mengyan1223.wang" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232051AbhC3NDO (ORCPT ); Tue, 30 Mar 2021 09:03:14 -0400 Received: from [IPv6:240e:35a:1037:8a00:70b2:e35d:833c:af3e] (unknown [IPv6:240e:35a:1037:8a00:70b2:e35d:833c:af3e]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-256) server-signature ECDSA (P-384) server-digest SHA384) (Client did not present a certificate) (Authenticated sender: xry111@mengyan1223.wang) by mengyan1223.wang (Postfix) with ESMTPSA id D93746594D; Tue, 30 Mar 2021 09:02:21 -0400 (EDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mengyan1223.wang; s=mail; t=1617109392; bh=X2EGDZ9BOxE5WHmYu74k0fWpD97KnsQjPUvAY9jam48=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=X9+iU8o1vAmsAaz3z1YxHtzBYrzVoeUMfn9h4Rp4s308CHVoVU+a6Gm9GdlKHJIMU 8AsSOQYR5ryPtMKkHfCfZfbP5mZ4PMDu2fV1zZ3RkV9HTpo4YMJDD9fxB9ZNrThDVp 2uc59oXova4FbxZPKF/a7sHqurDtU9fbphbsLv0MumyvPPZOz22iYYeREiLRFMAex6 DqUuuwGBvBS3eb3NGt5UN99Duq2/c+qsmT08bdCP+ynIaneJcJbIwSPp0uyXDPerAE CXRmp4Mh4tSVfoIWsyMTPG9F2FbyLDEFWt3fZGWep5oq/X1+bqoa4hUDB7QIxGTEKv 5iyqqUE0IyIVw== Message-ID: Subject: Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems From: Xi Ruoyao To: Christian =?ISO-8859-1?Q?K=F6nig?= , Christian =?ISO-8859-1?Q?K=F6nig?= , Alex Deucher Cc: David Airlie , Felix Kuehling , linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, Dan =?ISO-8859-1?Q?Hor=E1k?= , amd-gfx@lists.freedesktop.org, Daniel Vetter , stable@vger.kernel.org Date: Tue, 30 Mar 2021 21:02:08 +0800 In-Reply-To: References: <20210329175348.26859-1-xry111@mengyan1223.wang> <9a11c873-a362-b5d1-6d9c-e937034e267d@gmail.com> <84b3911173ad6beb246ba0a77f93d888ee6b393e.camel@mengyan1223.wang> <97c520ce107aa4d5fd96e2c380c8acdb63d45c37.camel@mengyan1223.wang> <7701fb71-9243-2d90-e1e1-d347a53b7d77@gmail.com> <368b9b1b7343e35b446bb1028ccf0ae75dc2adc4.camel@mengyan1223.wang> <71e3905a5b72c5b97df837041b19175540ebb023.camel@mengyan1223.wang> Content-Type: text/plain; charset="UTF-8" User-Agent: Evolution 3.40.0 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2021-03-30 14:55 +0200, Christian König wrote: > Am 30.03.21 um 14:04 schrieb Xi Ruoyao: > > On 2021-03-30 03:40 +0800, Xi Ruoyao wrote: > > > On 2021-03-29 21:36 +0200, Christian König wrote: > > > > Am 29.03.21 um 21:27 schrieb Xi Ruoyao: > > > > > Hi Christian, > > > > > > > > > > I don't think there is any constraint implemented to ensure `num_entries > > > > > % > > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`.  For example, in > > > > > `amdgpu_vm_bo_map()`: > > > > > > > > > >           /* validate the parameters */ > > > > >           if (saddr & AMDGPU_GPU_PAGE_MASK || offset & > > > > > AMDGPU_GPU_PAGE_MASK > > > > >               size == 0 || size & AMDGPU_GPU_PAGE_MASK) > > > > >                   return -EINVAL; > > > > > > > > > > /* snip */ > > > > > > > > > >           saddr /= AMDGPU_GPU_PAGE_SIZE; > > > > >           eaddr /= AMDGPU_GPU_PAGE_SIZE; > > > > > > > > > > /* snip */ > > > > > > > > > >           mapping->start = saddr; > > > > >           mapping->last = eaddr; > > > > > > > > > > > > > > > If we really want to ensure (mapping->last - mapping->start + 1) % > > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace > > > > > "AMDGPU_GPU_PAGE_MASK" > > > > > in "validate the parameters" with "PAGE_MASK". > > It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of > > "AMDGPU_GPU_PAGE_MASK" :(. > > > > > > Yeah, good point. > > > > > > > > > I tried it and it broke userspace: Xorg startup fails with EINVAL with > > > > > this > > > > > change. > > > > Well in theory it is possible that we always fill the GPUVM on a 4k > > > > basis while the native page size of the CPU is larger. Let me double > > > > check the code. > > On my platform, there are two issues both causing the VM corruption.  One is > > fixed by agd5f/linux@fe001e7. > > What is fe001e7? A commit id? If yes then that is to short and I can't > find it. > > >    Another is in Mesa from userspace:  it uses > > `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver > > expects it to use `dev_info->virtual_address_alignment`. > > Mhm, looking at the kernel code I would rather say Mesa is correct and > the kernel driver is broken. > > The gart_page_size is limited by the CPU page size, but the > virtual_address_alignment isn't. > > > If we can make the change to fill the GPUVM on a 4k basis, we can fix this > > issue > > and make virtual_address_alignment = 4K.  Otherwise, we should fortify the > > parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK".  Then > > the > > userspace will just get an EINVAL, instead of a slient VM corruption.  And > > someone should tell Mesa developers to fix the code in this case. > > I rather see this as a kernel bug. Can you test if this code fragment > fixes your issue: > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > index 64beb3399604..e1260b517e1b 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void > *data, struct drm_file *filp) >                  } >                  dev_info->virtual_address_alignment = > max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE); >                  dev_info->pte_fragment_size = (1 << > adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE; > -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE; > +               dev_info->gart_page_size = > dev_info->virtual_address_alignment; >                  dev_info->cu_active_number = adev->gfx.cu_info.number; >                  dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask; >                  dev_info->ce_ram_size = adev->gfx.ce_ram_size; It works. I've seen it at https://github.com/xen0n/linux/commit/84ada72983838bd7ce54bc32f5d34ac5b5aae191 before (with a common sub-expression, though :). -- Xi Ruoyao School of Aerospace Science and Technology, Xidian University