Received: by 10.192.165.148 with SMTP id m20csp2262261imm; Sat, 28 Apr 2018 17:01:05 -0700 (PDT) X-Google-Smtp-Source: AB8JxZoZ4WF3qeXtXMcgzlBlI+BMSSRphWqakwDHVi7Nkt0JAAiqdX48pRH+BO+gwSrXZ98XaU/v X-Received: by 2002:a17:902:b60a:: with SMTP id b10-v6mr7319514pls.221.1524960065257; Sat, 28 Apr 2018 17:01:05 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1524960065; cv=none; d=google.com; s=arc-20160816; b=EcV8KEQjalgmlKo39d1IKRetYTffFoHyDrGXqiYQNRnJBM8QpMkFGtj8VoOsh2VNls KmeDM2Ctu3nFY515pUNxUjePRyqUlXTQmyqaC7Zd7I+fZBaXCiJW60JT5YSzkDqJjzVv rL0DDZnCzaBXBDH8umSeMaEoaHqEzEqqrsn42Hr2eJk+QxgKeglwjuQnGhWF+/tyKi+Q zj2kHP/1uvWVzfZpZ/gnLqZrX+qi1auoEwk/ohFfEPTAMno9XVY1iOswnIFO6mnrcc35 +ABPCGeROqBmBCEUKBdiS0dHcXknS8kwDdXvTTEg3iaOYtbW+lmqEuhK96VL0tOslSfE U3vg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:cc:to:subject :message-id:date:from:references:in-reply-to:mime-version :dkim-signature:arc-authentication-results; bh=F2QSn0KvnyIfahWtM1xrMBS+SOPX8CHmk0dOFfYO+Xw=; b=D52gyIRPKij3KycJ8LJ53JfvSDAtPDgvhUWawJf2Mm9CiYm3jAxNH7WigmnWMsQ0A5 aY3BN0UZ3I8o/CY7DRJEHlkcPy+bQ/5ttbSDG65UPn3Zjcusz7fJDVNYCL92AOMOSg65 +kKa5DetGJV0Qxbd1GuE5Zc3bbGV/qaRm5kCCz0ezeJHB6Z29Iwx6MKYPQR5go1CaqcZ A2Uz/rkviE8qS7pWDMj7VfZ75WYWG6TTu/fy18UBEAH0bTZtFlnEctd/SJCpzzqh5cfF ZW8OroUX4NR9qtLtBzh/Yy6QRtKt01lavBY6PRQwjcLbZXE1NVAJwA7UKV/7453pQXTr ZU0Q== ARC-Authentication-Results: i=1; mx.google.com; dkim=fail header.i=@gmail.com header.s=20161025 header.b=rseB63c4; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id j67-v6si3827815pgc.509.2018.04.28.17.00.05; Sat, 28 Apr 2018 17:01:05 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=fail header.i=@gmail.com header.s=20161025 header.b=rseB63c4; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752695AbeD1X4X (ORCPT + 99 others); Sat, 28 Apr 2018 19:56:23 -0400 Received: from mail-qt0-f194.google.com ([209.85.216.194]:35203 "EHLO mail-qt0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752205AbeD1X4W (ORCPT ); Sat, 28 Apr 2018 19:56:22 -0400 Received: by mail-qt0-f194.google.com with SMTP id s2-v6so6998894qti.2 for ; Sat, 28 Apr 2018 16:56:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-transfer-encoding; bh=F2QSn0KvnyIfahWtM1xrMBS+SOPX8CHmk0dOFfYO+Xw=; b=rseB63c48FBLCMoXUk7MlgC5hlbOfFvCRG17A673Q5XHN4VrOusKnprYh2qFiq6ZWi NgBqMihC1Nh5tnDXBs57H1CyghdqrDdGOgGC2VKxyCIOLfKw9UEp642P2tg2Bh/olQtE OmQbqhePGr1WqB27RpCg3J9WF6yoVA4u+8UNebROEXydmKsA4mCISmbWjeryxwQXtp6y 8sucYnnrVMHkJtRyHPvG/NZ388jPi3pg8RxfzDc/t5qOUKQb3SIUstLK4IB0QIKOW8eU mJVtvHNT6q6pGFlAD361lAFdHfG2Z+H4SYRzrbSbPk0iYLtLwWI0Q7Dz7ER4oFvt8wvy E0sg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc:content-transfer-encoding; bh=F2QSn0KvnyIfahWtM1xrMBS+SOPX8CHmk0dOFfYO+Xw=; b=PC7AtHqCMQQyBb2Z1Zy2Of1LFUB5bNltV7onchuyrXoCMfya0UUSeYA40rOsL/duZa YZvBHaG+iUkcKO1RTS7kuCHLhFQRJUdW+Z4lAOc6ANHwQl0x/kD08KK3y/Vp9DHLUQsr YeSjqTwISBDFCW8kap0gz41geJFkTH7V6SFMhLzqt9gcTu+9WHvjUkuMSxAPjbGaV/+J q8ut+G7U9z5lopYIwwNL9txC3jz3p/LD1DKeDMUI3URT8AwwjuRrn/eZkM9BHrS1/DRm wx12gGG3hFEtWBlDEb50oSKeuFIj5DVYLaBQhm+ecScO5hK01jW54cCqgrl+ToQpXmuI kXkA== X-Gm-Message-State: ALQs6tCM4gwjhNEWEG/b+jaony1pUgp1Z+aC9WmvbYnYvUuSBDoXfJTd sxCcT9TmPeYmqGoBw3zhV5X2PrwAn2FwDhT/Q1c= X-Received: by 2002:aed:250d:: with SMTP id v13-v6mr7199583qtc.76.1524959781224; Sat, 28 Apr 2018 16:56:21 -0700 (PDT) MIME-Version: 1.0 Received: by 10.200.56.243 with HTTP; Sat, 28 Apr 2018 16:56:20 -0700 (PDT) In-Reply-To: References: <20180426150618.13470-1-michel@daenzer.net> <20180427130811.7642-1-michel@daenzer.net> From: Ilia Mirkin Date: Sat, 28 Apr 2018 19:56:20 -0400 X-Google-Sender-Auth: PHtCw-2-keSWbfIcZyZn42y1yc0 Message-ID: Subject: Re: [PATCH v2 1/2] drm/ttm: Only allocate huge pages with new flag TTM_PAGE_FLAG_TRANSHUGE To: =?UTF-8?Q?Michel_D=C3=A4nzer?= Cc: dri-devel , =?UTF-8?Q?Christian_K=C3=B6nig?= , amd-gfx mailing list , LKML Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Apr 28, 2018 at 7:02 PM, Michel D=C3=A4nzer wr= ote: > On 2018-04-28 06:30 PM, Ilia Mirkin wrote: >> On Fri, Apr 27, 2018 at 9:08 AM, Michel D=C3=A4nzer = wrote: >>> From: Michel D=C3=A4nzer >>> >>> Previously, TTM would always (with CONFIG_TRANSPARENT_HUGEPAGE enabled) >>> try to allocate huge pages. However, not all drivers can take advantage >>> of huge pages, but they would incur the overhead for allocating and >>> freeing them anyway. >>> >>> Now, drivers which can take advantage of huge pages need to set the new >>> flag TTM_PAGE_FLAG_TRANSHUGE to get them. Drivers not setting this flag >>> no longer incur any overhead for allocating or freeing huge pages. >>> >>> v2: >>> * Also guard swapping of consecutive pages in ttm_get_pages >>> * Reword commit log, hopefully clearer now >>> >>> Cc: stable@vger.kernel.org >>> Signed-off-by: Michel D=C3=A4nzer >> >> Both I and lots of other people, based on reports, are still seeing >> plenty of issues with this as late as 4.16.4. > > "lots of other people", "plenty of issues" sounds a bit exaggerated from > what I've seen. FWIW, while I did see the original messages myself, I > haven't seen any since Christian's original fix (see below), neither > with amdgpu nor radeon, even before this patch you followed up to. Probably a half-dozen reports of it with nouveau, in addition to another bunch of people talking about it on the bug you mention below, along with email threads on dri-devel. I figured I didn't have to raise my own since it was identical to the others, and, I assumed, was being handled. >> Admittedly I'm on nouveau, but others have reported issues with >> radeon/amdgpu as well. It's been going on since the feature was merged >> in v4.15, with what seems like little investigation from the authors >> introducing the feature. > > That's not a fair assessment. See > https://bugs.freedesktop.org/show_bug.cgi?id=3D104082#c40 and following > comments. > > Christian fixed the original issue in > d0bc0c2a31c95002d37c3cc511ffdcab851b3256 "swiotlb: suppress warning when > __GFP_NOWARN is set". Christian did his best to try and get the fix in > before 4.15 final, but for reasons beyond his control, it was delayed > until 4.16-rc1 and then backported to 4.15.5. In case it's unclear, let me state this explicitly -- I totally get that despite best intentions, bugs get introduced. I do it myself. What I'm having trouble with is the handling once the issue is discovered. > > Unfortunately, there was an swiotlb regression (not directly related to > Christian's work) shortly after this fix, also in 4.16-rc1, which is now > fixed in 4.17-rc1 and will be backported to 4.16.y. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/com= mit/?h=3Dv4.16.5&id=3D2c9dacf5bfe1e45d96dfe97cb71d2b717786a7b9 This guy? Didn't help. I'm running 4.16.4 right now. > It looks like there's at least one more bug left, but it's not clear yet > when that was introduced, whether it's directly related to Christian's > work, or indeed what the impact is. Let's not get ahead of ourselves. Whether it is directly related to that work or not, the issue persists. There are two options: - When declaring things fixed, no serious attempt was actually made at reproducing the underlying issues. - The authors truly can't reproduce the underlying issues users are seeing and are taking stabs in the dark. Given that a number of people are reporting problems, in either scenario, the reasonable thing is to disable the feature, and figure out what is going on. Maybe condition it on !CONFIG_SWIOTLB. >> We now have *two* broken releases, v4.15 and v4.16 (anything that >> spews error messages and stack traces ad-infinitum in dmesg is, by >> definition, broken). > > I haven't seen any evidence that there's still an issue in 4.15, is > there any? Well, I did have a late 4.15 rc kernel in addition to the 'suppress warning' patch. Now I'm questioning my memory of whether the issue was resolved there or not. I'm pretty sure that 'not', but no longer 100%. Either way, I think we all agree v4.15 was broken and more importantly was *known* to be broken well in advance of the release. A reasonable option would have been to disable the feature until the other bits fell into place. >> You're putting this behind a flag now (finally), > > I wrote this patch because I realized due to some remark I happened to > see you make this week on IRC that the huge page support in TTM was > enabled for all drivers. Instead of making that kind of remark on IRC, > it would have been more constructive, and more conducive to quick > implementation, to suggest making the feature not active for drivers > which don't need it in a mailing list post. I see IRC as a much faster and direct way of reaching the authors and/or people who need to know an issue. As there was already a bug filed about it and the issue was known about, I didn't really see a reason to pile on (until now). > At least, please do more research before making this kind of negative > post. Every time I've reported it, I've been told that patch X definitely solves the issue. There was the original fix from Christian which went into v4.15 and I'm pretty sure didn't fix it (I had it applied to a 4.15 tree), and then the later issue with the swiotlb logic inversion bug which also didn't fix it. It's entirely possible that the true issue lies not in the code that was written as part of this feature enablement but rather existing code in handling of thp. But the end result is that I have a broken kernel. As a user who is not in a position to debug this directly due to lack of time and knowledge, my options are limited. This issue hasn't gotten a ton of visibility since it's waved away every time as "fixed", so I'm trying to turn up the heat a little bit to cause a fix or revert to happen. I believe the policy in Linux is "no regressions", unlike many other graphics components where regressions are welcome as long as they're downstream components of where the change is made. > P.S. You might also want to look into whether nouveau really should be > hitting swiotlb in these cases. I don't have a strong enough concept of what the swiotlb does and when it's needed. Hopefully someone that does will take a look. -ilia