From: Linus Torvalds
Date: Sun, 25 Dec 2016 13:51:17 -0800
Subject: Re: [PATCH 2/2] mm: add PageWaiters indicating tasks are waiting for a page bit
To: Nicholas Piggin
Cc: Dave Hansen, Bob Peterson, Linux Kernel Mailing List, Steven Whitehouse, Andrew Lutomirski, Andreas Gruenbacher, Peter Zijlstra, linux-mm, Mel Gorman
In-Reply-To: <20161225030030.23219-3-npiggin@gmail.com>
References: <20161225030030.23219-1-npiggin@gmail.com> <20161225030030.23219-3-npiggin@gmail.com>
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Dec 24, 2016 at 7:00 PM, Nicholas Piggin wrote:
> Add a new page flag, PageWaiters, to indicate the page waitqueue has
> tasks waiting. This can be tested rather than testing waitqueue_active
> which requires another cacheline load.

Ok, I applied this one too. I think there's room for improvement, but I
don't think it's going to help to just wait another release cycle and
hope something happens.

Example room for improvement from a profile of unlock_page():

   46.44 │ lock   andb $0xfe,(%rdi)
   34.22 │        mov  (%rdi),%rax

This has the old "do atomic op on a byte, then load the whole word"
issue that we used to have with the nasty zone lookup code too. And it
causes a horrible pipeline hiccup because the load will not forward the
data from the (partial) store.
It's really a misfeature of our asm optimizations of the atomic bit ops.
Using "andb" is slightly smaller, but in this case in particular an
"andq" would be a ton faster, and the mask still fits in an imm8, so
it's not even hugely larger.

But it might also be a good idea to simply use a "cmpxchg" loop here.
That also gives atomicity guarantees that we don't have with the "clear
bit and then load the value" sequence.

Regardless, I think this is worth more people looking at and testing.
And merging it is probably the best way for that to happen.

             Linus