Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp11262804imu; Thu, 6 Dec 2018 14:25:46 -0800 (PST) X-Google-Smtp-Source: AFSGD/U8VCmsg4hV8vvM2nr5Tl9+0MdBsv7a5uUBd2LZSnflT6E45bK97OmU1toEqxI2VN/g0yMp X-Received: by 2002:a17:902:1126:: with SMTP id d35mr28056724pla.1.1544135146869; Thu, 06 Dec 2018 14:25:46 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1544135146; cv=none; d=google.com; s=arc-20160816; b=AyVNMSPFoO16Tuk0uh9b3ntJjY/Hl9MFdo+WqDIUd5nUf/oXJIze3tvw2O7doZACuJ PblDls6NkHfzCjY8HomL+41E8U/kKBeZxhGOHBCDUoJ0Fct9lAfvsu7g6RYmElYcHVNT SO/VTJs7DY8gZ69/cqErQh2oEp6idx7RAlX+JllHkZD7K+mRKMqL2pd8MGd2kI6tvp+3 k+rRdiPHjf7wvOVOAf1n7lpCcMGHzGxLiNX24mnfwhuUvtzp4WjQL0yaI4XBqu9EEoIJ VieWkGawJC5UbA+Rta4WwocUKAUzo6vmvIpVIH/+L2ZBWJNJc19MxwZXvfuyUb9YWx9u Y8lQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=pD0MStUWUjDmjx2XXT2PKm44p4zoXxmjGVXxyeqHEaU=; b=lzXkgkaHpg5cfyblX5BRxIqeaqQ118/7BoG4Apr5I1xas+4/D5UbU8ssX1BWG5Y9VO YEx6s4hC6bYTLeFZxiFZqhzr8mfRju8+mJ4qjuP/hyOV3JarS3la4hUMKZBu5IxVBuXV SQUatVx66QblWRbzGmvfikRy+z2YzhrgDiHnY3Csx5tthE4G12sz/OmRIvYE28ZpTPhp rQ8C+dCtD2kbM2+vKNrVt1CSnqJJG6P5wCVQrxjVC6qsmrsdy+1B9+b572Qkp8modmKB BhPaRkwfgDDL/Rd8s8pJO61vXu0ZkbMrnF1+X9GLyE3xPMnxlvtRbvrWphl6u+jdOKjX KQ7g== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id 204si1353909pfu.273.2018.12.06.14.25.31; Thu, 06 Dec 2018 14:25:46 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726039AbeLFWYq (ORCPT + 99 others); Thu, 6 Dec 2018 17:24:46 -0500 Received: from ipmail06.adl6.internode.on.net ([150.101.137.145]:5840 "EHLO ipmail06.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725935AbeLFWYq (ORCPT ); Thu, 6 Dec 2018 17:24:46 -0500 Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard) ([59.167.129.252]) by ipmail06.adl6.internode.on.net with ESMTP; 07 Dec 2018 08:54:42 +1030 Received: from dave by dastard with local (Exim 4.80) (envelope-from ) id 1gV24W-0000aQ-NX; Fri, 07 Dec 2018 09:24:40 +1100 Date: Fri, 7 Dec 2018 09:24:40 +1100 From: Dave Chinner To: Andrew Morton Cc: Josef Bacik , kernel-team@fb.com, hannes@cmpxchg.org, linux-kernel@vger.kernel.org, tj@kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, riel@redhat.com, jack@suse.cz Subject: Re: [PATCH 0/4][V4] drop the mmap_sem when doing IO in the fault path Message-ID: <20181206222440.GA19305@dastard> References: <20181130195812.19536-1-josef@toxicpanda.com> <20181204144931.03566f7e21615e3c2c1b18e8@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20181204144931.03566f7e21615e3c2c1b18e8@linux-foundation.org> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 04, 2018 at 02:49:31PM -0800, Andrew Morton wrote: > On Fri, 30 Nov 2018 14:58:08 -0500 Josef Bacik wrote: > > > Now that we have proper isolation in place with cgroups2 we have started going > > through and fixing the various priority inversions. Most are all gone now, but > > this one is sort of weird since it's not necessarily a priority inversion that > > happens within the kernel, but rather because of something userspace does. > > > > We have giant applications that we want to protect, and parts of these giant > > applications do things like watch the system state to determine how healthy the > > box is for load balancing and such. This involves running 'ps' or other such > > utilities. These utilities will often walk /proc//whatever, and these > > files can sometimes need to down_read(&task->mmap_sem). Not usually a big deal, > > but we noticed when we are stress testing that sometimes our protected > > application has latency spikes trying to get the mmap_sem for tasks that are in > > lower priority cgroups. > > > > This is because any down_write() on a semaphore essentially turns it into a > > mutex, so even if we currently have it held for reading, any new readers will > > not be allowed on to keep from starving the writer. This is fine, except a > > lower priority task could be stuck doing IO because it has been throttled to the > > point that its IO is taking much longer than normal. But because a higher > > priority group depends on this completing it is now stuck behind lower priority > > work. > > > > In order to avoid this particular priority inversion we want to use the existing > > retry mechanism to stop from holding the mmap_sem at all if we are going to do > > IO. This already exists in the read case sort of, but needed to be extended for > > more than just grabbing the page lock. With io.latency we throttle at > > submit_bio() time, so the readahead stuff can block and even page_cache_read can > > block, so all these paths need to have the mmap_sem dropped. > > > > The other big thing is ->page_mkwrite. btrfs is particularly shitty here > > because we have to reserve space for the dirty page, which can be a very > > expensive operation. We use the same retry method as the read path, and simply > > cache the page and verify the page is still setup properly the next pass through > > ->page_mkwrite(). > > Seems reasonable. I have a few minorish changeloggish comments. > > We're at v4 and no acks have been gathered? I looked at previous versions and had a bunch of questions and change requests. I haven't had time to look at this version yet, but seeing as the page_mkwrite() stuff has been dropped from this version it isn't useful anymore for solving the problem I had in mind when reviewing it originally... What I really want is unconditionally retriable page faults so the filesystem can cause the page fault to be restarted from scratch. We have a requirement for DAX and shared data extents (reflink) to work, and that requires changing the faulted page location during page_mkwrite. i.e. we get a fault on a read mapped shared page, then we have to do a copy-on-write operation to break physical data sharing and so the page with the file data in it physically changes during ->page_mkwrite (because DAX). Hence we need to restart the page fault to map the new page correctly because the file no longer points at the page that was originally faulted. With this stashed-page-and-retry mechanism implemented for ->page_mkwrite, we could stash the new page in the vmf and tell the fault to retry, and everything would just work. Without ->page_mkwrite support, it's just not that interesting and I have higher priority things to deal with right now.... Cheers, Dave. -- Dave Chinner david@fromorbit.com