Date: Tue, 5 May 2020 16:41:14 +1000
From: Dave Chinner
To: Jan Kara
Cc: Dan Schatzberg, Jens Axboe, Alexander Viro, Amir Goldstein,
	Tejun Heo, Li Zefan, Johannes Weiner, Michal Hocko,
	Vladimir Davydov, Andrew Morton, Hugh Dickins, Roman Gushchin,
	Shakeel Butt, Chris Down, Yang Shi, Ingo Molnar,
	"Peter Zijlstra (Intel)", Mathieu Desnoyers, "Kirill A. Shutemov",
Shutemov" , Andrea Arcangeli , Thomas Gleixner , "open list:BLOCK LAYER" , open list , "open list:FILESYSTEMS (VFS and infrastructure)" , "open list:CONTROL GROUP (CGROUP)" , "open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)" Subject: Re: [PATCH v5 0/4] Charge loop device i/o to issuing cgroup Message-ID: <20200505064114.GI2005@dread.disaster.area> References: <20200428161355.6377-1-schatzberg.dan@gmail.com> <20200428214653.GD2005@dread.disaster.area> <20200429102540.GA12716@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200429102540.GA12716@quack2.suse.cz> User-Agent: Mutt/1.10.1 (2018-07-13) X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.3 cv=QIgWuTDL c=1 sm=1 tr=0 a=ONQRW0k9raierNYdzxQi9Q==:117 a=ONQRW0k9raierNYdzxQi9Q==:17 a=kj9zAlcOel0A:10 a=sTwFKg_x9MkA:10 a=7-415B0cAAAA:8 a=rN2gyzmqWffP1ZJC7qsA:9 a=cE_fTOozNYi_-d1b:21 a=OBp7IKyJAvioFFSQ:21 a=CjuIK1q_8ugA:10 a=biEYGPWJfzWAr4FL6Ov7:22 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Apr 29, 2020 at 12:25:40PM +0200, Jan Kara wrote: > On Wed 29-04-20 07:47:34, Dave Chinner wrote: > > On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote: > > > The loop device runs all i/o to the backing file on a separate kworker > > > thread which results in all i/o being charged to the root cgroup. This > > > allows a loop device to be used to trivially bypass resource limits > > > and other policy. This patch series fixes this gap in accounting. > > > > How is this specific to the loop device? Isn't every block device > > that offloads work to a kthread or single worker thread susceptible > > to the same "exploit"? > > > > Or is the problem simply that the loop worker thread is simply not > > taking the IO's associated cgroup and submitting the IO with that > > cgroup associated with it? That seems kinda simple to fix.... > > > > > Naively charging cgroups could result in priority inversions through > > > the single kworker thread in the case where multiple cgroups are > > > reading/writing to the same loop device. > > > > And that's where all the complexity and serialisation comes from, > > right? > > > > So, again: how is this unique to the loop device? Other block > > devices also offload IO to kthreads to do blocking work and IO > > submission to lower layers. Hence this seems to me like a generic > > "block device does IO submission from different task" issue that > > should be handled by generic infrastructure and not need to be > > reimplemented multiple times in every block device driver that > > offloads work to other threads... > > Yeah, I was thinking about the same when reading the patch series > description. We already have some cgroup workarounds for btrfs kthreads if > I remember correctly, we have cgroup handling for flush workers, now we are > adding cgroup handling for loopback device workers, and soon I'd expect > someone comes with a need for DM/MD worker processes and IMHO it's getting > out of hands because the complexity spreads through the kernel with every > subsystem comming with slightly different solution to the problem and also > the number of kthreads gets multiplied by the number of cgroups. So I > agree some generic solution how to approach IO throttling of kthreads / > workers would be desirable. Yup, that's pretty much what I was thinking: it's yet another special snowflake for cgroup-aware IO.... 
> OTOH I don't have a great idea of what the generic infrastructure
> should look like...

I haven't given it any thought - it's not something I have any
bandwidth to spend time on. I'll happily review a unified, generic,
cgroup-aware, kthread-based IO dispatch mechanism, but I don't have
the time to design and implement that myself....

OTOH, I will make time to stop people screwing up filesystems and
block devices with questionable complexity and unique,
storage-device-dependent, userspace-visible error behaviour. This
sort of change is objectively worse for users than not supporting
the functionality in the first place.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com