From: Shakeel Butt
Date: Tue, 5 May 2020 06:55:14 -0700
Subject: Re: [PATCH v5 0/4] Charge loop device i/o to issuing cgroup
To: Dave Chinner
Cc: Johannes Weiner, Dan Schatzberg, Jens Axboe, Alexander Viro,
 Jan Kara, Amir Goldstein, Tejun Heo, Li Zefan, Michal Hocko,
 Vladimir Davydov, Andrew Morton, Hugh Dickins, Roman Gushchin,
 Chris Down, Yang Shi, Ingo Molnar, "Peter Zijlstra (Intel)",
 Mathieu Desnoyers, "Kirill A. Shutemov", Andrea Arcangeli,
 Thomas Gleixner, "open list:BLOCK LAYER", open list,
 "open list:FILESYSTEMS (VFS and infrastructure)",
 "open list:CONTROL GROUP (CGROUP)",
 "open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)"
In-Reply-To: <20200505062946.GH2005@dread.disaster.area>
References: <20200428161355.6377-1-schatzberg.dan@gmail.com>
 <20200428214653.GD2005@dread.disaster.area>
 <20200429022732.GA401038@cmpxchg.org>
 <20200505062946.GH2005@dread.disaster.area>

On Mon, May 4, 2020 at 11:30 PM Dave Chinner wrote:
>
> On Tue, Apr 28, 2020 at 10:27:32PM -0400, Johannes Weiner wrote:
> > On Wed, Apr 29, 2020 at 07:47:34AM +1000, Dave Chinner wrote:
> > > On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> > > > This patch series does some minor modification to the loop
> > > > driver so that each cgroup can make forward progress
> > > > independently to avoid this inversion.
> > > >
> > > > With this patch series applied, the above script triggers OOM
> > > > kills when writing through the loop device as expected.
> > >
> > > NACK!
> > >
> > > The IO that is disallowed should fail with ENOMEM or some similar
> > > error, not trigger an OOM kill that shoots some innocent bystander
> > > in the head. That's worse than using BUG() to report errors...
> >
> > Did you actually read the script?
>

Before responding, I want to make it clear that the OOM behavior you
are objecting to is already present in the kernel and is independent
of this patch series. This patch series only connects the charging
links that were missing for the loop device.

> Of course I did. You specifically mean this bit:
>
>   echo 64M > $CGROUP/memory.max;
>   mount -t tmpfs -o size=512m tmpfs /tmp;
>   truncate -s 512m /tmp/backing_file
>   losetup $LOOP_DEV /tmp/backing_file
>   dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
>
> And with this patch set the dd gets OOM killed because the
> /tmp/backing_file usage accounted to the cgroup goes over 64MB and
> so tmpfs OOM kills the IO...
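
For concreteness, the quoted snippet with its elided setup steps
filled in might look like the sketch below. The cgroup v2 mount point,
the cgroup name "loop-test", and the use of "losetup -f" are
illustrative assumptions, not part of the original script:

  #!/bin/bash
  # Sketch of the reproducer with the elided setup made explicit.
  CGROUP=/sys/fs/cgroup/loop-test        # assumed cgroup v2 path
  LOOP_DEV=$(losetup -f)                 # first free loop device

  mkdir -p "$CGROUP"
  echo 64M > "$CGROUP/memory.max"        # cap the cgroup at 64MB
  echo $$ > "$CGROUP/cgroup.procs"       # move this shell into the cgroup

  mount -t tmpfs -o size=512m tmpfs /tmp
  truncate -s 512m /tmp/backing_file
  losetup "$LOOP_DEV" /tmp/backing_file

  # With the series applied, tmpfs pages dirtied through the loop
  # device are charged to this cgroup, so the dd exceeds memory.max
  # and is OOM killed instead of the charge landing in the root cgroup.
  dd if=/dev/zero of="$LOOP_DEV" bs=1M count=256

  grep oom_kill "$CGROUP/memory.events"  # the kill shows up here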
A few queries to better understand your objection:

Let's forget about the cgroup for a second. Suppose I am running this
script on a system/VM having 64 MiB of memory. In your opinion, what
should happen?

Next, let's add swap to the 64 MiB system/VM/cgroup and re-run the
script. What should the correct behavior be?

Next, replace the tmpfs with any other disk-backed filesystem and
re-run the script in a 64 MiB system/VM/cgroup. What should the
correct behavior be?

> As I said: that's very broken behaviour from a storage stack
> perspective.
>
> > It's OOMing because it's creating 256M worth of tmpfs pages inside a
> > 64M cgroup. It's not killing an innocent bystander, it's killing in
> > the cgroup that is allocating all that memory - after Dan makes sure
> > that memory is accounted to its rightful owner.
>
> What this example does is turn /tmp into thinly provisioned storage
> for $CGROUP via a restricted quota. It has a device size of 512MB,
> but only has physical storage capacity of 64MB, as constrained by
> the cgroup memory limit. You're dealing with management of -storage
> devices- and -IO error reporting- here, not memory management.
>
> When thin provisioned storage runs out of space - for whatever
> reason - and it cannot resolve the issue by blocking, then it *must*
> return ENOSPC to the IO submitter to tell it the IO has failed. This
> is no different to if we set a quota on /tmp/backing_file and it is
> exhausted at 64MB of data written - we fail the IO with ENOSPC or
> EDQUOT, depending on which quota we used.
>
> IOWs, we have solid expectations on how block devices report
> unsolvable resource shortages during IO because those errors have to
> be handled correctly by whatever is submitting the IO. We do not use
> the OOM-killer to report or resolve ENOSPC errors.
>
> Indeed, once you've killed the dd, that CGROUP still consumes 64MB
> of tmpfs space in /tmp/backing_file, in which case any further IO to
> $LOOP_DEV is also going to OOM kill. These are horrible semantics
> for reporting errors to userspace.
>
> And if the OOM-killer actually frees the 64MB of data written to
> /tmp/backing_file through the $LOOP_DEV, then you're actually
> corrupting the storage and ensuring that nobody can read the data
> that was written to $LOOP_DEV.
>
> So now let's put a filesystem on $LOOP_DEV in the above example, and
> write out data to the filesystem which does IO to $LOOP_DEV. Just by
> chance, the filesystem does some metadata writes in the context of
> the user process doing the writes (because journalling, etc) and
> that metadata IO is what pushes the loop device over the cgroup's
> memory limit.
>
> You kill the user application even though it wasn't directly
> responsible for going over the 64MB limit of space in $LOOP_DEV.
> What happens now to the filesystem's metadata IO? Did $LOOP_DEV
> return an error, or after the OOM kill did the IO succeed? What
> happens if all the IO being generated from the user application is
> metadata and that starts failing - killing the user application
> doesn't help anything - the filesystem IO is failing and that's
> where the errors need to be reported.
>
> And if the answer is "metadata IO isn't accounted to the $CGROUP"
> then what happens when the tmpfs actually runs out of its 512MB of
> space because of all the metadata the filesystem wrote (e.g. create
> lots of zero length files)? That's an ENOSPC error, and we'll get
> that from $LOOP_DEV just fine.
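
For contrast, the ENOSPC case described above - the tmpfs itself
running out of space - can be sketched with a variant of the same
script, where the 64MB limit is the tmpfs size rather than a cgroup
memory.max. The mount point and device selection are again
illustrative assumptions:

  #!/bin/bash
  # Sketch: same write pattern, but the 64MB limit is the tmpfs size
  # itself, so the shortage is a storage shortage, not a memcg one.
  LOOP_DEV=$(losetup -f)

  mkdir -p /mnt/small                         # assumed mount point
  mount -t tmpfs -o size=64m tmpfs /mnt/small
  truncate -s 512m /mnt/small/backing_file    # sparse 512MB device
  losetup "$LOOP_DEV" /mnt/small/backing_file

  # tmpfs fills up after roughly 64MB has been written through the
  # loop device; the dd should then fail with an IO error ("No space
  # left on device", possibly surfaced as EIO by the block layer)
  # rather than anything being OOM killed.
  dd if=/dev/zero of="$LOOP_DEV" bs=1M count=256
  echo "dd exit status: $?"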
> So why should the same error - running out of tmpfs space vs running
> out of CGROUP quota space on tmpfs - be handled differently? Either
> they are both ENOSPC IO errors, or they are both OOM kill vectors
> because tmpfs space has run out...
>
> See the problem here? We report storage resource shortages
> (whatever the cause) by IO errors, not by killing userspace
> processes. Userspace may be able to handle the IO error sanely; it
> has no warning or choice when you use OOM kill to report ENOSPC
> errors...
>
> > As opposed to before this series, where all this memory isn't
> > accounted properly and goes to the root cgroup - where, ironically,
> > it could cause OOM and kill an actually innocent bystander.
>
> Johannes, I didn't question the need for the functionality. I
> questioned the implementation and pointed out fundamental problems
> it has, as well as the architectural questions raised by needing
> special kthread-based handling for correct accounting of
> cgroup-aware IO.
>
> It's a really bad look to go shoot the messenger when it's clear you
> haven't understood the message that was delivered.
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com