Received: by 2002:a05:6a10:a0d1:0:0:0:0 with SMTP id j17csp946404pxa; Sat, 1 Aug 2020 11:13:50 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwoVWcGf5/6GLGLRGJzj5PMiX39zr8/ctfnj9svUFBvHH5NYoJ+iekI5YzG5NYrOW+1mF/f X-Received: by 2002:a17:906:a050:: with SMTP id bg16mr9276947ejb.21.1596305630304; Sat, 01 Aug 2020 11:13:50 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1596305630; cv=none; d=google.com; s=arc-20160816; b=wpqUh5USwBMqxTelu2qVQIxBm1wv8PPi3n47mVm7yIb5sA78Sx9/o09cZujG3quIQ0 LIakDS+h5Zm1zZelOSu6UV5ZBY0/7YACOd76P+PROxjlOwwxXjfdSMPvN/kOa2dqRDyC e7yTX7G0TjYoQCcpqTr8FbM6gZZJ04yz8IuckLubEILaEuoxSvQ7nupdLPSTNq3/VrFw pX1xGf5BYSFjNbGxUWsI+1AVT35/gSmKU5GYXILM6jrPpoaOslQj4tSecS4pqF9TwDsz XY6HSB8NY88thy+77fK6wZnRNee2fxjCGlMgYRjT9CyMD5meKvrQyrt3WpDLhfQIxOh7 xLYg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:references:cc:to:from:subject; bh=jtYQoaVMi0NFavwykk3I/yx+SA5Mo6QW3BqyPHAbSdY=; b=jPXGKgK1PuqMTOkBGA92DwYybt9rY8Vi2JjxE4cOB7g1hHuvwARGsFtfnUNeD963Tc dkUgWgjTtshF7tHD5DjEUIPnyRkvvkOXw17U2n2N4P84nB4u1mtogp1xsxikCo4sSvZm Al/ia02LZTmHUWdz0n8oY7ESfz26QOGdBwn7QaZlQNESZ0TsJ3QkcYNgw+d3GccVshuD wH3phSLQ/nO0jPdBq7WaVlm+Y9L3vyvxLNihw5y8i3faJt6PlSdispSH+7gzKSlHYzim sXxYwzOhmkFetpyF24swgjPXDQA1Z0PrbAaKB7h6XwEovs07kaZeCsKhO/d00nSPaVnz iJQg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id n27si2742896eji.463.2020.08.01.11.13.27; Sat, 01 Aug 2020 11:13:50 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727109AbgHASK1 (ORCPT + 99 others); Sat, 1 Aug 2020 14:10:27 -0400 Received: from mx3.molgen.mpg.de ([141.14.17.11]:43661 "EHLO mx1.molgen.mpg.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726901AbgHASK1 (ORCPT ); Sat, 1 Aug 2020 14:10:27 -0400 Received: from [192.168.0.4] (ip5f5aeff1.dynamic.kabel-deutschland.de [95.90.239.241]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: buczek) by mx.molgen.mpg.de (Postfix) with ESMTPSA id 67A172064608E; Sat, 1 Aug 2020 20:10:23 +0200 (CEST) Subject: Re: xfs_reclaim_inodes_ag taking several seconds From: Donald Buczek To: Dave Chinner Cc: darrick.wong@oracle.com, linux-xfs@vger.kernel.org, LKML References: <8284912e-b99a-31af-1901-a38ea03b8648@molgen.mpg.de> <20200731223255.GG2005@dread.disaster.area> Message-ID: <0f0416a5-2af8-b4be-d53e-6af61082424a@molgen.mpg.de> Date: Sat, 1 Aug 2020 20:10:22 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01.08.20 12:25, Donald Buczek wrote: > So if I understand you correctly, this is expected behavior with this kind of load and conceptual changes are already scheduled for kernel 5.9. I don't understand most of it, but isn't it true that with that planned changes the impact might be better limited to the filesystem, so that the performance of other areas of the system might improve? I'd love to test that with our load, but I don't want to risk our backup data and it would be difficult to produce the same load on a toy system. The patch set is not yet ready to be tested on production data, is it? > > So I guess I'll try to put the backup processes into one or more cgroups to limit the memory available for their fs caches and leave some room for unrelated (maintenance) processes. I hope, that makes sense. Which it doesn't, because it totally ignores what was said before. The affected processes were not waiting for memory but for the shrinker to finish. D. > > Thank you both four your analysis! > > Donald > >> And if you have 30 million inodes in memory, and lots of them are >> dirty, and the shrinkers are running, then they will be doing >> dirty inode writeback to throttle memory reclaim to >> ensure it makes progress and doesn't declare OOM and kill processes >> permaturely. >> >> You have spinning disks, RAID6. I'm betting that it can only clean a >> couple of hundred inodes a second because RAID6 is severely IOP >> limited for small writes (like inode clusters). And when you many, >> many thousands (maybe millions) of dirty inodes, anything that has >> to wait on inode writeback is going to be waiting for some time... >> >>> root:done:/home/buczek/linux_problems/shrinker_semaphore/# xfs_info /amd/done/C/C8024 >>> meta-data=/dev/md0               isize=512    agcount=102, agsize=268435328 blks >>>           =                       sectsz=4096  attr=2, projid32bit=1 >>>           =                       crc=1        finobt=1, sparse=1, rmapbt=0 >>>           =                       reflink=0 >>> data     =                       bsize=4096   blocks=27348629504, imaxpct=1 >>>           =                       sunit=128    swidth=1792 blks >>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1 >>> log      =internal log           bsize=4096   blocks=521728, version=2 >> >> And full size journals, so the filesystem can hold an awful lot of >> active dirty inodes in memory before it starts throttling on a full >> journal (think millions of dirty inodes per filesystem)... >> >> So, yeah, this is the classic "in memory operation is orders of >> magnitude faster than disk operation" and it all comes crashing down >> when something needs to wait for inodes to be written back. The >> patchset Darrick pointed you at should fix the shrinker issue, but >> it's likely that this will just push the problem to the next >> throttling point, which is the journal filling up. >> >> IOWs, I suspect fixing your shrinker problem is only going to make >> the overload of dirty inodes in the system behave worse, because >> running out of journal space cause *all modifications* to the >> filesystem to start taking significant delays while they wait for >> inode writeback to free journal space, not just have things >> trying to register/unregister shrinkers take delays... >> >> Cheers, >> >> Dave. >> > -- Donald Buczek buczek@molgen.mpg.de Tel: +49 30 8413 1433