To: linux-kernel@vger.kernel.org
Cc: linux-btrfs@vger.kernel.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: Re: page allocation stall in kernel 4.9 when copying files from one btrfs hdd to another
Date: Tue, 13 Dec 2016 23:28:57 +0000 (UTC)

David Arendt posted on Tue, 13 Dec 2016 21:26:04 +0100 as excerpted:

> The crash is not an isolated one as I already had this crash multiple
> times with -rc7 and -rc8. It seems only to occur when copying from
> 7200rpm harddisks to 5600rpm ones, and never when copying between two
> 7200rpm or two 5400rpm.

That reads very much like a bug previously reported here and on LKML itself (with Linus and other high-level kernel devs responding) that resulted in a(nother) discussion of whether the defaults for the write-cache knobs in /proc/sys/vm/dirty_* should be updated.

It's generally accepted wisdom among kernel devs and sysadmins[1] that the existing dirty_* write-cache defaults, set at a time when common system memories were measured in MiB, not the GiB of today, are no longer appropriate and should be lowered. But the lack of agreement as to precisely what the settings should be, combined with inertia and the lack of practical pressure, given that those who know about the problem have long since adjusted their own systems accordingly, means the existing defaults, now generally agreed to be inappropriate, remain in place. =:^(

These knobs can be tweaked in several ways. For temporary experimentation, it's generally easiest to write (as root) updated values directly to the /proc/sys/vm/dirty_* files themselves. Once you find values you are comfortable with, most distros have an existing sysctl config[2] that can be altered as appropriate, so the settings get reapplied at each boot.

Various articles with the details are easily googled so I'll be brief here, but here are the apropos settings and comments from my own /etc/sysctl.conf, and a brief explanation:

# write-cache, foreground/background flushing
# vm.dirty_ratio = 10 (% of RAM)
# make it 3% of 16G ~ half a gig
vm.dirty_ratio = 3
# vm.dirty_bytes = 0

# vm.dirty_background_ratio = 5 (% of RAM)
# make it 1% of 16G ~ 160 M
vm.dirty_background_ratio = 1
# vm.dirty_background_bytes = 0

# vm.dirty_expire_centisecs = 2999 (30 sec)
# vm.dirty_writeback_centisecs = 499 (5 sec)
# make it 10 sec
vm.dirty_writeback_centisecs = 1000

The *_bytes and *_ratio files configure the same thing in different ways, ratio being percentage of RAM, bytes being... bytes. Set one or the other as you prefer and the other one will be automatically zeroed out.

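For anyone who would rather script the experimentation than echo values by hand, here is a minimal Python sketch of reading and writing those knob files. The helper names (read_dirty_settings, write_dirty_setting) and the base parameter are my own inventions for illustration, not part of any existing tool; base exists only so the helpers can be tried against a scratch directory before pointing them at the real /proc/sys/vm, which requires root for writes.

```python
from pathlib import Path

# Knob file names as they appear under /proc/sys/vm.
KNOBS = (
    "dirty_ratio",
    "dirty_bytes",
    "dirty_background_ratio",
    "dirty_background_bytes",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def read_dirty_settings(base="/proc/sys/vm"):
    """Return {knob: int} for every knob file that exists under base."""
    settings = {}
    for name in KNOBS:
        path = Path(base) / name
        if path.is_file():
            settings[name] = int(path.read_text().strip())
    return settings


def write_dirty_setting(name, value, base="/proc/sys/vm"):
    """Write one knob (hypothetical helper); needs root on the real /proc."""
    if name not in KNOBS:
        raise ValueError(f"unknown knob: {name}")
    (Path(base) / name).write_text(f"{int(value)}\n")


if __name__ == "__main__":
    # Print the current values in sysctl.conf style.
    for name, value in read_dirty_settings().items():
        print(f"vm.{name} = {value}")
```

Values written this way last only until reboot, so anything worth keeping still belongs in the sysctl config.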
The vm.dirty_background_* settings control when the kernel starts lower-priority flushing, while the high-priority vm.dirty_* (not background) settings control when the kernel forces threads trying to do further writes to wait until some currently in-flight writes are completed.

But those size thresholds only apply until the expiry time is reached, at which point writeback is forced regardless. That's where the dirty_expire_centisecs setting comes in.

The problem is that memory has gotten bigger much faster than the speed of actually writing out to slow spinning rust has increased. (Fast ssds have far fewer issues in this regard, tho slow flash like common USB thumb drives remains affected, indeed, sometimes even more so.) Common random-write spinning rust write speeds are 100 MiB/sec and may be as low as 30 MiB/sec. Meanwhile, the default 10% dirty_ratio, at 16 GiB memory size, approaches[3] 1.6 GiB, ~1600 MiB. At 100 MiB/sec that's 16 seconds worth of writeback to clear. At 30 MiB/sec, that's... well beyond the 30 second expiry time!

To be clear, there's still a bug if the system crashes as a result -- the normal case should simply be a system that at worst doesn't respond for the writeback period. That is a problem in itself when the period reaches double-digit seconds, but surely less of one than a total crash, as long as the system /does/ come back after perhaps half a minute or so.

Anyway, as you can see from the above excerpt from my own sysctl.conf, for my 16 GiB system, I use a much more reasonable 1% background writeback trigger, ~160 MiB on 16 GiB, and 3% high-priority/foreground, ~ half a GiB on 16 GiB. I actually set those long ago, before I switched to btrfs and before I switched to ssd as well, but even tho ssd should work far better with the defaults than spinning rust does, those settings don't hurt on ssd either, and I've seen no reason to change them.

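The arithmetic above is easy to redo for other memory sizes and disk speeds. A minimal Python sketch follows, assuming the threshold is simply a percentage of total RAM (ignoring the small exclusions mentioned in [3]) and that the disk sustains a constant write rate; the function names are hypothetical, chosen only for this illustration.

```python
def dirty_limit_mib(ram_gib, ratio_percent):
    """Approximate dirty threshold in MiB for a given RAM size and ratio."""
    return ram_gib * 1024 * ratio_percent / 100


def flush_seconds(dirty_mib, write_mib_per_sec):
    """Seconds needed to write out dirty_mib at a constant write rate."""
    return dirty_mib / write_mib_per_sec


if __name__ == "__main__":
    ram_gib = 16
    for ratio in (10, 3, 1):          # the 10% default, then my 3% and 1%
        limit = dirty_limit_mib(ram_gib, ratio)
        for speed in (100, 30):       # MiB/sec figures from the text
            secs = flush_seconds(limit, speed)
            print(f"{ratio:>2}% of {ram_gib} GiB = {limit:7.1f} MiB, "
                  f"{secs:5.1f} s at {speed} MiB/sec")
```

With the 10% figure on 16 GiB this gives roughly 16 seconds at 100 MiB/sec and about 55 seconds at 30 MiB/sec, matching the estimates above; plug in your own RAM size and measured write speed to see where your system lands relative to the 30 second expiry.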
So try 1% background and 3% foreground flushing ratios on your 32 GiB system as well, and see if that helps, or possibly try setting the _bytes values instead, since 1% is still quite huge in writeback time terms, on 32 GiB. Tweaking those down on the previously reported bug certainly helped there as he couldn't reproduce after that, and it looks like you're running 2+ GiB dirty based on your posted meminfo now, so it should reduce that, and hopefully eliminate the trigger for you, tho of course it won't fix the root bug. As I said it shouldn't crash in any case, even if it goes unresponsive for half a minute or so at a time, so there's certainly a bug to fix, but that will hopefully let you work without running into it.

Again, you can write the new values direct to the proc interface without rebooting, for experimentation. Once you find values appropriate for you, however, write them to sysctl.conf or whatever your distro uses instead, so they get applied automatically at each boot.

---
[1] Sysadmins: Like me, no claim to dev here, nor am I a professional sysadmin, but arguably I do take the responsibility of adminning my own systems more seriously than most appear to, enough to claim sysadmin as an appropriate descriptor.

[2] Sysctl config. Look in /etc/sysctl.d/* and/or /etc/sysctl.conf, as appropriate to your distro.

[3] Approaches: The memory figure used for calculating this percentage excludes some things so it won't actually reach 10% of total memory. But the exclusions are small enough that they can be hand-waved away for purposes of this discussion.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman