Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755223Ab1CGEdw (ORCPT ); Sun, 6 Mar 2011 23:33:52 -0500 Received: from g5t0006.atlanta.hp.com ([15.192.0.43]:48061 "EHLO g5t0006.atlanta.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753934Ab1CGEdu (ORCPT ); Sun, 6 Mar 2011 23:33:50 -0500 X-Greylist: delayed 654 seconds by postgrey-1.27 at vger.kernel.org; Sun, 06 Mar 2011 23:33:50 EST Subject: Re: [PATCH] Fix over-zealous flush_disk when changing device size. From: Andrew Patterson To: NeilBrown Cc: Christoph Hellwig , Jens Axboe , linux-raid@vger.kernel.org, dm-devel@redhat.com, linux-kernel@vger.kernel.org, James.Bottomley@suse.de In-Reply-To: <20110306174755.49404c8e@notabene.brown> References: <20110217165057.5c50e566@notabene.brown> <20110303143120.GA8134@infradead.org> <20110304111624.4be27aaf@notabene.brown> <1299259506.2118.24.camel@grinch> <20110306174755.49404c8e@notabene.brown> Content-Type: text/plain; charset="UTF-8" Date: Sun, 06 Mar 2011 21:22:51 -0700 Message-ID: <1299471771.2228.11.camel@grinch> Mime-Version: 1.0 X-Mailer: Evolution 2.30.3 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 9086 Lines: 200 On Sun, 2011-03-06 at 17:47 +1100, NeilBrown wrote: > On Fri, 04 Mar 2011 10:25:06 -0700 Andrew Patterson > wrote: > > > On Fri, 2011-03-04 at 11:16 +1100, NeilBrown wrote: > > > On Thu, 3 Mar 2011 09:31:20 -0500 Christoph Hellwig wrote: > > > > > > > On Thu, Feb 17, 2011 at 04:50:57PM +1100, NeilBrown wrote: > > > > > > > > > > Hi Andrew (and others) > > > > > I wonder if you would review the following for me and comment. > > > > > > > > Please send think in this area through -fsdevel next time, thanks! > > > > > > Will try to remember - it is sometimes hard to get this sort of patch before > > > the right audience ... I thought "block layer" rather than "file systems" :-( > > > > > > Thanks for finding it anyway. > > > > > > > > > > > > There are two cases when we call flush_disk. > > > > > In one, the device has disappeared (check_disk_change) so any > > > > > data will hold becomes irrelevant. > > > > > In the oter, the device has changed size (check_disk_size_change) > > > > > so data we hold may be irrelevant. > > > > > > > > > > In both cases it makes sense to discard any 'clean' buffers, > > > > > so they will be read back from the device if needed. > > > > > > > > Does it? If the device has disappeared we can't read them back anyway. > > > > > > I think that is the point - return an error rather than stale data. > > > > > > > If the device has resized to a smaller size the same is true about > > > > those buffers that have gone away, and if it has resized to a larger > > > > size invalidating anything doesn't make sense at all. I think this > > > > area needs more love than a quick kill_dirty hackjob. > > > > > > I tend to agree. I wasn't entirely convinced by the changelog comments on > > > the original offending patch, but I couldn't convince myself there was no > > > justification either, and I wanted to fix the corruption I saw - while close > > > to the end of a release cycle - without introducing any new regressions. > > > > > > > > > > > > In the former case it makes sense to discard 'dirty' buffers > > > > > as there will never be anywhere safe to write the data. In the > > > > > second case it *does*not* make sense to discard dirty buffers > > > > > as that will lead to file system corruption when you simply enlarge > > > > > the containing devices. > > > > > > > > Doing anything like this at the buffer cache layer or inode cache layer > > > > doesn't make any sense. If a device goes away or shrinks below the > > > > filesystem size the filesystem simply needs to be shut down and in te > > > > former size the admin needs to start a manual repair. Trying to do > > > > any botch jobs in lower layer never works in practice. > > > > > > Amen. > > > What I personally would really like to see is an interface for the block > > > device to say to the filesystem (or more specifically: whatever has bdclaimed > > > it) "I am about to resize to $X - is that OK?" and also "I have resized - > > > deal with it". > > > > > > > > > > > For now I think the best short term fix is to simply revert commit > > > > 608aeef17a91747d6303de4df5e2c2e6899a95e8 > > > > > > > > "Call flush_disk() after detecting an online resize." > > > > > > You may be right, but I suspect that Andrew Patterson had a real issue to > > > solve which lead to submitting it, and I'd really like to understand that > > > issue before I would feel confident just reverting it. > > > > > > Andrew: are you out there? Can you provide some background for your patch? > > > > I put in the flush disk stuff at the suggestion of James Bottomley. In > > fact the text for the justification in 608aeef17a91747d6303 is mostly > > his. The idea is to get errors reported immediately rather than waiting > > around for them to eventually get flushed and to make sure stale data is > > not kept around. Certainly, at a minimum, not keeping stale data around > > seems valuable to me. > > > > What parts of the original justification did you think are unconvincing? > > Note that the flush for growing the device is really only there for the > > degenerate case where someone might shrink then grow a device > > (admittedly, the user probably deserves to get data corruption/security > > holes in such a case). > > hi Andrew. > > One of the things that I didn't like about the change log is that it didn't > give clear context - what exactly is the problem it is trying to fix? Well true, but the rest of the patch set might have given the context. > When you talk about "disks" changing size (reduced radius:-?) I think first > of 'dm' and 'md' - yet it clearly isn't dm related as dm doesn't even use > that code, and if it was 'md' related I would have thought I would have > heard about it.... > So presumably these are some SCSI-attached devices that do internal volume > management and can change the size of .... targets? LUNs? something like > that. Yes. The idea is that you want to increase/decrease the size of the underlying block device (usually a FC or iSCSI LUN). Short of a reboot, the OS will not detect the size changed. I added some code that rescans the LUN and gets the new size and report the new size to the block layer. > So can these things change size without the SCSI layer immediately knowing > about it?? Don't you get some sort of "unit attention" or something? > In some cases with some hardware yes. In most, no. > The idea that the device might reduce in size and then grow again seems just > plain wrong. This would usually be due to some user error. The user may have shrunk the device too much, realized there error and then corrected it. I admit that it is pretty far-fetched. > If that is possible it could return to it's original size and > you would never notice? Yes. > And if there are cases that you know you will never > notice, then it seems like it is the wrong solution. > It would have been more credible if it *only* tried to flush when the size > was reduced ... which you do suggest as a possibility above I think. > > > The "potential security hole" seems completely bogus. > If you give me permission to read something, and I do, then you remove that > permission, the fact that I still remember what I read is not a security > hole. It is a natural fact of life I think the case here might be where you add a LUN onto the space where the previous LUN space existed. Again, not likely. > > As for the two cases: I would describe them: > 1. planned. The fs is already shrunk to within the new boundary so there > is nothing to be gained by flushing, and we could have to reload a > pile of data from storage which would be pointless This is the security case. > 2. unplanned. The fs is probably toast, so whether we invalidate or not > isn't going to make a whole lot of difference. Agreed. > > So I would vote for not flushing. > > The "not keeping stale data around" argument ... the only data that is > clearly stale is data that is in the block-device cache and beyond the new > EOF. Most filesystems keep most data in the page cache rather than the > block device cache so this would just be some metadata - maybe inode tables > or block usage bitmaps or similar. And caches regularly keep data around > that isn't actually going to be used - so keeping a bit more on the rare > occasion of a block-device-resize doesn't seem like an important cost. > > Getting errors a little bit earlier in the case of an unplanned shrink is > possibly a credible argument. This would be read errors only of course - > write errors would still arrive at the same time. I'm not sure it would > really be very much earlier...maybe a bit in some cases. > > On the whole, the arguments both for and against this change - in principle > - seem rather weak: "maybe" and "possibly" rather than something concrete. > > So given that (due to an oversight) it actually causes filesystem > corruption, I tend to agree with Christoph that the best approach at this > stage is the revert the original patch, and then review all the related code > and come up with a "correct" approach. > > > And just to clarify: when I first found that description "unconvincing" I > hadn't thought through all of these issues - I think it was just that the > justification seems vague rather than concrete, so it was hard to reason > about it. > > > Would you be uncomfortable if I asked Linus to revert both my fix and your > original patch?? James Bottomley wanted me to put this functionality in. I have no problem with reverting it myself, especially if it is causing other problems. I would have to say that you need to ask him (or rather, I am not qualified to render an opinion here). Andrew -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/