From: Amir Goldstein
Subject: Re: CrashMonkey: A Framework to Systematically Test File-System Crash Consistency
Date: Wed, 16 Aug 2017 23:27:20 +0300
To: Vijay Chidambaram
Cc: Josef Bacik, Ext4, linux-xfs, linux-fsdevel, Linux Btrfs, Ashlie Martinez, kernel-team@fb.com

On Wed, Aug 16, 2017 at 10:06 PM, Vijay Chidambaram wrote:
> Hi Josef,
>
> Thank you for the detailed reply -- I think it provides several
> pointers for our future work. It sounds like we have a similar vision
> for where we want this to go, though we may disagree about how to
> implement this :) This is exciting!
>
> I agree that we should be building off existing work if it is a good
> option. We might end up using log-writes, but for now we see several
> problems:
>
> - The log-writes code is not documented well. As you have mentioned,
> at this point only you know how it works, and we are not seeing much
> adoption of log-writes by other developers either.
>
> - I don't think our requirements exactly match what log-writes
> provides. For example, at some point we want to introduce checkpoints
> so that we can correlate a crash state with the file-system state at
> the time of the crash. We also want to add functionality to guide the
> creation of random crash states (see below). This might require
> changing log-writes significantly. I don't know if that would be a
> good idea.
>
> Regarding random crashes, there is a lot of complexity there that
> log-writes couldn't handle without significant changes. For example,
> just randomly generating crash states and testing each state is
> unlikely to catch bugs. We need a more nuanced way of doing this.
> We plan to add a lot of functionality to CrashMonkey to (a) let the
> user guide crash-state generation and (b) focus on "interesting"
> states (by re-ordering or dropping metadata). All of this will likely
> require adding more sophistication to the kernel module. I don't think
> we want to take log-writes and add a lot of extra functionality.
>
> Regarding logging writes, I think there is a difference in approach
> between log-writes and CrashMonkey. We don't really care about the
> completion order, since the device may re-order the writes after that
> point anyway. Thus, the set of crash states generated by CrashMonkey
> is bounded only by the FUA and FLUSH flags. It sounds as if log-writes
> focuses on a more restricted set of crash states.
>
> CrashMonkey works with the 4.4 kernel, and we will try to keep up
> with kernel changes that break CrashMonkey. CrashMonkey is useless
> without the user-space component, so users will need to compile some
> code anyway. I do not believe it will matter much whether it is
> in-tree or not, as long as it compiles with the latest kernel.
>
> Regarding discard, multi-device support, and application-level crash
> consistency: these are on our road-map too! Our current priority is to
> build enough scaffolding to reproduce a known crash-consistency bug
> (such as the delayed allocation bug of ext4), and then go on and try
> to find new bugs in newer file systems like btrfs.
>
> Adding CrashMonkey into the kernel is not a priority at this point (I
> don't think CrashMonkey is useful enough yet to do so). When
> CrashMonkey becomes useful enough, we will perhaps add the
> device_wrapper as a DM target to enable adoption.
>
> Our hope currently is that developers like Ari will try out
> CrashMonkey in its current form, which will guide us as to what
> functionality to add to CrashMonkey to find bugs more effectively.
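To make the point about crash states being "bounded only by the FUA and FLUSH flags" concrete, here is a minimal Python sketch under a simplified, hypothetical model (the log record layout and the crash_states() helper are illustrative assumptions, not CrashMonkey's or dm-log-writes' actual format or code). Everything written before the last FLUSH is treated as durable; after it, FUA writes are durable and any subset of the remaining writes may have reached the media.

```python
from itertools import combinations
from typing import Dict, List, Tuple, Union

# Hypothetical log record types (NOT CrashMonkey's or dm-log-writes' format).
Write = Tuple[str, int, bytes, bool]  # ("write", block, data, fua)
Flush = Tuple[str]                    # ("flush",)
Op = Union[Write, Flush]


def crash_states(log: List[Op]) -> List[Dict[int, bytes]]:
    """Enumerate distinct disk images if power fails at the end of `log`.

    Simplified model (an assumption): writes before the last FLUSH are
    durable; after it, FUA writes are durable and any subset of the other
    writes may have been persisted. Chosen writes are applied in log order,
    so overwrites of the same block are not re-ordered in this sketch.
    """
    last_flush = -1
    for i, op in enumerate(log):
        if op[0] == "flush":
            last_flush = i

    durable = [op for op in log[:last_flush + 1] if op[0] == "write"]
    epoch = [op for op in log[last_flush + 1:] if op[0] == "write"]
    # Indices (within the epoch) of writes that may or may not be persisted.
    optional = [i for i, op in enumerate(epoch) if not op[3]]

    states: List[Dict[int, bytes]] = []
    seen = set()
    for r in range(len(optional) + 1):
        for chosen in combinations(optional, r):
            picked = set(chosen)
            image: Dict[int, bytes] = {}
            for _, block, data, _fua in durable:
                image[block] = data
            for i, (_, block, data, fua) in enumerate(epoch):
                if fua or i in picked:
                    image[block] = data
            key = tuple(sorted(image.items()))
            if key not in seen:  # drop duplicate images
                seen.add(key)
                states.append(image)
    return states
```

Calling crash_states() on every prefix of a recorded log would give the crash states for every possible crash point; the number of states grows exponentially with the non-FUA writes issued since the last FLUSH, which is one reason a guided, rather than exhaustive or purely random, exploration is attractive.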
>

Vijay,

I can only speak for myself, but I think I represent other filesystem
developers with this response:

- Often with competing projects, the end result is best when project
  members cooperate to combine the best of both projects.

- Some of your project goals (e.g. user-guided crash states) sound
  very intriguing.

- IMO you are severely underestimating the advantages of mainlined
  kernel code for other developers. If you find the dm-log-writes
  target is lacking functionality, it would be MUCH better if you work
  to improve it. Even more, it would be far better if you make sure
  that your userspace tools can also work with the reduced
  functionality in the mainline kernel.

- If you choose to complete your academic research before crossing
  over to the existing code base, that is a reasonable choice for you
  to make, but the reasonable choice for me to make is to try Josef's
  tools from his repo (even if not documented), and *only* if they
  don't meet my needs would I make the extra effort to try out
  CrashMonkey.

- AFAIK the state of filesystem crash consistency testing tools is not
  so bright (except maybe at Facebook ;), so my priority is to get
  *some* automated testing tools in motion.

In any case, I'm glad this discussion started and I hope it will
expedite the adoption of crash testing tools.

I wish you all the best with your project.

Amir.