Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755896AbZADPVI (ORCPT ); Sun, 4 Jan 2009 10:21:08 -0500
Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751441AbZADPUv (ORCPT ); Sun, 4 Jan 2009 10:20:51 -0500
Received: from gw-ca.panasas.com ([66.104.249.162]:14357 "EHLO laguna.int.panasas.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751230AbZADPUu (ORCPT ); Sun, 4 Jan 2009 10:20:50 -0500
Message-ID: <4960D3CA.2000202@panasas.com>
Date: Sun, 04 Jan 2009 17:20:42 +0200
From: Boaz Harrosh
User-Agent: Thunderbird/3.0a2 (X11; 2008072418)
MIME-Version: 1.0
To: James Bottomley, Matthew Wilcox, Benny Halevy, Jeff Garzik
CC: Andrew Morton, Al Viro, Avishay Traeger, open-osd development, linux-scsi, linux-kernel, linux-fsdevel
Subject: Re: [PATCH 7/9] exofs: mkexofs
References: <4947BFAA.4030208@panasas.com> <4947CA5C.50104@panasas.com> <20081229121423.efde9d06.akpm@linux-foundation.org> <495B8D90.1090004@panasas.com> <1230739053.3408.74.camel@localhost.localdomain>
In-Reply-To: <1230739053.3408.74.camel@localhost.localdomain>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-OriginalArrivalTime: 04 Jan 2009 15:20:42.0651 (UTC) FILETIME=[021416B0:01C96E80]
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 7769
Lines: 156

James Bottomley wrote:
> On Wed, 2008-12-31 at 17:19 +0200, Boaz Harrosh wrote:
>> Andrew Morton wrote:
>>> On Tue, 16 Dec 2008 17:33:48 +0200
>>> Boaz Harrosh wrote:
>>>
>>>> We need a mechanism to prepare the file system (mkfs).
>>>> I chose to implement that by means of a couple of
>>>> mount-options. Because there is no user-mode API for committing
>>>> OSD commands. And also, all this stuff is highly internal to
>>>> the file system itself.
>>>>
>>>> - Added two mount options mkfs=0/1,format=capacity_in_meg, so mkfs/format
>>>> can be executed by kernel code just before mount.
>>>> An mkexofs utility can now be implemented by means of a script that
>>>> mounts and unmount the file system with proper options.
>>> Doing mkfs in-kernel is unusual. I don't think the above description
>>> sufficiently helps the uninitiated understand why mkfs cannot be done
>>> in userspace as usual. Please flesh it out a bit.
>> There are a few main reasons.
>> - There is no user-mode API for initiating OSD commands. Such a subsystem
>> would be hundredfold bigger then the mkfs code submitted. I think it would be
>> hard and stupid to maintain a complex user-mode API just for creating
>> a couple of objects and writing a couple of on disk structures.
>
> This is really a reflection of the whole problem with the OSD paradigm.

Certainly not a problem of the OSD paradigm, just maybe a problem of the
current code boundaries laid out by years of block devices.

> In theory, a filesystem on OSD is a thin layer of metadata mapping
> objects to files. Get this right and the storage will manage things,
- objects to files. Get this right and the storage will manage things,
+ files to objects. Get this right and the storage will manage things,

[objects to files is what some of the osd-targets do.]

> like security and access and attributes (there's even a natural mapping
> to the VFS concept of extended attributes). Plus, the storage has
> enough information to manage persistence, backups and replication.
>

Sounds perfect to me.

> The real problem is that no-one has actually managed to come up with a
> useful VFS<->OSD mapping layer (even by extending or altering the VFS).
> Every filesystem that currently uses OSD has a separate direct OSD
> speaking interface (i.e. it slices out the block layer to do this and
> talks directly to the storage).

I'm not sure what you mean. Let's take the VFS<->BLOCKS mapping for example.
Each FS has its own interpretation of what that means; is btrfs less perfect
than xfs, or vice versa?
I guess you did not mean "mapping" but rather "interface" or API (or, more
likely, I misunderstand the meaning of "mapping" ;). Well, that is exactly
what I was attempting to submit: a general-purpose, low-level but easy-to-use
objects API for kernel clients, be it a dead-simple exofs or a complex
multi-head beast like a pNFS-Objects file system. The same
library/API/interface will be used for NFS-Clients, NFSD-Servers,
reconstruction, security, whatever.

The block layer is not sliced out; only the elevator function is, since BIO
merging, if any, is not device-global but per-object/file, and the elevator
does not currently support that. (Profiling shows that it will be needed.)

BTW, the block-based filesystems are just a big minority in the kernel. The
majority do not use the block layer either.

>
> I suppose this could be taken to show that such a layer is impossibly
> complex, as you assert, but its lack is reflected in strange looking
> design decisions like in-kernel mkfs. It would also mean that there
> would be very little layered code sharing between ODS based filesystems.
- would be very little layered code sharing between ODS based filesystems.
+ would be very little layered code sharing between OSD based filesystems.

I disagree. All the OSD-based file systems (in Linux) should absolutely use
only the open-osd library submitted. I myself will work on a couple. If
anything is missing that could not be added later, I would like to know
about it.

The user-mode interface is another matter. There are some ideas, and some
are already implemented.
[Hosted on open-osd.org, see:
 http://git.open-osd.org/gitweb.cgi?p=osc-osd/.git;a=summary
 look inside the osd-initiator directory]

I also have a toy interface, in the form of an OSDVFS module, that adds no
new entries into the kernel and lets you access the raw OSD device through
the VFS name-space.

The lack of any user-mode API just reflects the lack of any current
need/priority, or the fact that I'm the only one working on OSD.
But it is nothing that could not be solved in two weeks of pragmatic work.
Surely it's not a paradigm problem.

>
>> - I intend to refactor the code further to make use of more super.c services,
>> so to make this addition even smaller. Also future direction of raid over
>> multiple objects will make even more kernel infrastructure needed which
>> will need even more user-mode code duplication.
>> - I anticipate problems that are not yet addressed in this body of work
>> but will be in the future, mainly that a single OSD-target (lun) can
>> be shared by lots of FSs, and a single FS can span many OSD-targets.
>> Some central management is much easier to do in Kernel.
>>
>>> What are the dependencies for this filesystem code? I assume that it
>>> depends on various block- and scsi-level patches? Which ones, and
>>> what is their status, and is this code even compileable without them?
>>>
>> This OSD-based file system is dependent on the open-osd initiator library
>> code that I've submitted for inclusion for 2.6.29. It has been sitting
>> in linux-next for a while now, and has not been receiving any comments
>> for the last two updated patchsets I've sent to scsi-misc/lkml. However
>> it has not yet been submitted into Jame's scsi-misc git tree, and James
>> is the ultimate maintainer that should submit this work. I hope it will
>> still be submitted into 2.6.29, as this code is totally self sufficient
>> and does not endangers or changes any other Kernel subsystems.
>> (All the needed ground work was already submitted to Linus since 2.6.26)
>> So why should it not?
>
> I don't like it mainly because it's not truly a useful general framework
> for others to build on. However, as argued above, there might not
> actually be such a useful framework, so as long as the only two
> consumers (you and Lustre) want an interface like this, I'll put it in.
>

Time will tell, but I believe the exact opposite.
I believe, and strive for, this OSD body of work to be useful for anybody
that needs to talk T10-OSD in Linux, for any purpose. Anything missing
should be easy to add.

> James
>
>

To summarize the way I see it:
- James is right in that we cannot currently see the full OSD picture, since
  we do not have a user-mode API, so the usefulness of it all is unclear.
  [I will send an RFD soon, and hope all interested will chime in on the
  discussion]
- That said, all the submitted code is still relevant and useful, though in
  a few places it takes the route of pragmatic-easy vs long-term-correctness.
  [Which can be fixed]
- exofs/OSD is not the first FS that depends on a non-block-dev stack of its
  own. The lower level (OSD) is represented to the kernel as a char-dev plus
  an additional API, which is common to other FS/stack models. The lower OSD
  level, though, has the potential to be a generic layer that can be used by
  lots of users and use cases, not only file systems.

Thank you James for your consideration
Boaz
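
For illustration only, here is a minimal sketch of the "mkexofs as a script" idea from the patch description quoted at the top of this mail. The mount options mkfs=1 and format=<capacity in megabytes> are taken from that description; the filesystem type name "exofs", the device path /dev/osd0 and the mount point /mnt/exofs are assumptions made for the example, and the helper names are hypothetical. It only builds and prints the commands; it does not run them.

```python
import shlex


def exofs_mkfs_mount_cmd(device, mountpoint, capacity_meg):
    """Build the mount command that would ask the kernel to mkfs/format.

    The option names (mkfs=1, format=<megabytes>) follow the quoted patch
    description; "-t exofs" is an assumed filesystem type name.
    """
    if capacity_meg <= 0:
        raise ValueError("capacity_meg must be a positive number of megabytes")
    options = f"mkfs=1,format={capacity_meg}"
    return ["mount", "-t", "exofs", "-o", options, device, mountpoint]


def mkexofs_commands(device, mountpoint, capacity_meg):
    """Return the mount-then-unmount pair that together acts as 'mkexofs'."""
    return [
        exofs_mkfs_mount_cmd(device, mountpoint, capacity_meg),
        ["umount", mountpoint],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # /dev/osd0 and /mnt/exofs are placeholder paths.
    for cmd in mkexofs_commands("/dev/osd0", "/mnt/exofs", 1024):
        print(shlex.join(cmd))
```

Running the commands for real would need root privileges and a kernel carrying the proposed patches; passing each list to subprocess.run(cmd, check=True) would be one way to do that.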