LinuxLists.cc - partial reiser4 review comments

2006-08-03 07:17:48

Subject: partial reiser4 review comments

I'm only partway through this, but let me unload my observations thus far.
I still need to find a chunk of time for a file-by-file walkthrough and
it's unobvious where that chunk will come from at present :(

reiser4 as found in 2.6.18-rc2-mm1.

- set_page_dirty_internal() pokes around in VFS internals. Use
__set_page_dirty_no_buffers() or create a new library function in
mm/page-writeback.c.

In particular, it gets the radix-tree dirty tagging out of sync.

- running igrab() in the writepage() path is really going to hammer
inode_lock. Something else will need to be done here.

- The preferred way of solving the above would be to mark the page as
PageWriteback() with set_page_writeback() prior to unlocking it. That'll
pin the page and the inode. It does require that the page actually get
written later on. If we cannot do that then more thought is needed.

- wbq.sem should be using a completion for the "wait until entd finishes",
not a semaphore. Because there's a teeny theoretical race when using
semaphores this way which completions were designed to avoid. (The waker
can still be playing with the semaphore when it has gone out of scope on the
wakee's stack).

- write_page_by_ent(): the "spin until entd thread" thing is gross.

This function is really lock-intensive.

- If poss, use wake_up_process() rather than wake_up(). That'll save some
locking.

- Running iput() in entd() is a bit surprising. iirc there are various ways
in which this can recur into the filesystem, perform I/O, etc. I guess it
works..

But again, it will hammer inode_lock.

- entd_flush(): bug:

rq->wbc->range_start = rq->page->index << PAGE_CACHE_SHIFT;

this can overflow on 32-bit. Need to cast rq->page->index to loff_t.

- the writeout logic in entd_flush() is interesting (as in "holy cow").
It's very central and really needs some good comments describing what's
going on in there - what problems are being solved, which decisions were
taken and why, etc.

The big comment in page_cache.c is useful. Please maintain it well.

Boy, it has some old stuff in it.

- writeout() is a poor name for a global function. Even things like
"txn_restart" are a bit generic-sounding. Low-risk, but the kernel's
getting bigger...

If it were mine, I'd prefix all these symbols with "r4_".

prepare_to_sleep(), page_io(), drop_page(), statfs_type(),
pre_commit_hook(), etc, etc, etc, etc. Much namespace pollution.

- reiser4_wait_page_writeback() needs commenting.

- reading the comment in txnmgr.c regarding MAP_SHARED pages: a number of
things have changed since then. We have page-becoming-writeable
notifications and probably soon we'll always take a pagefault when a
MAP_SHARED page transitions from pte-clean to pte-dirty (although I wouldn't
recommend that a filesystem rely upon the latter for a while yet).

- invalidate_list() is a poorly-chosen global identifier. We already have
an invalidate_list() in fs/inode.c, too.

Please audit all of reiser4's global identifiers (use nm *.o) for suitable
naming choices.

- semaphores are deprecated. Please switch to mutexes and/or completions
where appropriate and possible.

- page_cache.c: yes, mpage_end_io_write() and mpage_end_io_read() are pretty
generic - we might as well export them.

- drop_page() is a worry. Why _does_ reiser4 need to remove pages from
pagecache? That isn't a filesystem function.

drop_page() appears to leave the no-longer-present page tagged as dirty in
the radix-tree.

- truncate_jnodes_range() looks wrong to me. When we populate gang[], there
can be any number of NULL entries placed in it. But the loop which iterates
across the now-populated gang[] will bale out when it its the _first_ NULL
entry. Any following entries will have a leaked jref() against them.

- reiser4_invalidate_pages() is a mix of reiser4 things and of
things-which-the-vfs-is-supposed-to-do. It is uncommented and I am unable
to work out why it was necessary to do this, and hence what we can do about
it.

- reiser4_readpages() shouldn't need to clean up the remaining pages on
*pages. read_cache_pages() does that now.

- <wonders what formatted and unformatted nodes are> A brief glossary might
help.

- REISER4_ERROR_CODE_BASE actually overlaps real errnos (see
include/linux/errno.h). Suggest that it be changed to 1000000 or something.

- blocknr_set_add() modifies a list_head without any apparent locking.
Certainly without any _documented_ locking...

Ditto blocknr_set_destroy(). I'm sure there's locking, but it's harder
than it should be to work out what it is. Given that proper locking is in
place, the filesystem seems to use list_*_careful() a lot more than is
necessary?

- It would be clearer to remove `struct blocknr_set' and just use list_head.

- Waaaaaaaaaaaaaaay too many typedefs.

- There are many coding-style nits. One I will mention is very large number
of unneeded braces:

if (foo) {
bar();
}

it'd be nice to fix these up sometime.

- General comment: the lack of support for extended attributes, access
control lists and direct-io is a big problem and it's getting bigger. I
don't see how a vendor could support reiser4 without these features and
permanent lack of vendor support will hurt.

What's the plan here?

2006-08-03 14:26:48

by Christoph Hellwig

[permalink] [raw]

Subject: Re: partial reiser4 review comments

On Thu, Aug 03, 2006 at 12:17:41AM -0700, Andrew Morton wrote:
> - running igrab() in the writepage() path is really going to hammer
> inode_lock. Something else will need to be done here.

> - Running iput() in entd() is a bit surprising. iirc there are various ways
> in which this can recur into the filesystem, perform I/O, etc. I guess it
> works..
>
> But again, it will hammer inode_lock.

XFS used to do this and it caused lots of problems. What xfs does now
is to keep an iocount in the inode for outstanding I/Os and ->clear_inode
waits for that I/O to finish using hashed waitqueues. It would be nice
to have a facility like that in generic code.

> - reiser4_readpages() shouldn't need to clean up the remaining pages on
> *pages. read_cache_pages() does that now.

Without looking at the code I remember someone from the Namesys people told
me they could use plain mpage_readpages now. Anything still blocking using
that function?

> - General comment: the lack of support for extended attributes, access
> control lists and direct-io is a big problem and it's getting bigger. I
> don't see how a vendor could support reiser4 without these features and
> permanent lack of vendor support will hurt.
>
> What's the plan here?

Another issue is the lack of support for blocksize < pagesize. This prevents
it from beeing used across architectures. Even worse when I tried the last
time it didn't allow me to create a 64k blocksize filesystem that I could
actually test on ppc64.

2006-08-03 20:10:55

by Alexander Zarochentsev

[permalink] [raw]

Subject: Re: partial reiser4 review comments

> I'm only partway through this, but let me unload my observations thus
> far. I still need to find a chunk of time for a file-by-file
> walkthrough and it's unobvious where that chunk will come from at
> present :(
>
>
> reiser4 as found in 2.6.18-rc2-mm1.
>
> - set_page_dirty_internal() pokes around in VFS internals. Use
> __set_page_dirty_no_buffers() or create a new library function in
> mm/page-writeback.c.
>
> In particular, it gets the radix-tree dirty tagging out of sync.
>
> - running igrab() in the writepage() path is really going to hammer
> inode_lock. Something else will need to be done here.
>
> - The preferred way of solving the above would be to mark the page as
> PageWriteback() with set_page_writeback() prior to unlocking it.
> That'll pin the page and the inode. It does require that the page
> actually get written later on. If we cannot do that then more
> thought is needed.
>
> - wbq.sem should be using a completion for the "wait until entd
> finishes", not a semaphore. Because there's a teeny theoretical race
> when using semaphores this way which completions were designed to
> avoid. (The waker can still be playing with the semaphore when it
> has gone out of scope on the wakee's stack).
>
> - write_page_by_ent(): the "spin until entd thread" thing is gross.

that spinlock is especially against the "teeny theoretical race...".
good if completion will allow us to remove it.

>
> This function is really lock-intensive.

Thanks,
Alex.

2006-08-06 14:38:23

by Alexander Zarochentsev

[permalink] [raw]

Subject: Re: partial reiser4 review comments

On 3 August 2006 18:26, Christoph Hellwig wrote:
> On Thu, Aug 03, 2006 at 12:17:41AM -0700, Andrew Morton wrote:
> > - running igrab() in the writepage() path is really going to hammer
> > inode_lock. Something else will need to be done here.
> >
> > - Running iput() in entd() is a bit surprising. iirc there are
> > various ways in which this can recur into the filesystem, perform
> > I/O, etc. I guess it works..
> >
> > But again, it will hammer inode_lock.
>
> XFS used to do this and it caused lots of problems. What xfs does
> now is to keep an iocount in the inode for outstanding I/Os and
> ->clear_inode waits for that I/O to finish using hashed waitqueues.
> It would be nice to have a facility like that in generic code.
>
> > - reiser4_readpages() shouldn't need to clean up the remaining
> > pages on *pages. read_cache_pages() does that now.
>
> Without looking at the code I remember someone from the Namesys
> people told me they could use plain mpage_readpages now. Anything
> still blocking using that function?

reiser4 tries to reduce number of tree lookups. better, if there would
be one tree lookup for one readpages call.

what we are currently doing in a not-yet-submitted patch (below), i
don't see how it can be done by mpage_readpages.

+struct uf_readpages_context {
+ lock_handle lh;
+ coord_t coord;
+};
+
+/* A callback function for readpages_unix_file/read_cache_pages.
+ * It assumes that the file is build of extents.
+ *
+ * @data -- a pointer to reiser4_readpages_context object,
+ * to save the twig lock and the coord between
+ * read_cache_page iterations.
+ * @page -- page to start read.
+ */
+static int uf_readpages_filler(void * data, struct page * page)
+{
+ struct uf_readpages_context *rc = data;
+ jnode * node;
+ int ret = 0;
+ reiser4_extent *ext;
+ __u64 ext_index;
+ int cbk_done = 0;
+ struct address_space * mapping = page->mapping;
+
+ if (PageUptodate(page)) {
+ unlock_page(page);
+ return 0;
+ }
+ node = jnode_of_page(page);
+ if (unlikely(IS_ERR(node))) {
+ ret = PTR_ERR(node);
+ goto err;
+ }
+ if (rc->lh.node == 0) {
+ /* no twig lock - have to do tree search. */
+ reiser4_key key;
+ repeat:
+ unlock_page(page);
+ key_by_inode_and_offset_common(
+ mapping->host, page_offset(page), &key);
+ ret = coord_by_key(
+ &get_super_private(mapping->host->i_sb)->tree,
+ &key, &rc->coord, &rc->lh,
+ ZNODE_READ_LOCK, FIND_EXACT,
+ TWIG_LEVEL, TWIG_LEVEL, CBK_UNIQUE, NULL);
+ lock_page(page);
+ cbk_done = 1;
+ if (ret != 0)
+ goto err;
+ }
+ ret = zload(rc->coord.node);
+ if (ret)
+ goto err;
+ ext = extent_by_coord(&rc->coord);
+ ext_index = extent_unit_index(&rc->coord);
+ if (page->index < ext_index || page->index >= ext_index +
extent_get_width(ext))
+ {
+ /* the page index doesn't belong to the extent unit which the coord
points to,
+ * -- release the lock and repeat with tree search. */
+ zrelse(rc->coord.node);
+ done_lh(&rc->lh);
+ /* we can be here after a CBK call only in case of corruption of the
tree
+ * or the tree lookup algorithm bug. */
+ if (unlikely(cbk_done)) {
+ ret = RETERR(-EIO);
+ goto err;
+ }
+ goto repeat;
+ }
+ ret = do_readpage_extent(ext, page->index - ext_index, page);
+ zrelse(rc->coord.node);
+ if (ret) {
+err:
+ unlock_page(page);
+ }
+ jput(node);
+ return ret;
+}
+
+/**
+ * readpages_unix_file - called by the readahead code, starts reading
for each
+ * page of given list of pages
+ */
+int readpages_unix_file(
+ struct file *file, struct address_space *mapping,
+ struct list_head *pages, unsigned nr_pages)
+{
+ reiser4_context *ctx;
+ struct uf_readpages_context rc;
+ int ret;
+
+ ctx = init_context(mapping->host->i_sb);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+ init_lh(&rc.lh);
+ ret = read_cache_pages(mapping, pages, uf_readpages_filler, &rc);
+ done_lh(&rc.lh);
+ context_set_commit_async(ctx);
+ /* close the transaction to protect further page allocation from
deadlocks */
+ txn_restart(ctx);
+ reiser4_exit_context(ctx);
+ return ret;
+}

what it does do yet -- it doesn't use one bio for more than one page.

BTW, read_cache_page() and mpage_readpages are similar, I guess the
second can be rewritten using the first one, no?

> > - General comment: the lack of support for extended ttributes,
> > access control lists and direct-io is a big problem and it's
> > getting bigger. I don't see how a vendor could support reiser4
> > without these features and permanent lack of vendor support will
> > hurt.
> >
> > What's the plan here?
>
> Another issue is the lack of support for blocksize < pagesize. This
> prevents it from beeing used across architectures. Even worse when I
> tried the last time it didn't allow me to create a 64k blocksize
> filesystem that I could actually test on ppc64.

--
Alex.

2006-08-09 08:59:56

by Christoph Hellwig

[permalink] [raw]

Subject: Re: partial reiser4 review comments

On Sun, Aug 06, 2006 at 06:38:34PM +0400, Alexander Zarochentsev wrote:
> > > - reiser4_readpages() shouldn't need to clean up the remaining
> > > pages on *pages. read_cache_pages() does that now.
> >
> > Without looking at the code I remember someone from the Namesys
> > people told me they could use plain mpage_readpages now. Anything
> > still blocking using that function?
>
> reiser4 tries to reduce number of tree lookups. better, if there would
> be one tree lookup for one readpages call.

Right now mpage_redpages does one get_block per extent. I think it's
pretty messy do do one block allocator call that can return multiple
extents because that leads into a lot of complexity for a rather
questionable gain.(XFS on IRIX does that)

>
> what we are currently doing in a not-yet-submitted patch (below), i
> don't see how it can be done by mpage_readpages.
>
> +struct uf_readpages_context {
> + lock_handle lh;
> + coord_t coord;
> +};

I must admit that standalone code snipplet doesn't really tell me a lot.
Do you mean the possibility to pass around a filesystem-defined structure
to multiple allocator calls? I'm pretty sure can add that, I though it
would be useful multiple times in the past but always found ways around
it.

> BTW, read_cache_page() and mpage_readpages are similar, I guess the
> second can be rewritten using the first one, no?

Do you mean read_cache_page() or read_cache_pages() ?

(Yeah, it really bad that we have two functions sounding the same but doing
thing quite differently..)

read_cache_pages() could probably be folded into mpages_readpages by allowing
it to give an additional readpage callback similar to how mpage_writepages
can either do real direct to bio or be used as an interator over writepages
calls.

read_cache_page() is quite different from read_cache_pages() and
mpages_readpages in that it is synchronous and waits for the read to complete
before returning.

2006-08-09 10:18:42

by Hans Reiser

[permalink] [raw]

Subject: Re: partial reiser4 review comments

Christoph Hellwig wrote:

> I must admit that standalone code snipplet doesn't really tell me a lot.
>
>Do you mean the possibility to pass around a filesystem-defined structure
>to multiple allocator calls? I'm pretty sure can add that, I though it
>would be useful multiple times in the past but always found ways around
>it.
>
>
>
Assuming I understand your discussion, I see two ways to go, one is to
pass around fs specific state and continue to call into the FS many
times, and the other is to instead provide the fs with helper functions
that accomplish readahead calculation, page allocation, etc., and let
the FS keep its state naturally without having to preserve it in some fs
defined structure. The second approach would be cleaner code design,
that would also ease cross-os porting of filesystems, in my view.

As a philosophy of design issue, the current VFS and generic code leaves
me with the feeling that people are trying to write the FS rather than
trying to help the FS writer with some useful library functions.

2006-08-10 18:31:49

by Nate Diller

[permalink] [raw]

Subject: Re: partial reiser4 review comments

On 8/9/06, Hans Reiser <[email protected]> wrote:
> Christoph Hellwig wrote:
>
> > I must admit that standalone code snipplet doesn't really tell me a lot.
> >
> >Do you mean the possibility to pass around a filesystem-defined structure
> >to multiple allocator calls? I'm pretty sure can add that, I though it
> >would be useful multiple times in the past but always found ways around
> >it.
> >
> >
> >
> Assuming I understand your discussion, I see two ways to go, one is to
> pass around fs specific state and continue to call into the FS many
> times, and the other is to instead provide the fs with helper functions
> that accomplish readahead calculation, page allocation, etc., and let
> the FS keep its state naturally without having to preserve it in some fs
> defined structure. The second approach would be cleaner code design,
> that would also ease cross-os porting of filesystems, in my view.

the second approach is the one i was heading towards with my
unfinished a_ops patches. *please* won't someone pay me to do that
work...

NATE

2006-08-11 18:55:47

by Hans Reiser

[permalink] [raw]

Subject: Re: partial reiser4 review comments

Nate Diller wrote:

> On 8/9/06, Hans Reiser <[email protected]> wrote:
>
>> Christoph Hellwig wrote:
>>
>> > I must admit that standalone code snipplet doesn't really tell me a
>> lot.
>> >
>> >Do you mean the possibility to pass around a filesystem-defined
>> structure
>> >to multiple allocator calls? I'm pretty sure can add that, I though it
>> >would be useful multiple times in the past but always found ways around
>> >it.
>> >
>> >
>> >
>> Assuming I understand your discussion, I see two ways to go, one is to
>> pass around fs specific state and continue to call into the FS many
>> times, and the other is to instead provide the fs with helper functions
>> that accomplish readahead calculation, page allocation, etc., and let
>> the FS keep its state naturally without having to preserve it in some fs
>> defined structure. The second approach would be cleaner code design,
>> that would also ease cross-os porting of filesystems, in my view.
>
>
> the second approach is the one i was heading towards with my
> unfinished a_ops patches. *please* won't someone pay me to do that
> work...
>
> NATE
>
>
You might describe it in a paragraph or so instead of just mentioning
it.....;-)

2006-08-11 23:02:38

by Nate Diller

[permalink] [raw]

Subject: Re: partial reiser4 review comments

On 8/11/06, Hans Reiser <[email protected]> wrote:
> Nate Diller wrote:
>
> > On 8/9/06, Hans Reiser <[email protected]> wrote:
> >
> >> Christoph Hellwig wrote:
> >>
> >> > I must admit that standalone code snipplet doesn't really tell me a
> >> lot.
> >> >
> >> >Do you mean the possibility to pass around a filesystem-defined
> >> structure
> >> >to multiple allocator calls? I'm pretty sure can add that, I though it
> >> >would be useful multiple times in the past but always found ways around
> >> >it.
> >> >
> >> >
> >> >
> >> Assuming I understand your discussion, I see two ways to go, one is to
> >> pass around fs specific state and continue to call into the FS many
> >> times, and the other is to instead provide the fs with helper functions
> >> that accomplish readahead calculation, page allocation, etc., and let
> >> the FS keep its state naturally without having to preserve it in some fs
> >> defined structure. The second approach would be cleaner code design,
> >> that would also ease cross-os porting of filesystems, in my view.
> >
> >
> > the second approach is the one i was heading towards with my
> > unfinished a_ops patches. *please* won't someone pay me to do that
> > work...
> >
> > NATE
> >
> >
> You might describe it in a paragraph or so instead of just mentioning
> it.....;-)
>

start by making tree_lock (write) private, using the interface
detailed below. No one should be able to add/remove pages from the
address space without going through the a_ops interface. this patch
is part of a (much) larger unfinished, and outdated, set intended to
put this interface in place. people familiar with this code will
immediately note that there are a few hard problems to solve here,
most notably the various {truncate|invalidate}_mapping_pages calls,
and the locking involved. Cleaning up the inode reclaim paths a bit
should help this, and that work is unfinished as well.

I'm always hesitant to post stuff like this, because -ENOPATCH is
really an appropriate response here.

NATE

diff -urpN linux-2.6.15-rc5-mm1/include/linux/fs.h
linux-extent/include/linux/fs.h
--- linux-2.6.15-rc5-mm1/include/linux/fs.h 2005-12-10 16:49:30.000000000 -0800
+++ linux-extent/include/linux/fs.h 2006-08-11 15:47:19.000000000 -0700
@@ -340,18 +340,33 @@ struct address_space;
struct writeback_control;

struct address_space_operations {
- int (*writepage)(struct page *page, struct writeback_control *wbc);
- int (*readpage)(struct file *, struct page *);
- int (*sync_page)(struct page *);
-
- /* Write back some dirty pages from this mapping. */
- int (*writepages)(struct address_space *, struct writeback_control *);
-
/* Set a page dirty */
+ /* takes lock to move lists, update counts */
int (*set_page_dirty)(struct page *page);
+ /* get rid of this, all it does is unplug blkdev */
+ int (*sync_page)(struct page *);

+ /*
+ * Write back dirty pages / invalidate all pages that fall within
+ * the given page range (end byte inclusive). These only affect
+ * pages already cached in this mapping.
+ */
+ int (*writepages)(struct address_space *mapping,
+ struct writeback_control *wbc);
+ int (*invalidate_range)(struct address_space *mapping,
+ pgoff_t index, unsigned nr_pages);
+
+ /*
+ * Read / create pages within the given index extent. These
+ * silently skip any pages which are already cached in this mapping.
+ *
+ * Return the number of pages allocated within the range, or an error.
+ */
+ /* filp here is wierd */
int (*readpages)(struct file *filp, struct address_space *mapping,
- struct list_head *pages, unsigned nr_pages);
+ pgoff_t index, unsigned nr_pages);
+ int (*instantiate_range)(struct address_space *mapping, pgoff_t index,
+ unsigned nr_pages);

/*
* ext3 requires that a successful prepare_write() call be followed