Date: Fri, 13 Dec 2002 13:40:42 -0800
From: Andrew Morton
To: Hans Reiser
CC: Linus Torvalds, Linux Kernel Mailing List
Subject: Re: [BK][PATCH] ReiserFS CPU and memory bandwidth efficient large writes

Hans Reiser wrote:
> ..
> These three changesets implement reiserfs_file_write.

What a big patch.

I'd be interested in testing it a bit.  Could you please describe under
what circumstances the new code paths are executed?  Is it simply
appending to a normal old file?  No special mount options?

A few little comments, and maybe a security problem....

> +int reiserfs_can_fit_pages ( struct super_block *sb /* superblock of filesystem
> +                                                       to estimate space */ )
> +{
> +	unsigned long space;
> +
> +	spin_lock(&REISERFS_SB(sb)->bitmap_lock);
> +	space = (SB_FREE_BLOCKS(sb) - REISERFS_SB(sb)->reserved_blocks) / ( PAGE_CACHE_SIZE/sb->s_blocksize);
> +	spin_unlock(&REISERFS_SB(sb)->bitmap_lock);
> +
> +	return space;
> }

Both of the divisions here can be replaced with shifts.  Divides are
expensive.

The locking here is peculiar:

	spin_lock(lock);
	value = calculation();
	spin_unlock(lock);
	return value;

Surely, if the calculation relies on the lock then the returned value
is invalidated as soon as you drop the lock?

> -	offset = le_ih_k_offset( ih ) + (old_len - tb->rbytes );
> +	offset = le_ih_k_offset( ih ) + (old_len - tb->rbytes )*(is_indirect_le_ih(ih)?tb->tb_sb->s_blocksize/UNFM_P_SIZE:1);

Can use a shift.

> +	if (is_indirect_le_key(version,B_N_PKEY(tb->R[0],0))){
> +		temp_rem = (n_rem / UNFM_P_SIZE) *
> +			tb->tb_sb->s_blocksize;

Shift by i_blkbits (all over the place).

> +/* this function from inode.c would be used here, too */
> +extern void restart_transaction(struct reiserfs_transaction_handle *th,
> +				struct inode *inode, struct path *path);

This declaration should be in a header file.

> +
> +/* I really do not want to play with memory shortage right now, so
> +   to simplify the code, we are not going to write more than this much pages at
> +   a time. This still should considerably improve performance compared to 4k
> +   at a time case. */
> +#define REISERFS_WRITE_PAGES_AT_A_TIME 32

Page sizes vary.  A better value may be

	(128 * 1024) / PAGE_CACHE_SIZE

so that consistent behaviour is seen on platforms which have page sizes
larger than 4k.

> +int reiserfs_allocate_blocks_for_region(
> ...
> +	b_blocknr_t allocated_blocks[blocks_to_allocate]; // Pointer to a place where allocated blocknumbers would be stored. Right now statically allocated, later that will change.

This is a variable-sized array (aka: alloca).  It's not a problem IMO,
but worth pointing out...
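For reference, a heap-based alternative would look something like this
(a sketch only, not code from the patch; the GFP_NOFS flag and the
error return are my guesses at what would fit here):

	/* Hypothetical: bound worst-case stack usage by taking the
	 * block number array from the heap rather than the stack. */
	b_blocknr_t *allocated_blocks;

	allocated_blocks = kmalloc(blocks_to_allocate * sizeof(b_blocknr_t),
				   GFP_NOFS);
	if (allocated_blocks == NULL)
		return -ENOMEM;
	/* ... fill and consume allocated_blocks as before ... */
	kfree(allocated_blocks);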
> +	reiserfs_blocknr_hint_t hint;	// hint structure for block allocator.
> +	size_t res;		// return value of various functions that we call.
> +	int curr_block;		// current block used to keep track of unmapped blocks.
> +	int i;			// loop counter
> +	int itempos;		// position in item
> +	unsigned int from = (pos & (PAGE_CACHE_SIZE - 1));	// writing position in
> +								// first page
> +	unsigned int to = ((pos + write_bytes - 1) & (PAGE_CACHE_SIZE - 1)) + 1; /* last modified byte offset in last page */
> +	__u64 hole_size ;	// amount of blocks for a file hole, if it needed to be created.
> +	int modifying_this_item = 0;	// Flag for items traversal code to keep track
> +					// of the fact that we already prepared
> +					// current block for journal

How much stack is this function using, worst-case?

> +	for ( i = 0; i < num_pages; i++) {
> +		prepared_pages[i] = grab_cache_page(mapping, index + i); // locks the page

OK.  But note that locking multiple pages here is only free from AB/BA
deadlocks because this is the only path which does it, and it is
singly-threaded via i_sem.

> +	if ( from != 0 ) {	/* First page needs to be partially zeroed */
> +		char *kaddr = kmap_atomic(prepared_pages[0], KM_USER0);
> +		memset(kaddr, 0, from);
> +		kunmap_atomic( kaddr, KM_USER0);
> +		SetPageUptodate(prepared_pages[0]);
> +	}
> +	if ( to != PAGE_CACHE_SIZE ) { /* Last page needs to be partially zeroed */
> +		char *kaddr = kmap_atomic(prepared_pages[num_pages-1], KM_USER0);
> +		memset(kaddr+to, 0, PAGE_CACHE_SIZE - to);
> +		kunmap_atomic( kaddr, KM_USER0);
> +		SetPageUptodate(prepared_pages[num_pages-1]);
> +	}

This seems wrong.  This could be a newly-allocated pagecache page.  It
is not yet fully uptodate.  If (say) the subsequent copy_from_user gets
a fault then it appears that this now-uptodate pagecache page will leak
uninitialised stuff?

> +	set_bit(BH_Uptodate, &bh->b_state);

set_buffer_uptodate(bh) is more conventional (two instances).

> +	/* We need to mark new file size in case this function will be
> +	   interrupted/aborted later on. And we may do this only for
> +	   holes. */
> +	inode->i_size += inode->i_sb->s_blocksize * blocks_needed;

64-bit multiply, should be a shift.
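Something like this, perhaps (a sketch only; s_blocksize_bits in
struct super_block is the log2 of s_blocksize, and the loff_t cast
keeps the arithmetic 64-bit):

	/* s_blocksize == 1 << s_blocksize_bits, so the multiply
	 * reduces to a shift of the block count. */
	inode->i_size += (loff_t)blocks_needed <<
			inode->i_sb->s_blocksize_bits;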