2007-08-28 19:20:26

by Christoph Lameter

[permalink] [raw]
Subject: [31/36] Large Blocksize: Core piece

Provide an alternate definition for the page_cache_xxx(mapping, ...)
functions that can determine the current page size from the mapping
and generate the appropriate shifts, sizes and mask for the page cache
operations. Change the basic functions that allocate pages for the
page cache to be able to handle higher order allocations.

Provide a new function

mapping_setup(struct address_space *, gfp_t mask, int order)

that allows the setup of a mapping of any compound page order.

mapping_set_gfp_mask() is still provided but it sets mappings to order 0.
Calls to mapping_set_gfp_mask() must be converted to mapping_setup() in
order for the filesystem to be able to use larger pages. For some key block
devices and filesystems the conversion is done here.

mapping_setup() for higher order is only allowed if the mapping does not
use DMA mappings or HIGHMEM, since we do not support bouncing at the moment.
Thus we BUG() on DMA mappings and clear the highmem bit of higher order mappings.

Modify the set_blocksize() function so that an arbitrary blocksize can be set.
Blocksizes up to order MAX_ORDER - 1 can be set. This is typically 8MB on many
platforms (order 11). Typically file systems are limited not only by the core
VM but also by their internal data structures. The core VM limitations fall
away with this patch; the functionality provided here can do nothing about
the internal limitations of filesystems.

Known internal limitations:

Ext2 64k
XFS 64k
Reiserfs 8k
Ext3 4k (rumor has it that changing a constant can remove the limit)
Ext4 4k

Signed-off-by: Christoph Lameter <[email protected]>
---
block/Kconfig | 17 ++++++
drivers/block/rd.c | 6 ++-
fs/block_dev.c | 29 +++++++---
fs/buffer.c | 4 +-
fs/inode.c | 7 ++-
fs/xfs/linux-2.6/xfs_buf.c | 3 +-
include/linux/buffer_head.h | 12 ++++-
include/linux/fs.h | 5 ++
include/linux/pagemap.h | 121 ++++++++++++++++++++++++++++++++++++++++--
mm/filemap.c | 17 ++++--
10 files changed, 192 insertions(+), 29 deletions(-)

Index: linux-2.6/block/Kconfig
===================================================================
--- linux-2.6.orig/block/Kconfig 2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/block/Kconfig 2007-08-27 21:16:38.000000000 -0700
@@ -62,6 +62,20 @@ config BLK_DEV_BSG
protocols (e.g. Task Management Functions and SMP in Serial
Attached SCSI).

+#
+# The functions to switch on larger pages in a filesystem will return an error
+# if the gfp flags for a mapping require only DMA pages. Highmem will always
+# be switched off for higher order mappings.
+#
+config LARGE_BLOCKSIZE
+ bool "Support blocksizes larger than page size"
+ default n
+ depends on EXPERIMENTAL
+ help
+ Allows the page cache to support higher orders of pages. Higher
+ order page cache pages may be useful to increase I/O performance
+ and support special devices like CDs or DVDs and Flash.
+
endif # BLOCK

source block/Kconfig.iosched
Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c 2007-08-27 20:59:27.000000000 -0700
+++ linux-2.6/drivers/block/rd.c 2007-08-27 21:10:38.000000000 -0700
@@ -121,7 +121,8 @@ static void make_page_uptodate(struct pa
}
} while ((bh = bh->b_this_page) != head);
} else {
- memset(page_address(page), 0, page_cache_size(page_mapping(page)));
+ memset(page_address(page), 0,
+ page_cache_size(page_mapping(page)));
}
flush_dcache_page(page);
SetPageUptodate(page);
@@ -380,7 +381,8 @@ static int rd_open(struct inode *inode,
gfp_mask = mapping_gfp_mask(mapping);
gfp_mask &= ~(__GFP_FS|__GFP_IO);
gfp_mask |= __GFP_HIGH;
- mapping_set_gfp_mask(mapping, gfp_mask);
+ mapping_setup(mapping, gfp_mask,
+ page_cache_blkbits_to_order(inode->i_blkbits));
}

return 0;
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c 2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/block_dev.c 2007-08-27 21:10:38.000000000 -0700
@@ -63,36 +63,46 @@ static void kill_bdev(struct block_devic
return;
invalidate_bh_lrus();
truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
-}
+}

int set_blocksize(struct block_device *bdev, int size)
{
- /* Size must be a power of two, and between 512 and PAGE_SIZE */
- if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
+ int order;
+
+ if (size > (PAGE_SIZE << (MAX_ORDER - 1)) ||
+ size < 512 || !is_power_of_2(size))
return -EINVAL;

/* Size cannot be smaller than the size supported by the device */
if (size < bdev_hardsect_size(bdev))
return -EINVAL;

+ order = page_cache_blocksize_to_order(size);
+
/* Don't change the size if it is same as current */
if (bdev->bd_block_size != size) {
+ int bits = blksize_bits(size);
+ struct address_space *mapping =
+ bdev->bd_inode->i_mapping;
+
sync_blockdev(bdev);
- bdev->bd_block_size = size;
- bdev->bd_inode->i_blkbits = blksize_bits(size);
kill_bdev(bdev);
+ bdev->bd_block_size = size;
+ bdev->bd_inode->i_blkbits = bits;
+ mapping_setup(mapping, GFP_NOFS, order);
}
return 0;
}
-
EXPORT_SYMBOL(set_blocksize);

int sb_set_blocksize(struct super_block *sb, int size)
{
if (set_blocksize(sb->s_bdev, size))
return 0;
- /* If we get here, we know size is power of two
- * and it's value is between 512 and PAGE_SIZE */
+ /*
+ * If we get here, we know size is power of two
+ * and it's value is valid for the page cache
+ */
sb->s_blocksize = size;
sb->s_blocksize_bits = blksize_bits(size);
return sb->s_blocksize;
@@ -574,7 +584,8 @@ struct block_device *bdget(dev_t dev)
inode->i_rdev = dev;
inode->i_bdev = bdev;
inode->i_data.a_ops = &def_blk_aops;
- mapping_set_gfp_mask(&inode->i_data, GFP_USER);
+ mapping_setup(&inode->i_data, GFP_USER,
+ page_cache_blkbits_to_order(inode->i_blkbits));
inode->i_data.backing_dev_info = &default_backing_dev_info;
spin_lock(&bdev_lock);
list_add(&bdev->bd_list, &all_bdevs);
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2007-08-27 21:09:19.000000000 -0700
+++ linux-2.6/fs/buffer.c 2007-08-27 21:10:38.000000000 -0700
@@ -1090,7 +1090,7 @@ __getblk_slow(struct block_device *bdev,
{
/* Size must be multiple of hard sectorsize */
if (unlikely(size & (bdev_hardsect_size(bdev)-1) ||
- (size < 512 || size > PAGE_SIZE))) {
+ size < 512 || size > (PAGE_SIZE << (MAX_ORDER - 1)))) {
printk(KERN_ERR "getblk(): invalid block size %d requested\n",
size);
printk(KERN_ERR "hardsect size: %d\n",
@@ -1811,7 +1811,7 @@ static int __block_prepare_write(struct
if (block_end > to || block_start < from)
zero_user_segments(page,
to, block_end,
- block_start, from)
+ block_start, from);
continue;
}
}
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c 2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/inode.c 2007-08-27 21:10:38.000000000 -0700
@@ -145,7 +145,8 @@ static struct inode *alloc_inode(struct
mapping->a_ops = &empty_aops;
mapping->host = inode;
mapping->flags = 0;
- mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE);
+ mapping_setup(mapping, GFP_HIGHUSER_PAGECACHE,
+ page_cache_blkbits_to_order(inode->i_blkbits));
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;

@@ -243,7 +244,7 @@ void clear_inode(struct inode *inode)
{
might_sleep();
invalidate_inode_buffers(inode);
-
+
BUG_ON(inode->i_data.nrpages);
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(inode->i_state & I_CLEAR);
@@ -528,7 +529,7 @@ repeat:
* for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
* If HIGHMEM pages are unsuitable or it is known that pages allocated
* for the page cache are not reclaimable or migratable,
- * mapping_set_gfp_mask() must be called with suitable flags on the
+ * mapping_setup() must be called with suitable flags and bits on the
* newly created inode's mapping
*
*/
Index: linux-2.6/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/xfs/linux-2.6/xfs_buf.c 2007-08-27 21:10:38.000000000 -0700
@@ -1547,7 +1547,8 @@ xfs_mapping_buftarg(
mapping = &inode->i_data;
mapping->a_ops = &mapping_aops;
mapping->backing_dev_info = bdi;
- mapping_set_gfp_mask(mapping, GFP_NOFS);
+ mapping_setup(mapping, GFP_NOFS,
+ page_cache_blkbits_to_order(inode->i_blkbits));
btp->bt_mapping = mapping;
return 0;
}
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h 2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/include/linux/buffer_head.h 2007-08-27 21:10:38.000000000 -0700
@@ -129,7 +129,17 @@ BUFFER_FNS(Ordered, ordered)
BUFFER_FNS(Eopnotsupp, eopnotsupp)
BUFFER_FNS(Unwritten, unwritten)

-#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK)
+static inline unsigned long bh_offset(struct buffer_head *bh)
+{
+ /*
+ * No mapping available. Use page struct to obtain
+ * order.
+ */
+ unsigned long mask = compound_size(bh->b_page) - 1;
+
+ return (unsigned long)bh->b_data & mask;
+}
+
#define touch_buffer(bh) mark_page_accessed(bh->b_page)

/* If we *know* page->private refers to buffer_heads */
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/include/linux/fs.h 2007-08-27 21:10:38.000000000 -0700
@@ -446,6 +446,11 @@ struct address_space {
spinlock_t i_mmap_lock; /* protect tree, count, list */
unsigned int truncate_count; /* Cover race condition with truncate */
unsigned long nrpages; /* number of total pages */
+#ifdef CONFIG_LARGE_BLOCKSIZE
+ loff_t offset_mask; /* Mask to get to offset bits */
+ unsigned int order; /* Page order of the pages in here */
+ unsigned int shift; /* Shift of index */
+#endif
pgoff_t writeback_index;/* writeback starts here */
const struct address_space_operations *a_ops; /* methods */
unsigned long flags; /* error bits/gfp mask */
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h 2007-08-27 19:29:55.000000000 -0700
+++ linux-2.6/include/linux/pagemap.h 2007-08-27 21:15:58.000000000 -0700
@@ -39,10 +39,35 @@ static inline gfp_t mapping_gfp_mask(str
* This is non-atomic. Only to be used before the mapping is activated.
* Probably needs a barrier...
*/
-static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
+static inline void mapping_setup(struct address_space *m,
+ gfp_t mask, int order)
{
m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) |
(__force unsigned long)mask;
+
+#ifdef CONFIG_LARGE_BLOCKSIZE
+ m->order = order;
+ m->shift = order + PAGE_SHIFT;
+ m->offset_mask = (PAGE_SIZE << order) - 1;
+ if (order) {
+ /*
+ * Bouncing is not supported. Requests for DMA
+ * memory will not work
+ */
+ BUG_ON(m->flags & (__GFP_DMA|__GFP_DMA32));
+ /*
+ * Bouncing not supported. We cannot use HIGHMEM
+ */
+ m->flags &= ~__GFP_HIGHMEM;
+ m->flags |= __GFP_COMP;
+ /*
+ * If we could raise the kswapd order then it should be
+ * done here.
+ *
+ * raise_kswapd_order(order);
+ */
+ }
+#endif
}

/*
@@ -62,6 +87,78 @@ static inline void mapping_set_gfp_mask(
#define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)

/*
+ * The next set of functions allow to write code that is capable of dealing
+ * with multiple page sizes.
+ */
+#ifdef CONFIG_LARGE_BLOCKSIZE
+/*
+ * Determine page order from the blkbits in the inode structure
+ */
+static inline int page_cache_blkbits_to_order(int shift)
+{
+ BUG_ON(shift < 9);
+
+ if (shift < PAGE_SHIFT)
+ return 0;
+
+ return shift - PAGE_SHIFT;
+}
+
+/*
+ * Determine page order from a given blocksize
+ */
+static inline int page_cache_blocksize_to_order(unsigned long size)
+{
+ return page_cache_blkbits_to_order(ilog2(size));
+}
+
+static inline int mapping_order(struct address_space *a)
+{
+ return a->order;
+}
+
+static inline int page_cache_shift(struct address_space *a)
+{
+ return a->shift;
+}
+
+static inline unsigned int page_cache_size(struct address_space *a)
+{
+ return a->offset_mask + 1;
+}
+
+static inline loff_t page_cache_mask(struct address_space *a)
+{
+ return ~a->offset_mask;
+}
+
+static inline unsigned int page_cache_offset(struct address_space *a,
+ loff_t pos)
+{
+ return pos & a->offset_mask;
+}
+#else
+/*
+ * Kernel configured for a fixed PAGE_SIZEd page cache
+ */
+static inline int page_cache_blkbits_to_order(int shift)
+{
+ if (shift < 9)
+ return -EINVAL;
+ if (shift > PAGE_SHIFT)
+ return -EINVAL;
+ return 0;
+}
+
+static inline int page_cache_blocksize_to_order(unsigned long size)
+{
+ if (size >= 512 && size <= PAGE_SIZE)
+ return 0;
+
+ return -EINVAL;
+}
+
+/*
* Functions that are currently setup for a fixed PAGE_SIZEd. The use of
* these will allow a variable page size pagecache in the future.
*/
@@ -90,6 +187,7 @@ static inline unsigned int page_cache_of
{
return pos & ~PAGE_MASK;
}
+#endif

static inline pgoff_t page_cache_index(struct address_space *a,
loff_t pos)
@@ -112,27 +210,37 @@ static inline loff_t page_cache_pos(stru
return ((loff_t)index << page_cache_shift(a)) + offset;
}

+/*
+ * Legacy function. Only supports order 0 pages.
+ */
+static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
+{
+ BUG_ON(mapping_order(m));
+ mapping_setup(m, mask, 0);
+}
+
#define page_cache_get(page) get_page(page)
#define page_cache_release(page) put_page(page)
void release_pages(struct page **pages, int nr, int cold);

#ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(gfp_t gfp, int);
#else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(gfp_t gfp, int order)
{
- return alloc_pages(gfp, 0);
+ return alloc_pages(gfp, order);
}
#endif

static inline struct page *page_cache_alloc(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x));
+ return __page_cache_alloc(mapping_gfp_mask(x), mapping_order(x));
}

static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+ return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD,
+ mapping_order(x));
}

typedef int filler_t(void *, struct page *);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c 2007-08-27 21:09:19.000000000 -0700
+++ linux-2.6/mm/filemap.c 2007-08-27 21:14:55.000000000 -0700
@@ -471,13 +471,13 @@ int add_to_page_cache_lru(struct page *p
}

#ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc(gfp_t gfp, int order)
{
if (cpuset_do_page_mem_spread()) {
int n = cpuset_mem_spread_node();
- return alloc_pages_node(n, gfp, 0);
+ return alloc_pages_node(n, gfp, order);
}
- return alloc_pages(gfp, 0);
+ return alloc_pages(gfp, order);
}
EXPORT_SYMBOL(__page_cache_alloc);
#endif
@@ -678,7 +678,7 @@ repeat:
if (!page) {
if (!cached_page) {
cached_page =
- __page_cache_alloc(gfp_mask);
+ __page_cache_alloc(gfp_mask, mapping_order(mapping));
if (!cached_page)
return NULL;
}
@@ -818,7 +818,8 @@ grab_cache_page_nowait(struct address_sp
page_cache_release(page);
return NULL;
}
- page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
+ page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS,
+ mapping_order(mapping));
if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
page_cache_release(page);
page = NULL;
@@ -1479,6 +1480,12 @@ int generic_file_mmap(struct file * file
{
struct address_space *mapping = file->f_mapping;

+ /*
+ * Forbid mmap access to higher order mappings.
+ */
+ if (mapping_order(mapping))
+ return -ENOSYS;
+
if (!mapping->a_ops->readpage)
return -ENOEXEC;
file_accessed(file);

--


2007-08-30 00:11:20

by Mingming Cao

[permalink] [raw]
Subject: Re: [31/36] Large Blocksize: Core piece

On Tue, 2007-08-28 at 12:06 -0700, [email protected] wrote:
> Known internal limitations:
>
> Ext2 64k
> XFS 64k
> Reiserfs 8k
> Ext3 4k (rumor has it that changing a constant can remove the limit)
> Ext4 4k
>

There are patches originally worked on by Takashi Sato to support large block
sizes (up to 64k) in ext2/3/4, which address the directory issue as
well. I just forward ported them and will post them in a separate thread.
I haven't had a chance to integrate them with your patch yet (next step).

thanks,
Mingming

2007-08-30 00:12:59

by Christoph Lameter

Subject: Re: [31/36] Large Blocksize: Core piece

On Wed, 29 Aug 2007, Mingming Cao wrote:

> > Known internal limitations:
> >
> > Ext2 64k
> > XFS 64k
> > Reiserfs 8k
> > Ext3 4k (rumor has it that changing a constant can remove the limit)
> > Ext4 4k
> >
>
> There are patches originally written by Takashi Sato to support large block
> sizes (up to 64k) in ext2/3/4, which addressed the directory issue as
> well. I just forward-ported them. Will post them in a separate thread.
> Haven't had a chance to integrate them with your patch yet (next step).

Ahh. Great. Keep me posted.

2007-08-30 00:47:38

by Mingming Cao

Subject: [RFC 1/4] Large Blocksize support for Ext2/3/4

The next 4 patches support large block size (up to PAGESIZE, max 64KB)
for ext2/3/4, originally from Takashi Sato.
http://marc.info/?l=linux-ext4&m=115768873518400&w=2


It's quite simple to support large block sizes in ext2/3/4: mostly just
enlarging the block size limit. But it is NOT possible to have a 64kB
blocksize on ext2/3/4 without some changes to the directory handling
code. The reason is that an empty 64kB directory block would have
rec_len == (__u16)2^16 == 0, and this would trigger an error in the
filesystem. The proposed solution is to put 2 empty records in such
a directory, or to special-case an impossible value like rec_len =
0xffff to handle this.
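The overflow and the capping workaround used by these patches can be sketched in plain C. The helper names below are illustrative stand-ins, not the actual kernel identifiers:

```c
#include <stdint.h>

/* The on-disk rec_len is a __u16, so a single record covering an empty
 * 64KB block truncates to 0 -- the overflow described above. */
static uint16_t rec_len_for_empty_block(unsigned long block_size)
{
	return (uint16_t)block_size;	/* 65536 -> 0 */
}

/* Cap the stored rec_len at 65532, the largest 4-byte-aligned value
 * representable in a __u16, and skip the unrepresentable tail when a
 * byte-offset walk lands exactly on that capped boundary. */
#define DIR_MAX_REC_LEN 65532UL

static unsigned long dir_adjust_tail_offs(unsigned long offs,
					  unsigned long bsize)
{
	if ((offs & (bsize - 1)) == DIR_MAX_REC_LEN)
		return offs + bsize - DIR_MAX_REC_LEN;
	return offs;
}
```

With a 4KB block nothing changes, since 4096 fits in a __u16; with a 64KB block the capped record ends at offset 65532 and the adjustment advances the scan to the true block boundary at 65536.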


The Patch-set consists of the following 4 patches.
[1/4] ext2/3/4: enlarge blocksize
- Allow blocksize up to pagesize

[2/4] ext2: fix rec_len overflow
- prevent rec_len from overflowing with 64KB blocksize

[3/4] ext3: fix rec_len overflow
- prevent rec_len from overflowing with 64KB blocksize

[4/4] ext4: fix rec_len overflow
- prevent rec_len from overflowing with 64KB blocksize

Just rebased to 2.6.23-rc4 and against the ext4 patch queue. Compile-tested only.

Next steps:
Need e2fsprogs changes to be able to test this feature, as mkfs needs to be
educated not to assume that rec_len equals the blocksize all the time.
Will try it with Christoph Lameter's large block patch next.


Signed-off-by: Takashi Sato <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext2/super.c | 2 +-
fs/ext3/super.c | 5 ++++-
fs/ext4/super.c | 5 +++++
include/linux/ext2_fs.h | 4 ++--
include/linux/ext3_fs.h | 4 ++--
include/linux/ext4_fs.h | 4 ++--
6 files changed, 16 insertions(+), 8 deletions(-)

Index: linux-2.6.23-rc3/fs/ext2/super.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext2/super.c 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext2/super.c 2007-08-29 15:22:29.000000000 -0700
@@ -775,7 +775,7 @@ static int ext2_fill_super(struct super_
brelse(bh);

if (!sb_set_blocksize(sb, blocksize)) {
- printk(KERN_ERR "EXT2-fs: blocksize too small for device.\n");
+ printk(KERN_ERR "EXT2-fs: bad blocksize %d.\n", blocksize);
goto failed_sbi;
}

Index: linux-2.6.23-rc3/fs/ext3/super.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext3/super.c 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext3/super.c 2007-08-29 15:22:29.000000000 -0700
@@ -1549,7 +1549,10 @@ static int ext3_fill_super (struct super
}

brelse (bh);
- sb_set_blocksize(sb, blocksize);
+ if (!sb_set_blocksize(sb, blocksize)) {
+ printk(KERN_ERR "EXT3-fs: bad blocksize %d.\n", blocksize);
+ goto out_fail;
+ }
logic_sb_block = (sb_block * EXT3_MIN_BLOCK_SIZE) / blocksize;
offset = (sb_block * EXT3_MIN_BLOCK_SIZE) % blocksize;
bh = sb_bread(sb, logic_sb_block);
Index: linux-2.6.23-rc3/fs/ext4/super.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext4/super.c 2007-08-28 11:09:40.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext4/super.c 2007-08-29 15:24:08.000000000 -0700
@@ -1626,6 +1626,11 @@ static int ext4_fill_super (struct super
goto out_fail;
}

+ if (!sb_set_blocksize(sb, blocksize)) {
+ printk(KERN_ERR "EXT4-fs: bad blocksize %d.\n", blocksize);
+ goto out_fail;
+ }
+
/*
* The ext4 superblock will not be buffer aligned for other than 1kB
* block sizes. We need to calculate the offset from buffer start.
Index: linux-2.6.23-rc3/include/linux/ext2_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext2_fs.h 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext2_fs.h 2007-08-29 15:22:29.000000000 -0700
@@ -86,8 +86,8 @@ static inline struct ext2_sb_info *EXT2_
* Macro-instructions used to manage several block sizes
*/
#define EXT2_MIN_BLOCK_SIZE 1024
-#define EXT2_MAX_BLOCK_SIZE 4096
-#define EXT2_MIN_BLOCK_LOG_SIZE 10
+#define EXT2_MAX_BLOCK_SIZE 65536
+#define EXT2_MIN_BLOCK_LOG_SIZE 10
#ifdef __KERNEL__
# define EXT2_BLOCK_SIZE(s) ((s)->s_blocksize)
#else
Index: linux-2.6.23-rc3/include/linux/ext3_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext3_fs.h 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext3_fs.h 2007-08-29 15:22:29.000000000 -0700
@@ -76,8 +76,8 @@
* Macro-instructions used to manage several block sizes
*/
#define EXT3_MIN_BLOCK_SIZE 1024
-#define EXT3_MAX_BLOCK_SIZE 4096
-#define EXT3_MIN_BLOCK_LOG_SIZE 10
+#define EXT3_MAX_BLOCK_SIZE 65536
+#define EXT3_MIN_BLOCK_LOG_SIZE 10
#ifdef __KERNEL__
# define EXT3_BLOCK_SIZE(s) ((s)->s_blocksize)
#else
Index: linux-2.6.23-rc3/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext4_fs.h 2007-08-28 11:09:40.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext4_fs.h 2007-08-29 15:22:29.000000000 -0700
@@ -104,8 +104,8 @@ struct ext4_allocation_request {
* Macro-instructions used to manage several block sizes
*/
#define EXT4_MIN_BLOCK_SIZE 1024
-#define EXT4_MAX_BLOCK_SIZE 4096
-#define EXT4_MIN_BLOCK_LOG_SIZE 10
+#define EXT4_MAX_BLOCK_SIZE 65536
+#define EXT4_MIN_BLOCK_LOG_SIZE 10
#ifdef __KERNEL__
# define EXT4_BLOCK_SIZE(s) ((s)->s_blocksize)
#else


2007-08-30 00:47:57

by Mingming Cao

Subject: [RFC 2/4]ext2: fix rec_len overflow with 64KB block size

[2/4] ext2: fix rec_len overflow
- prevent rec_len from overflowing with 64KB blocksize


Signed-off-by: Takashi Sato <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>

---
fs/ext2/dir.c | 46 ++++++++++++++++++++++++++++++++++++----------
include/linux/ext2_fs.h | 13 +++++++++++++
2 files changed, 49 insertions(+), 10 deletions(-)

Index: linux-2.6.23-rc3/fs/ext2/dir.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext2/dir.c 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext2/dir.c 2007-08-29 15:29:51.000000000 -0700
@@ -94,9 +94,9 @@ static void ext2_check_page(struct page
goto out;
}
for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
+ offs = EXT2_DIR_ADJUST_TAIL_OFFS(offs, chunk_size);
p = (ext2_dirent *)(kaddr + offs);
rec_len = le16_to_cpu(p->rec_len);
-
if (rec_len < EXT2_DIR_REC_LEN(1))
goto Eshort;
if (rec_len & 3)
@@ -108,6 +108,7 @@ static void ext2_check_page(struct page
if (le32_to_cpu(p->inode) > max_inumber)
goto Einumber;
}
+ offs = EXT2_DIR_ADJUST_TAIL_OFFS(offs, chunk_size);
if (offs != limit)
goto Eend;
out:
@@ -283,6 +284,7 @@ ext2_readdir (struct file * filp, void *
de = (ext2_dirent *)(kaddr+offset);
limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1);
for ( ;(char*)de <= limit; de = ext2_next_entry(de)) {
+ de = EXT2_DIR_ADJUST_TAIL_ADDR(kaddr, de, sb->s_blocksize);
if (de->rec_len == 0) {
ext2_error(sb, __FUNCTION__,
"zero-length directory entry");
@@ -305,8 +307,10 @@ ext2_readdir (struct file * filp, void *
return 0;
}
}
+ filp->f_pos = EXT2_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
filp->f_pos += le16_to_cpu(de->rec_len);
}
+ filp->f_pos = EXT2_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
ext2_put_page(page);
}
return 0;
@@ -343,13 +347,14 @@ struct ext2_dir_entry_2 * ext2_find_entr
start = 0;
n = start;
do {
- char *kaddr;
+ char *kaddr, *page_start;
page = ext2_get_page(dir, n);
if (!IS_ERR(page)) {
- kaddr = page_address(page);
+ kaddr = page_start = page_address(page);
de = (ext2_dirent *) kaddr;
kaddr += ext2_last_byte(dir, n) - reclen;
while ((char *) de <= kaddr) {
+ de = EXT2_DIR_ADJUST_TAIL_ADDR(page_start, de, dir->i_sb->s_blocksize);
if (de->rec_len == 0) {
ext2_error(dir->i_sb, __FUNCTION__,
"zero-length directory entry");
@@ -416,6 +421,7 @@ void ext2_set_link(struct inode *dir, st
unsigned to = from + le16_to_cpu(de->rec_len);
int err;

+ to = EXT2_DIR_ADJUST_TAIL_OFFS(to, inode->i_sb->s_blocksize);
lock_page(page);
err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
BUG_ON(err);
@@ -446,6 +452,7 @@ int ext2_add_link (struct dentry *dentry
char *kaddr;
unsigned from, to;
int err;
+ char *page_start = NULL;

/*
* We take care of directory expansion in the same loop.
@@ -460,16 +467,28 @@ int ext2_add_link (struct dentry *dentry
if (IS_ERR(page))
goto out;
lock_page(page);
- kaddr = page_address(page);
+ kaddr = page_start = page_address(page);
dir_end = kaddr + ext2_last_byte(dir, n);
de = (ext2_dirent *)kaddr;
- kaddr += PAGE_CACHE_SIZE - reclen;
+ if (chunk_size < EXT2_DIR_MAX_REC_LEN) {
+ kaddr += PAGE_CACHE_SIZE - reclen;
+ } else {
+ kaddr += PAGE_CACHE_SIZE -
+ (chunk_size - EXT2_DIR_MAX_REC_LEN) - reclen;
+ }
while ((char *)de <= kaddr) {
+ de = EXT2_DIR_ADJUST_TAIL_ADDR(page_start, de, chunk_size);
if ((char *)de == dir_end) {
/* We hit i_size */
name_len = 0;
- rec_len = chunk_size;
- de->rec_len = cpu_to_le16(chunk_size);
+ if (chunk_size < EXT2_DIR_MAX_REC_LEN) {
+ rec_len = chunk_size;
+ de->rec_len = cpu_to_le16(chunk_size);
+ } else {
+ rec_len = EXT2_DIR_MAX_REC_LEN;
+ de->rec_len =
+ cpu_to_le16(EXT2_DIR_MAX_REC_LEN);
+ }
de->inode = 0;
goto got_it;
}
@@ -499,6 +518,7 @@ int ext2_add_link (struct dentry *dentry
got_it:
from = (char*)de - (char*)page_address(page);
to = from + rec_len;
+ to = EXT2_DIR_ADJUST_TAIL_OFFS(to, chunk_size);
err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
if (err)
goto out_unlock;
@@ -541,6 +561,7 @@ int ext2_delete_entry (struct ext2_dir_e
ext2_dirent * de = (ext2_dirent *) (kaddr + from);
int err;

+ to = EXT2_DIR_ADJUST_TAIL_OFFS(to, inode->i_sb->s_blocksize);
while ((char*)de < (char*)dir) {
if (de->rec_len == 0) {
ext2_error(inode->i_sb, __FUNCTION__,
@@ -598,7 +619,11 @@ int ext2_make_empty(struct inode *inode,

de = (struct ext2_dir_entry_2 *)(kaddr + EXT2_DIR_REC_LEN(1));
de->name_len = 2;
- de->rec_len = cpu_to_le16(chunk_size - EXT2_DIR_REC_LEN(1));
+ if (chunk_size < EXT2_DIR_MAX_REC_LEN) {
+ de->rec_len = cpu_to_le16(chunk_size - EXT2_DIR_REC_LEN(1));
+ } else {
+ de->rec_len = cpu_to_le16(EXT2_DIR_MAX_REC_LEN - EXT2_DIR_REC_LEN(1));
+ }
de->inode = cpu_to_le32(parent->i_ino);
memcpy (de->name, "..\0", 4);
ext2_set_de_type (de, inode);
@@ -618,18 +643,19 @@ int ext2_empty_dir (struct inode * inode
unsigned long i, npages = dir_pages(inode);

for (i = 0; i < npages; i++) {
- char *kaddr;
+ char *kaddr, *page_start;
ext2_dirent * de;
page = ext2_get_page(inode, i);

if (IS_ERR(page))
continue;

- kaddr = page_address(page);
+ kaddr = page_start = page_address(page);
de = (ext2_dirent *)kaddr;
kaddr += ext2_last_byte(inode, i) - EXT2_DIR_REC_LEN(1);

while ((char *)de <= kaddr) {
+ de = EXT2_DIR_ADJUST_TAIL_ADDR(page_start, de, inode->i_sb->s_blocksize);
if (de->rec_len == 0) {
ext2_error(inode->i_sb, __FUNCTION__,
"zero-length directory entry");
Index: linux-2.6.23-rc3/include/linux/ext2_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext2_fs.h 2007-08-29 15:22:29.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext2_fs.h 2007-08-29 15:29:51.000000000 -0700
@@ -557,5 +557,18 @@ enum {
#define EXT2_DIR_ROUND (EXT2_DIR_PAD - 1)
#define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND) & \
~EXT2_DIR_ROUND)
+#define EXT2_DIR_MAX_REC_LEN 65532
+
+/*
+ * Align a tail offset(address) to the end of a directory block
+ */
+#define EXT2_DIR_ADJUST_TAIL_OFFS(offs, bsize) \
+ ((((offs) & ((bsize) -1)) == EXT2_DIR_MAX_REC_LEN) ? \
+ ((offs) + (bsize) - EXT2_DIR_MAX_REC_LEN):(offs))
+
+#define EXT2_DIR_ADJUST_TAIL_ADDR(page, de, bsize) \
+ (((((char*)(de) - (page)) & ((bsize) - 1)) == EXT2_DIR_MAX_REC_LEN) ? \
+ ((ext2_dirent*)((char*)(de) + (bsize) - EXT2_DIR_MAX_REC_LEN)):(de))

#endif /* _LINUX_EXT2_FS_H */
+


2007-08-30 00:48:26

by Mingming Cao

Subject: [RFC 3/4] ext3: fix rec_len overflow with 64KB block size

[3/4] ext3: fix rec_len overflow
- prevent rec_len from overflowing with 64KB blocksize

Signed-off-by: Takashi Sato <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>

---
fs/ext3/dir.c | 13 ++++---
fs/ext3/namei.c | 88 +++++++++++++++++++++++++++++++++++++++---------
include/linux/ext3_fs.h | 9 ++++
3 files changed, 91 insertions(+), 19 deletions(-)

Index: linux-2.6.23-rc3/fs/ext3/dir.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext3/dir.c 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext3/dir.c 2007-08-29 15:40:06.000000000 -0700
@@ -100,10 +100,11 @@ static int ext3_readdir(struct file * fi
unsigned long offset;
int i, stored;
struct ext3_dir_entry_2 *de;
- struct super_block *sb;
int err;
struct inode *inode = filp->f_path.dentry->d_inode;
int ret = 0;
+ struct super_block *sb = inode->i_sb;
+ unsigned tail = sb->s_blocksize;

sb = inode->i_sb;

@@ -167,8 +168,11 @@ revalidate:
* readdir(2), then we might be pointing to an invalid
* dirent right now. Scan from the start of the block
* to make sure. */
- if (filp->f_version != inode->i_version) {
- for (i = 0; i < sb->s_blocksize && i < offset; ) {
+ if (tail > EXT3_DIR_MAX_REC_LEN) {
+ tail = EXT3_DIR_MAX_REC_LEN;
+ }
+ if (filp->f_version != inode->i_version) {
+ for (i = 0; i < tail && i < offset; ) {
de = (struct ext3_dir_entry_2 *)
(bh->b_data + i);
/* It's too expensive to do a full
@@ -189,7 +193,7 @@ revalidate:
}

while (!error && filp->f_pos < inode->i_size
- && offset < sb->s_blocksize) {
+ && offset < tail) {
de = (struct ext3_dir_entry_2 *) (bh->b_data + offset);
if (!ext3_check_dir_entry ("ext3_readdir", inode, de,
bh, offset)) {
@@ -225,6 +229,7 @@ revalidate:
}
filp->f_pos += le16_to_cpu(de->rec_len);
}
+ filp->f_pos = EXT3_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
offset = 0;
brelse (bh);
}
Index: linux-2.6.23-rc3/fs/ext3/namei.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext3/namei.c 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext3/namei.c 2007-08-29 15:30:10.000000000 -0700
@@ -262,9 +262,13 @@ static struct stats dx_show_leaf(struct
unsigned names = 0, space = 0;
char *base = (char *) de;
struct dx_hash_info h = *hinfo;
+ unsigned tail = size;

printk("names: ");
- while ((char *) de < base + size)
+ if (tail > EXT3_DIR_MAX_REC_LEN) {
+ tail = EXT3_DIR_MAX_REC_LEN;
+ }
+ while ((char *) de < base + tail)
{
if (de->inode)
{
@@ -677,8 +681,12 @@ static int dx_make_map (struct ext3_dir_
int count = 0;
char *base = (char *) de;
struct dx_hash_info h = *hinfo;
+ unsigned tail = size;

- while ((char *) de < base + size)
+ if (tail > EXT3_DIR_MAX_REC_LEN) {
+ tail = EXT3_DIR_MAX_REC_LEN;
+ }
+ while ((char *) de < base + tail)
{
if (de->name_len && de->inode) {
ext3fs_dirhash(de->name, de->name_len, &h);
@@ -775,9 +783,13 @@ static inline int search_dirblock(struct
int de_len;
const char *name = dentry->d_name.name;
int namelen = dentry->d_name.len;
+ unsigned tail = dir->i_sb->s_blocksize;

de = (struct ext3_dir_entry_2 *) bh->b_data;
- dlimit = bh->b_data + dir->i_sb->s_blocksize;
+ if (tail > EXT3_DIR_MAX_REC_LEN) {
+ tail = EXT3_DIR_MAX_REC_LEN;
+ }
+ dlimit = bh->b_data + tail;
while ((char *) de < dlimit) {
/* this code is executed quadratically often */
/* do minimal checking `by hand' */
@@ -1115,6 +1127,9 @@ static struct ext3_dir_entry_2* dx_pack_
unsigned rec_len = 0;

prev = to = de;
+ if (size > EXT3_DIR_MAX_REC_LEN) {
+ size = EXT3_DIR_MAX_REC_LEN;
+ }
while ((char*)de < base + size) {
next = (struct ext3_dir_entry_2 *) ((char *) de +
le16_to_cpu(de->rec_len));
@@ -1180,8 +1195,15 @@ static struct ext3_dir_entry_2 *do_split
/* Fancy dance to stay within two buffers */
de2 = dx_move_dirents(data1, data2, map + split, count - split);
de = dx_pack_dirents(data1,blocksize);
- de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
- de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+ if (blocksize < EXT3_DIR_MAX_REC_LEN) {
+ de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+ de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+ } else {
+ de->rec_len = cpu_to_le16(data1 + EXT3_DIR_MAX_REC_LEN -
+ (char *) de);
+ de2->rec_len = cpu_to_le16(data2 + EXT3_DIR_MAX_REC_LEN -
+ (char *) de2);
+ }
dxtrace(dx_show_leaf (hinfo, (struct ext3_dir_entry_2 *) data1, blocksize, 1));
dxtrace(dx_show_leaf (hinfo, (struct ext3_dir_entry_2 *) data2, blocksize, 1));

@@ -1236,11 +1258,15 @@ static int add_dirent_to_buf(handle_t *h
unsigned short reclen;
int nlen, rlen, err;
char *top;
+ unsigned tail = dir->i_sb->s_blocksize;

+ if (tail > EXT3_DIR_MAX_REC_LEN) {
+ tail = EXT3_DIR_MAX_REC_LEN;
+ }
reclen = EXT3_DIR_REC_LEN(namelen);
if (!de) {
de = (struct ext3_dir_entry_2 *)bh->b_data;
- top = bh->b_data + dir->i_sb->s_blocksize - reclen;
+ top = bh->b_data + tail - reclen;
while ((char *) de <= top) {
if (!ext3_check_dir_entry("ext3_add_entry", dir, de,
bh, offset)) {
@@ -1354,13 +1380,21 @@ static int make_indexed_dir(handle_t *ha
/* The 0th block becomes the root, move the dirents out */
fde = &root->dotdot;
de = (struct ext3_dir_entry_2 *)((char *)fde + le16_to_cpu(fde->rec_len));
- len = ((char *) root) + blocksize - (char *) de;
+ if (blocksize < EXT3_DIR_MAX_REC_LEN) {
+ len = ((char *) root) + blocksize - (char *) de;
+ } else {
+ len = ((char *) root) + EXT3_DIR_MAX_REC_LEN - (char *) de;
+ }
memcpy (data1, de, len);
de = (struct ext3_dir_entry_2 *) data1;
top = data1 + len;
while ((char *)(de2=(void*)de+le16_to_cpu(de->rec_len)) < top)
de = de2;
- de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+ if (blocksize < EXT3_DIR_MAX_REC_LEN) {
+ de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+ } else {
+ de->rec_len = cpu_to_le16(data1 + EXT3_DIR_MAX_REC_LEN - (char *) de);
+ }
/* Initialize the root; the dot dirents already exist */
de = (struct ext3_dir_entry_2 *) (&root->dotdot);
de->rec_len = cpu_to_le16(blocksize - EXT3_DIR_REC_LEN(2));
@@ -1450,7 +1484,11 @@ static int ext3_add_entry (handle_t *han
return retval;
de = (struct ext3_dir_entry_2 *) bh->b_data;
de->inode = 0;
- de->rec_len = cpu_to_le16(blocksize);
+ if (blocksize < EXT3_DIR_MAX_REC_LEN) {
+ de->rec_len = cpu_to_le16(blocksize);
+ } else {
+ de->rec_len = cpu_to_le16(EXT3_DIR_MAX_REC_LEN);
+ }
return add_dirent_to_buf(handle, dentry, inode, de, bh);
}

@@ -1514,7 +1552,12 @@ static int ext3_dx_add_entry(handle_t *h
goto cleanup;
node2 = (struct dx_node *)(bh2->b_data);
entries2 = node2->entries;
- node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
+ if (sb->s_blocksize < EXT3_DIR_MAX_REC_LEN) {
+ node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
+ } else {
+ node2->fake.rec_len =
+ cpu_to_le16(EXT3_DIR_MAX_REC_LEN);
+ }
node2->fake.inode = 0;
BUFFER_TRACE(frame->bh, "get_write_access");
err = ext3_journal_get_write_access(handle, frame->bh);
@@ -1602,11 +1645,15 @@ static int ext3_delete_entry (handle_t *
{
struct ext3_dir_entry_2 * de, * pde;
int i;
+ unsigned tail = bh->b_size;

i = 0;
pde = NULL;
de = (struct ext3_dir_entry_2 *) bh->b_data;
- while (i < bh->b_size) {
+ if (tail > EXT3_DIR_MAX_REC_LEN) {
+ tail = EXT3_DIR_MAX_REC_LEN;
+ }
+ while (i < tail) {
if (!ext3_check_dir_entry("ext3_delete_entry", dir, de, bh, i))
return -EIO;
if (de == de_del) {
@@ -1766,7 +1813,11 @@ retry:
de = (struct ext3_dir_entry_2 *)
((char *) de + le16_to_cpu(de->rec_len));
de->inode = cpu_to_le32(dir->i_ino);
- de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize-EXT3_DIR_REC_LEN(1));
+ if (inode->i_sb->s_blocksize < EXT3_DIR_MAX_REC_LEN) {
+ de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize-EXT3_DIR_REC_LEN(1));
+ } else {
+ de->rec_len = cpu_to_le16(EXT3_DIR_MAX_REC_LEN-EXT3_DIR_REC_LEN(1));
+ }
de->name_len = 2;
strcpy (de->name, "..");
ext3_set_de_type(dir->i_sb, de, S_IFDIR);
@@ -1801,10 +1852,10 @@ static int empty_dir (struct inode * ino
unsigned long offset;
struct buffer_head * bh;
struct ext3_dir_entry_2 * de, * de1;
- struct super_block * sb;
+ struct super_block * sb = inode->i_sb;
int err = 0;
+ unsigned tail = sb->s_blocksize;

- sb = inode->i_sb;
if (inode->i_size < EXT3_DIR_REC_LEN(1) + EXT3_DIR_REC_LEN(2) ||
!(bh = ext3_bread (NULL, inode, 0, 0, &err))) {
if (err)
@@ -1831,11 +1882,17 @@ static int empty_dir (struct inode * ino
return 1;
}
offset = le16_to_cpu(de->rec_len) + le16_to_cpu(de1->rec_len);
+ if (offset == EXT3_DIR_MAX_REC_LEN) {
+ offset += sb->s_blocksize - EXT3_DIR_MAX_REC_LEN;
+ }
de = (struct ext3_dir_entry_2 *)
((char *) de1 + le16_to_cpu(de1->rec_len));
+ if (tail > EXT3_DIR_MAX_REC_LEN) {
+ tail = EXT3_DIR_MAX_REC_LEN;
+ }
while (offset < inode->i_size ) {
if (!bh ||
- (void *) de >= (void *) (bh->b_data+sb->s_blocksize)) {
+ (void *) de >= (void *) (bh->b_data + tail)) {
err = 0;
brelse (bh);
bh = ext3_bread (NULL, inode,
@@ -1862,6 +1919,7 @@ static int empty_dir (struct inode * ino
return 0;
}
offset += le16_to_cpu(de->rec_len);
+ offset = EXT3_DIR_ADJUST_TAIL_OFFS(offset, sb->s_blocksize);
de = (struct ext3_dir_entry_2 *)
((char *) de + le16_to_cpu(de->rec_len));
}
Index: linux-2.6.23-rc3/include/linux/ext3_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext3_fs.h 2007-08-29 15:22:29.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext3_fs.h 2007-08-29 15:30:10.000000000 -0700
@@ -660,6 +660,15 @@ struct ext3_dir_entry_2 {
#define EXT3_DIR_ROUND (EXT3_DIR_PAD - 1)
#define EXT3_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT3_DIR_ROUND) & \
~EXT3_DIR_ROUND)
+#define EXT3_DIR_MAX_REC_LEN 65532
+
+/*
+ * Align a tail offset to the end of a directory block
+ */
+#define EXT3_DIR_ADJUST_TAIL_OFFS(offs, bsize) \
+ ((((offs) & ((bsize) -1)) == EXT3_DIR_MAX_REC_LEN) ? \
+ ((offs) + (bsize) - EXT3_DIR_MAX_REC_LEN):(offs))
+
/*
* Hash Tree Directory indexing
* (c) Daniel Phillips, 2001


2007-08-30 00:48:44

by Mingming Cao

Subject: [RFC 4/4]ext4: fix rec_len overflow with 64KB block size

[4/4] ext4: fix rec_len overflow
- prevent rec_len from overflowing with 64KB blocksize


Signed-off-by: Takashi Sato <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>

---
fs/ext4/dir.c | 11 ++++--
fs/ext4/namei.c | 88 +++++++++++++++++++++++++++++++++++++++---------
include/linux/ext4_fs.h | 9 ++++
3 files changed, 90 insertions(+), 18 deletions(-)

Index: linux-2.6.23-rc3/fs/ext4/dir.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext4/dir.c 2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext4/dir.c 2007-08-29 15:33:19.000000000 -0700
@@ -100,10 +100,11 @@ static int ext4_readdir(struct file * fi
unsigned long offset;
int i, stored;
struct ext4_dir_entry_2 *de;
- struct super_block *sb;
int err;
struct inode *inode = filp->f_path.dentry->d_inode;
+ struct super_block *sb = inode->i_sb;
int ret = 0;
+ unsigned tail = sb->s_blocksize;

sb = inode->i_sb;

@@ -166,8 +167,11 @@ revalidate:
* readdir(2), then we might be pointing to an invalid
* dirent right now. Scan from the start of the block
* to make sure. */
+ if (tail > EXT4_DIR_MAX_REC_LEN) {
+ tail = EXT4_DIR_MAX_REC_LEN;
+ }
if (filp->f_version != inode->i_version) {
- for (i = 0; i < sb->s_blocksize && i < offset; ) {
+ for (i = 0; i < tail && i < offset; ) {
de = (struct ext4_dir_entry_2 *)
(bh->b_data + i);
/* It's too expensive to do a full
@@ -188,7 +192,7 @@ revalidate:
}

while (!error && filp->f_pos < inode->i_size
- && offset < sb->s_blocksize) {
+ && offset < tail) {
de = (struct ext4_dir_entry_2 *) (bh->b_data + offset);
if (!ext4_check_dir_entry ("ext4_readdir", inode, de,
bh, offset)) {
@@ -225,6 +229,7 @@ revalidate:
}
filp->f_pos += le16_to_cpu(de->rec_len);
}
+ filp->f_pos = EXT4_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
offset = 0;
brelse (bh);
}
Index: linux-2.6.23-rc3/fs/ext4/namei.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext4/namei.c 2007-08-28 11:08:48.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext4/namei.c 2007-08-29 15:30:22.000000000 -0700
@@ -262,9 +262,13 @@ static struct stats dx_show_leaf(struct
unsigned names = 0, space = 0;
char *base = (char *) de;
struct dx_hash_info h = *hinfo;
+ unsigned tail = size;

printk("names: ");
- while ((char *) de < base + size)
+ if (tail > EXT4_DIR_MAX_REC_LEN) {
+ tail = EXT4_DIR_MAX_REC_LEN;
+ }
+ while ((char *) de < base + tail)
{
if (de->inode)
{
@@ -677,8 +681,12 @@ static int dx_make_map (struct ext4_dir_
int count = 0;
char *base = (char *) de;
struct dx_hash_info h = *hinfo;
+ unsigned tail = size;

- while ((char *) de < base + size)
+ if (tail > EXT4_DIR_MAX_REC_LEN) {
+ tail = EXT4_DIR_MAX_REC_LEN;
+ }
+ while ((char *) de < base + tail)
{
if (de->name_len && de->inode) {
ext4fs_dirhash(de->name, de->name_len, &h);
@@ -773,9 +781,13 @@ static inline int search_dirblock(struct
int de_len;
const char *name = dentry->d_name.name;
int namelen = dentry->d_name.len;
+ unsigned tail = dir->i_sb->s_blocksize;

de = (struct ext4_dir_entry_2 *) bh->b_data;
- dlimit = bh->b_data + dir->i_sb->s_blocksize;
+ if (tail > EXT4_DIR_MAX_REC_LEN) {
+ tail = EXT4_DIR_MAX_REC_LEN;
+ }
+ dlimit = bh->b_data + tail;
while ((char *) de < dlimit) {
/* this code is executed quadratically often */
/* do minimal checking `by hand' */
@@ -1113,6 +1125,9 @@ static struct ext4_dir_entry_2* dx_pack_
unsigned rec_len = 0;

prev = to = de;
+ if (size > EXT4_DIR_MAX_REC_LEN) {
+ size = EXT4_DIR_MAX_REC_LEN;
+ }
while ((char*)de < base + size) {
next = (struct ext4_dir_entry_2 *) ((char *) de +
le16_to_cpu(de->rec_len));
@@ -1178,8 +1193,15 @@ static struct ext4_dir_entry_2 *do_split
/* Fancy dance to stay within two buffers */
de2 = dx_move_dirents(data1, data2, map + split, count - split);
de = dx_pack_dirents(data1,blocksize);
- de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
- de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+ if (blocksize < EXT4_DIR_MAX_REC_LEN) {
+ de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+ de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+ } else {
+ de->rec_len = cpu_to_le16(data1 + EXT4_DIR_MAX_REC_LEN -
+ (char *) de);
+ de2->rec_len = cpu_to_le16(data2 + EXT4_DIR_MAX_REC_LEN -
+ (char *) de2);
+ }
dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data1, blocksize, 1));
dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data2, blocksize, 1));

@@ -1234,11 +1256,15 @@ static int add_dirent_to_buf(handle_t *h
unsigned short reclen;
int nlen, rlen, err;
char *top;
+ unsigned tail = dir->i_sb->s_blocksize;

+ if (tail > EXT4_DIR_MAX_REC_LEN) {
+ tail = EXT4_DIR_MAX_REC_LEN;
+ }
reclen = EXT4_DIR_REC_LEN(namelen);
if (!de) {
de = (struct ext4_dir_entry_2 *)bh->b_data;
- top = bh->b_data + dir->i_sb->s_blocksize - reclen;
+ top = bh->b_data + tail - reclen;
while ((char *) de <= top) {
if (!ext4_check_dir_entry("ext4_add_entry", dir, de,
bh, offset)) {
@@ -1351,13 +1377,21 @@ static int make_indexed_dir(handle_t *ha
/* The 0th block becomes the root, move the dirents out */
fde = &root->dotdot;
de = (struct ext4_dir_entry_2 *)((char *)fde + le16_to_cpu(fde->rec_len));
- len = ((char *) root) + blocksize - (char *) de;
+ if (blocksize < EXT4_DIR_MAX_REC_LEN) {
+ len = ((char *) root) + blocksize - (char *) de;
+ } else {
+ len = ((char *) root) + EXT4_DIR_MAX_REC_LEN - (char *) de;
+ }
memcpy (data1, de, len);
de = (struct ext4_dir_entry_2 *) data1;
top = data1 + len;
while ((char *)(de2=(void*)de+le16_to_cpu(de->rec_len)) < top)
de = de2;
- de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+ if (blocksize < EXT4_DIR_MAX_REC_LEN) {
+ de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+ } else {
+ de->rec_len = cpu_to_le16(data1 + EXT4_DIR_MAX_REC_LEN - (char *) de);
+ }
/* Initialize the root; the dot dirents already exist */
de = (struct ext4_dir_entry_2 *) (&root->dotdot);
de->rec_len = cpu_to_le16(blocksize - EXT4_DIR_REC_LEN(2));
@@ -1447,7 +1481,11 @@ static int ext4_add_entry (handle_t *han
return retval;
de = (struct ext4_dir_entry_2 *) bh->b_data;
de->inode = 0;
- de->rec_len = cpu_to_le16(blocksize);
+ if (blocksize < EXT4_DIR_MAX_REC_LEN) {
+ de->rec_len = cpu_to_le16(blocksize);
+ } else {
+ de->rec_len = cpu_to_le16(EXT4_DIR_MAX_REC_LEN);
+ }
return add_dirent_to_buf(handle, dentry, inode, de, bh);
}

@@ -1511,7 +1549,12 @@ static int ext4_dx_add_entry(handle_t *h
goto cleanup;
node2 = (struct dx_node *)(bh2->b_data);
entries2 = node2->entries;
- node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
+ if (sb->s_blocksize < EXT4_DIR_MAX_REC_LEN) {
+ node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
+ } else {
+ node2->fake.rec_len =
+ cpu_to_le16(EXT4_DIR_MAX_REC_LEN);
+ }
node2->fake.inode = 0;
BUFFER_TRACE(frame->bh, "get_write_access");
err = ext4_journal_get_write_access(handle, frame->bh);
@@ -1599,11 +1642,15 @@ static int ext4_delete_entry (handle_t *
{
struct ext4_dir_entry_2 * de, * pde;
int i;
+ unsigned tail = bh->b_size;

i = 0;
pde = NULL;
de = (struct ext4_dir_entry_2 *) bh->b_data;
- while (i < bh->b_size) {
+ if (tail > EXT4_DIR_MAX_REC_LEN) {
+ tail = EXT4_DIR_MAX_REC_LEN;
+ }
+ while (i < tail) {
if (!ext4_check_dir_entry("ext4_delete_entry", dir, de, bh, i))
return -EIO;
if (de == de_del) {
@@ -1791,7 +1838,11 @@ retry:
de = (struct ext4_dir_entry_2 *)
((char *) de + le16_to_cpu(de->rec_len));
de->inode = cpu_to_le32(dir->i_ino);
- de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize-EXT4_DIR_REC_LEN(1));
+ if (inode->i_sb->s_blocksize < EXT4_DIR_MAX_REC_LEN) {
+ de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize-EXT4_DIR_REC_LEN(1));
+ } else {
+ de->rec_len = cpu_to_le16(EXT4_DIR_MAX_REC_LEN-EXT4_DIR_REC_LEN(1));
+ }
de->name_len = 2;
strcpy (de->name, "..");
ext4_set_de_type(dir->i_sb, de, S_IFDIR);
@@ -1826,10 +1877,10 @@ static int empty_dir (struct inode * ino
unsigned long offset;
struct buffer_head * bh;
struct ext4_dir_entry_2 * de, * de1;
- struct super_block * sb;
+ struct super_block * sb = inode->i_sb;
int err = 0;
+ unsigned tail = sb->s_blocksize;

- sb = inode->i_sb;
if (inode->i_size < EXT4_DIR_REC_LEN(1) + EXT4_DIR_REC_LEN(2) ||
!(bh = ext4_bread (NULL, inode, 0, 0, &err))) {
if (err)
@@ -1856,11 +1907,17 @@ static int empty_dir (struct inode * ino
return 1;
}
offset = le16_to_cpu(de->rec_len) + le16_to_cpu(de1->rec_len);
+ if (offset == EXT4_DIR_MAX_REC_LEN) {
+ offset += sb->s_blocksize - EXT4_DIR_MAX_REC_LEN;
+ }
de = (struct ext4_dir_entry_2 *)
((char *) de1 + le16_to_cpu(de1->rec_len));
+ if (tail > EXT4_DIR_MAX_REC_LEN) {
+ tail = EXT4_DIR_MAX_REC_LEN;
+ }
while (offset < inode->i_size ) {
if (!bh ||
- (void *) de >= (void *) (bh->b_data+sb->s_blocksize)) {
+ (void *) de >= (void *) (bh->b_data + tail)) {
err = 0;
brelse (bh);
bh = ext4_bread (NULL, inode,
@@ -1887,6 +1944,7 @@ static int empty_dir (struct inode * ino
return 0;
}
offset += le16_to_cpu(de->rec_len);
+ offset = EXT4_DIR_ADJUST_TAIL_OFFS(offset, sb->s_blocksize);
de = (struct ext4_dir_entry_2 *)
((char *) de + le16_to_cpu(de->rec_len));
}
Index: linux-2.6.23-rc3/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext4_fs.h 2007-08-29 15:22:29.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext4_fs.h 2007-08-29 15:30:22.000000000 -0700
@@ -834,6 +834,15 @@ struct ext4_dir_entry_2 {
#define EXT4_DIR_ROUND (EXT4_DIR_PAD - 1)
#define EXT4_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT4_DIR_ROUND) & \
~EXT4_DIR_ROUND)
+#define EXT4_DIR_MAX_REC_LEN 65532
+
+/*
+ * Align a tail offset to the end of a directory block
+ */
+#define EXT4_DIR_ADJUST_TAIL_OFFS(offs, bsize) \
+ ((((offs) & ((bsize) -1)) == EXT4_DIR_MAX_REC_LEN) ? \
+ ((offs) + (bsize) - EXT4_DIR_MAX_REC_LEN):(offs))
+
/*
* Hash Tree Directory indexing
* (c) Daniel Phillips, 2001


2007-08-30 01:00:15

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 1/4] Large Blocksize support for Ext2/3/4

On Wed, 29 Aug 2007, Mingming Cao wrote:

> It's quite simple to support large block size in ext2/3/4, mostly just
> enlarge the block size limit. But it is NOT possible to have 64kB
> blocksize on ext2/3/4 without some changes to the directory handling
> code. The reason is that an empty 64kB directory block would have a
> rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
> the filesystem. The proposed solution is to put 2 empty records in such
> a directory, or to special-case an impossible value like rec_len =
> 0xffff to handle this.

Ahh. Good.

I could add the patch to the large blocksize patchset?

2007-09-01 00:01:27

by Mingming Cao

[permalink] [raw]
Subject: Re: [RFC 1/4] Large Blocksize support for Ext2/3/4

On Wed, 2007-08-29 at 17:47 -0700, Mingming Cao wrote:

> Just rebased to 2.6.23-rc4 against the ext4 patch queue. Compile tested only.
>
> Next steps:
> Need e2fsprogs changes to be able to test this feature, as mkfs needs to be
> educated not to assume rec_len is always the blocksize.
> Will try it with Christoph Lameter's large block patch next.
>

Two problems were found when testing largeblock on ext3. Patches to
follow.

Good news is, with your changes, plus all these extN changes, I am able
to run ext2/3/4 with 64k block size, tested on x86 and ppc64 with 4k
page size. The fsx test runs fine for an hour on ext3 with 16k blocksize
on x86 :-)

Mingming

2007-09-01 00:12:26

by Mingming Cao

[permalink] [raw]
Subject: [RFC 1/2] JBD: slab management support for large block(>8k)

From clameter:
Teach jbd/jbd2 slab management to support >8k block size. Without this, ext3 refuses to mount with block sizes above 8k.

Signed-off-by: Mingming Cao <[email protected]>

Index: my2.6/fs/jbd/journal.c
===================================================================
--- my2.6.orig/fs/jbd/journal.c 2007-08-30 18:40:02.000000000 -0700
+++ my2.6/fs/jbd/journal.c 2007-08-31 11:01:18.000000000 -0700
@@ -1627,16 +1627,17 @@ void * __jbd_kmalloc (const char *where,
* jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
* and allocate frozen and commit buffers from these slabs.
*
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the corresponding kmalloc
+ * cache and create an alias. --clameter)
*/
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
+#define JBD_MAX_SLABS 7
+#define JBD_SLAB_INDEX(size) get_order((size) << (PAGE_SHIFT - 10))

static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
static const char *jbd_slab_names[JBD_MAX_SLABS] = {
- "jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
+ "jbd_1k", "jbd_2k", "jbd_4k", "jbd_8k",
+ "jbd_16k", "jbd_32k", "jbd_64k"
};

static void journal_destroy_jbd_slabs(void)
Index: my2.6/fs/jbd2/journal.c
===================================================================
--- my2.6.orig/fs/jbd2/journal.c 2007-08-30 18:40:02.000000000 -0700
+++ my2.6/fs/jbd2/journal.c 2007-08-31 11:04:37.000000000 -0700
@@ -1639,16 +1639,18 @@ void * __jbd2_kmalloc (const char *where
* jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
* and allocate frozen and commit buffers from these slabs.
*
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the corresponding kmalloc
+ * cache and create an alias. --clameter)
*/

-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
+#define JBD_MAX_SLABS 7
+#define JBD_SLAB_INDEX(size) get_order((size) << (PAGE_SHIFT - 10))

static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
static const char *jbd_slab_names[JBD_MAX_SLABS] = {
- "jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
+ "jbd2_1k", "jbd2_2k", "jbd2_4k", "jbd2_8k",
+ "jbd2_16k", "jbd2_32k", "jbd2_64k"
};

static void jbd2_journal_destroy_jbd_slabs(void)


2007-09-01 00:12:39

by Mingming Cao

[permalink] [raw]
Subject: [RFC 2/2] JBD: blocks reservation fix for large block support

With large block support in the VM, the number of blocks per page can be less
than or equal to 1. This patch fixes the calculation of the number of blocks to
reserve in the journal for the case blocksize > pagesize.



Signed-off-by: Mingming Cao <[email protected]>

Index: my2.6/fs/jbd/journal.c
===================================================================
--- my2.6.orig/fs/jbd/journal.c 2007-08-31 13:27:16.000000000 -0700
+++ my2.6/fs/jbd/journal.c 2007-08-31 13:28:18.000000000 -0700
@@ -1611,7 +1611,12 @@ void journal_ack_err(journal_t *journal)

int journal_blocks_per_page(struct inode *inode)
{
- return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+ int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+ if (bits > 0)
+ return 1 << bits;
+ else
+ return 1;
}

/*
Index: my2.6/fs/jbd2/journal.c
===================================================================
--- my2.6.orig/fs/jbd2/journal.c 2007-08-31 13:32:21.000000000 -0700
+++ my2.6/fs/jbd2/journal.c 2007-08-31 13:32:30.000000000 -0700
@@ -1612,7 +1612,12 @@ void jbd2_journal_ack_err(journal_t *jou

int jbd2_journal_blocks_per_page(struct inode *inode)
{
- return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+ int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+ if (bits > 0)
+ return 1 << bits;
+ else
+ return 1;
}

/*


2007-09-01 18:39:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC 1/2] JBD: slab management support for large block(>8k)

On Fri, Aug 31, 2007 at 05:12:18PM -0700, Mingming Cao wrote:
> From clameter:
> Teach jbd/jbd2 slab management to support >8k block size. Without this, it refused to mount on >8k ext3.


But the real fix is to kill this code. We can't send slab pages
down the block layer without breaking iscsi or aoe. And this code is
only used in such rare cases that all the normal testing won't hit it.
Very bad combination.

2007-09-02 11:40:40

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 1/2] JBD: slab management support for large block(>8k)

On Sat, 1 Sep 2007, Christoph Hellwig wrote:

> On Fri, Aug 31, 2007 at 05:12:18PM -0700, Mingming Cao wrote:
> > From clameter:
> > Teach jbd/jbd2 slab management to support >8k block size. Without this, it refused to mount on >8k ext3.
>
>
> But the real fix is to kill this code. We can't send down slab pages
> down the block layer without breaking iscsi or aoe. And this code is
> only used in so rare cases that all the normal testing won't hit it.
> Very bad combination.

We are doing what you describe right now. So the current code is broken?


2007-09-02 15:28:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC 1/2] JBD: slab management support for large block(>8k)

On Sun, Sep 02, 2007 at 04:40:21AM -0700, Christoph Lameter wrote:
> On Sat, 1 Sep 2007, Christoph Hellwig wrote:
>
> > On Fri, Aug 31, 2007 at 05:12:18PM -0700, Mingming Cao wrote:
> > > From clameter:
> > > Teach jbd/jbd2 slab management to support >8k block size. Without this, it refused to mount on >8k ext3.
> >
> >
> > But the real fix is to kill this code. We can't send down slab pages
> > down the block layer without breaking iscsi or aoe. And this code is
> > only used in so rare cases that all the normal testing won't hit it.
> > Very bad combination.
>
> We are doing what you describe right now. So the current code is broken?

Yes.

2007-09-03 07:55:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 1/2] JBD: slab management support for large block(>8k)

On Sun, 2 Sep 2007, Christoph Hellwig wrote:

> > We are doing what you describe right now. So the current code is broken?
> Yes.

How about getting rid of the slabs there and using kmalloc? Kmalloc in mm
(and therefore hopefully 2.6.24) will convert kmallocs > PAGE_SIZE to page
allocator calls. Not sure what to do about the 1k and 2k requests though.

2007-09-03 13:41:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC 1/2] JBD: slab management support for large block(>8k)

On Mon, Sep 03, 2007 at 12:55:04AM -0700, Christoph Lameter wrote:
> On Sun, 2 Sep 2007, Christoph Hellwig wrote:
>
> > > We are doing what you describe right now. So the current code is broken?
> > Yes.
>
> How about getting rid of the slabs there and using kmalloc? Kmalloc in mm
> (and therefore hopefully 2.6.24) will convert kmallocs > PAGE_SIZE to page
> allocator calls. Not sure what to do about the 1k and 2k requests though.

The problem is that we must never use kmalloc pages, so we always need
to request a page or more for these. Better to use get_free_page directly,
that's how I fixed it in XFS a while ago.

2007-09-03 19:32:28

by Christoph Lameter

[permalink] [raw]
Subject: Re: [RFC 1/2] JBD: slab management support for large block(>8k)

On Mon, 3 Sep 2007, Christoph Hellwig wrote:

> > How about getting rid of the slabs there and using kmalloc? Kmalloc in mm
> > (and therefore hopefully 2.6.24) will convert kmallocs > PAGE_SIZE to page
> > allocator calls. Not sure what to do about the 1k and 2k requests though.
>
> The problem is that we must never use kmalloc pages, so we always need
> to request a page or more for these. Better to use get_free_page directly,
> that's how I fixed it in XFS a while ago.

So you'd be fine with replacing the allocs with

get_free_pages(GFP_xxx, get_order(size)) ?

2007-09-03 19:33:33

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC 1/2] JBD: slab management support for large block(>8k)

On Mon, Sep 03, 2007 at 12:31:49PM -0700, Christoph Lameter wrote:
> So you'd be fine with replacing the allocs with
>
> get_free_pages(GFP_xxx, get_order(size)) ?

Yes. And rip out all that code related to setting up the slabs. I plan
to add WARN_ONs to bio_add_page and friends to detect further usage of
slab pages if there is any.

2007-09-14 18:54:19

by Mingming Cao

[permalink] [raw]
Subject: [PATCH] JBD slab cleanups

jbd/jbd2: Replace slab allocations with page cache allocations

From: Christoph Lameter <[email protected]>

JBD should not pass slab pages down to the block layer.
Use page allocator pages instead. This will also prepare
JBD for the large blocksize patchset.

Tested on 2.6.23-rc6; fsx runs fine.

Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---
fs/jbd/checkpoint.c | 2
fs/jbd/commit.c | 6 +-
fs/jbd/journal.c | 107 ++++---------------------------------------------
fs/jbd/transaction.c | 10 ++--
fs/jbd2/checkpoint.c | 2
fs/jbd2/commit.c | 6 +-
fs/jbd2/journal.c | 109 ++++----------------------------------------------
fs/jbd2/transaction.c | 18 ++++----
include/linux/jbd.h | 23 +++++++++-
include/linux/jbd2.h | 28 ++++++++++--
10 files changed, 83 insertions(+), 228 deletions(-)

Index: linux-2.6.23-rc5/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/journal.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/journal.c 2007-09-13 13:45:39.000000000 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);

static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);

/*
* Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
char *tmp;

jbd_unlock_bh_state(bh_in);
- tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+ tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in->b_frozen_data) {
- jbd_slab_free(tmp, bh_in->b_size);
+ jbd_free(tmp, bh_in->b_size);
goto repeat;
}

@@ -679,7 +678,7 @@ static journal_t * journal_init_common (
/* Set up a default-sized revoke table for the new mount. */
err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
if (err) {
- kfree(journal);
+ jbd_kfree(journal);
goto fail;
}
return journal;
@@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
- kfree(journal);
+ jbd_kfree(journal);
journal = NULL;
goto out;
}
@@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
- kfree(journal);
+ jbd_kfree(journal);
return NULL;
}

@@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i
if (err) {
printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
__FUNCTION__);
- kfree(journal);
+ jbd_kfree(journal);
return NULL;
}

@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
}
}

- /*
- * Create a slab for this blocksize
- */
- err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
- if (err)
- return err;
-
/* Let the recovery code check whether it needs to recover any
* data from the journal. */
if (journal_recover(journal))
@@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal)
if (journal->j_revoke)
journal_destroy_revoke(journal);
kfree(journal->j_wbuf);
- kfree(journal);
+ jbd_kfree(journal);
}


@@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
}

/*
- * Simple support for retrying memory allocations. Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
- return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
- "jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
-};
-
-static void journal_destroy_jbd_slabs(void)
-{
- int i;
-
- for (i = 0; i < JBD_MAX_SLABS; i++) {
- if (jbd_slab[i])
- kmem_cache_destroy(jbd_slab[i]);
- jbd_slab[i] = NULL;
- }
-}
-
-static int journal_create_jbd_slab(size_t slab_size)
-{
- int i = JBD_SLAB_INDEX(slab_size);
-
- BUG_ON(i >= JBD_MAX_SLABS);
-
- /*
- * Check if we already have a slab created for this size
- */
- if (jbd_slab[i])
- return 0;
-
- /*
- * Create a slab and force alignment to be same as slabsize -
- * this will make sure that allocations won't cross the page
- * boundary.
- */
- jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
- slab_size, slab_size, 0, NULL);
- if (!jbd_slab[i]) {
- printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
- return -ENOMEM;
- }
- return 0;
-}
-
-void * jbd_slab_alloc(size_t size, gfp_t flags)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
-}
-
-void jbd_slab_free(void *ptr, size_t size)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- kmem_cache_free(jbd_slab[idx], ptr);
-}
-
-/*
* Journal_head storage management
*/
static struct kmem_cache *journal_head_cache;
@@ -1881,13 +1793,13 @@ static void __journal_remove_journal_hea
printk(KERN_WARNING "%s: freeing "
"b_frozen_data\n",
__FUNCTION__);
- jbd_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd_free(jh->b_frozen_data, bh->b_size);
}
if (jh->b_committed_data) {
printk(KERN_WARNING "%s: freeing "
"b_committed_data\n",
__FUNCTION__);
- jbd_slab_free(jh->b_committed_data, bh->b_size);
+ jbd_free(jh->b_committed_data, bh->b_size);
}
bh->b_private = NULL;
jh->b_bh = NULL; /* debug, really */
@@ -2042,7 +1954,6 @@ static void journal_destroy_caches(void)
journal_destroy_revoke_caches();
journal_destroy_journal_head_cache();
journal_destroy_handle_cache();
- journal_destroy_jbd_slabs();
}

static int __init journal_init(void)
Index: linux-2.6.23-rc5/include/linux/jbd.h
===================================================================
--- linux-2.6.23-rc5.orig/include/linux/jbd.h 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/include/linux/jbd.h 2007-09-13 13:42:27.000000000 -0700
@@ -71,9 +71,26 @@ extern int journal_enable_debug;
#define jbd_debug(f, a...) /**/
#endif

-extern void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-extern void * jbd_slab_alloc(size_t size, gfp_t flags);
-extern void jbd_slab_free(void *ptr, size_t size);
+static inline void *__jbd_kmalloc(const char *where, size_t size,
+ gfp_t flags, int retry)
+{
+ return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
+}
+
+static inline void jbd_kfree(void *ptr)
+{
+ return kfree(ptr);
+}
+
+static inline void *jbd_alloc(size_t size, gfp_t flags)
+{
+ return (void *)__get_free_pages(flags, get_order(size));
+}
+
+static inline void jbd_free(void *ptr, size_t size)
+{
+ free_pages((unsigned long)ptr, get_order(size));
+};

#define jbd_kmalloc(size, flags) \
__jbd_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
Index: linux-2.6.23-rc5/include/linux/jbd2.h
===================================================================
--- linux-2.6.23-rc5.orig/include/linux/jbd2.h 2007-09-13 13:37:58.000000000 -0700
+++ linux-2.6.23-rc5/include/linux/jbd2.h 2007-09-13 13:51:49.000000000 -0700
@@ -71,11 +71,27 @@ extern u8 jbd2_journal_enable_debug;
#define jbd_debug(f, a...) /**/
#endif

-extern void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-extern void * jbd2_slab_alloc(size_t size, gfp_t flags);
-extern void jbd2_slab_free(void *ptr, size_t size);
+static inline void *__jbd2_kmalloc(const char *where, size_t size,
+ gfp_t flags, int retry)
+{
+ return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
+}
+static inline void jbd2_kfree(void *ptr)
+{
+ return kfree(ptr);
+}
+
+static inline void *jbd2_alloc(size_t size, gfp_t flags)
+{
+ return (void *)__get_free_pages(flags, get_order(size));
+}
+
+static inline void jbd2_free(void *ptr, size_t size)
+{
+ free_pages((unsigned long)ptr, get_order(size));
+};

-#define jbd_kmalloc(size, flags) \
+#define jbd2_kmalloc(size, flags) \
__jbd2_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
#define jbd_rep_kmalloc(size, flags) \
__jbd2_kmalloc(__FUNCTION__, (size), (flags), 1)
@@ -959,12 +975,12 @@ void jbd2_journal_put_journal_head(struc
*/
extern struct kmem_cache *jbd2_handle_cache;

-static inline handle_t *jbd_alloc_handle(gfp_t gfp_flags)
+static inline handle_t *jbd2_alloc_handle(gfp_t gfp_flags)
{
return kmem_cache_alloc(jbd2_handle_cache, gfp_flags);
}

-static inline void jbd_free_handle(handle_t *handle)
+static inline void jbd2_free_handle(handle_t *handle)
{
kmem_cache_free(jbd2_handle_cache, handle);
}
Index: linux-2.6.23-rc5/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd2/journal.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd2/journal.c 2007-09-13 14:00:17.000000000 -0700
@@ -84,7 +84,6 @@ EXPORT_SYMBOL(jbd2_journal_force_commit)

static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
static void __journal_abort_soft (journal_t *journal, int errno);
-static int jbd2_journal_create_jbd_slab(size_t slab_size);

/*
* Helper function used to manage commit timeouts
@@ -335,10 +334,10 @@ repeat:
char *tmp;

jbd_unlock_bh_state(bh_in);
- tmp = jbd2_slab_alloc(bh_in->b_size, GFP_NOFS);
+ tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in->b_frozen_data) {
- jbd2_slab_free(tmp, bh_in->b_size);
+ jbd2_free(tmp, bh_in->b_size);
goto repeat;
}

@@ -655,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;

- journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+ journal = jbd2_kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -680,7 +679,7 @@ static journal_t * journal_init_common (
/* Set up a default-sized revoke table for the new mount. */
err = jbd2_journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
if (err) {
- kfree(journal);
+ jbd2_kfree(journal);
goto fail;
}
return journal;
@@ -729,7 +728,7 @@ journal_t * jbd2_journal_init_dev(struct
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
- kfree(journal);
+ jbd2_kfree(journal);
journal = NULL;
goto out;
}
@@ -783,7 +782,7 @@ journal_t * jbd2_journal_init_inode (str
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
- kfree(journal);
+ jbd2_kfree(journal);
return NULL;
}

@@ -792,7 +791,7 @@ journal_t * jbd2_journal_init_inode (str
if (err) {
printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
__FUNCTION__);
- kfree(journal);
+ jbd2_kfree(journal);
return NULL;
}

@@ -1096,13 +1095,6 @@ int jbd2_journal_load(journal_t *journal
}
}

- /*
- * Create a slab for this blocksize
- */
- err = jbd2_journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
- if (err)
- return err;
-
/* Let the recovery code check whether it needs to recover any
* data from the journal. */
if (jbd2_journal_recover(journal))
@@ -1167,7 +1159,7 @@ void jbd2_journal_destroy(journal_t *jou
if (journal->j_revoke)
jbd2_journal_destroy_revoke(journal);
kfree(journal->j_wbuf);
- kfree(journal);
+ jbd2_kfree(journal);
}


@@ -1627,86 +1619,6 @@ size_t journal_tag_bytes(journal_t *jour
}

/*
- * Simple support for retrying memory allocations. Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
- return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
- "jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
-};
-
-static void jbd2_journal_destroy_jbd_slabs(void)
-{
- int i;
-
- for (i = 0; i < JBD_MAX_SLABS; i++) {
- if (jbd_slab[i])
- kmem_cache_destroy(jbd_slab[i]);
- jbd_slab[i] = NULL;
- }
-}
-
-static int jbd2_journal_create_jbd_slab(size_t slab_size)
-{
- int i = JBD_SLAB_INDEX(slab_size);
-
- BUG_ON(i >= JBD_MAX_SLABS);
-
- /*
- * Check if we already have a slab created for this size
- */
- if (jbd_slab[i])
- return 0;
-
- /*
- * Create a slab and force alignment to be same as slabsize -
- * this will make sure that allocations won't cross the page
- * boundary.
- */
- jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
- slab_size, slab_size, 0, NULL);
- if (!jbd_slab[i]) {
- printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
- return -ENOMEM;
- }
- return 0;
-}
-
-void * jbd2_slab_alloc(size_t size, gfp_t flags)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
-}
-
-void jbd2_slab_free(void *ptr, size_t size)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- kmem_cache_free(jbd_slab[idx], ptr);
-}
-
-/*
* Journal_head storage management
*/
static struct kmem_cache *jbd2_journal_head_cache;
@@ -1893,13 +1805,13 @@ static void __journal_remove_journal_hea
printk(KERN_WARNING "%s: freeing "
"b_frozen_data\n",
__FUNCTION__);
- jbd2_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd2_free(jh->b_frozen_data, bh->b_size);
}
if (jh->b_committed_data) {
printk(KERN_WARNING "%s: freeing "
"b_committed_data\n",
__FUNCTION__);
- jbd2_slab_free(jh->b_committed_data, bh->b_size);
+ jbd2_free(jh->b_committed_data, bh->b_size);
}
bh->b_private = NULL;
jh->b_bh = NULL; /* debug, really */
@@ -2040,7 +1952,6 @@ static void jbd2_journal_destroy_caches(
jbd2_journal_destroy_revoke_caches();
jbd2_journal_destroy_jbd2_journal_head_cache();
jbd2_journal_destroy_handle_cache();
- jbd2_journal_destroy_jbd_slabs();
}

static int __init journal_init(void)
Index: linux-2.6.23-rc5/fs/jbd/commit.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/commit.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/commit.c 2007-09-13 13:40:03.000000000 -0700
@@ -375,7 +375,7 @@ void journal_commit_transaction(journal_
struct buffer_head *bh = jh2bh(jh);

jbd_lock_bh_state(bh);
- jbd_slab_free(jh->b_committed_data, bh->b_size);
+ jbd_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
jbd_unlock_bh_state(bh);
}
@@ -792,14 +792,14 @@ restart_loop:
* Otherwise, we can just throw away the frozen data now.
*/
if (jh->b_committed_data) {
- jbd_slab_free(jh->b_committed_data, bh->b_size);
+ jbd_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
if (jh->b_frozen_data) {
jh->b_committed_data = jh->b_frozen_data;
jh->b_frozen_data = NULL;
}
} else if (jh->b_frozen_data) {
- jbd_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd_free(jh->b_frozen_data, bh->b_size);
jh->b_frozen_data = NULL;
}

Index: linux-2.6.23-rc5/fs/jbd2/commit.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd2/commit.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd2/commit.c 2007-09-13 13:40:03.000000000 -0700
@@ -384,7 +384,7 @@ void jbd2_journal_commit_transaction(jou
struct buffer_head *bh = jh2bh(jh);

jbd_lock_bh_state(bh);
- jbd2_slab_free(jh->b_committed_data, bh->b_size);
+ jbd2_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
jbd_unlock_bh_state(bh);
}
@@ -801,14 +801,14 @@ restart_loop:
* Otherwise, we can just throw away the frozen data now.
*/
if (jh->b_committed_data) {
- jbd2_slab_free(jh->b_committed_data, bh->b_size);
+ jbd2_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
if (jh->b_frozen_data) {
jh->b_committed_data = jh->b_frozen_data;
jh->b_frozen_data = NULL;
}
} else if (jh->b_frozen_data) {
- jbd2_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd2_free(jh->b_frozen_data, bh->b_size);
jh->b_frozen_data = NULL;
}

Index: linux-2.6.23-rc5/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/transaction.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/transaction.c 2007-09-13 13:46:23.000000000 -0700
@@ -229,7 +229,7 @@ repeat_locked:
spin_unlock(&journal->j_state_lock);
out:
if (unlikely(new_transaction)) /* It's usually NULL */
- kfree(new_transaction);
+ jbd_kfree(new_transaction);
return ret;
}

@@ -668,7 +668,7 @@ repeat:
JBUFFER_TRACE(jh, "allocate memory for buffer");
jbd_unlock_bh_state(bh);
frozen_buffer =
- jbd_slab_alloc(jh2bh(jh)->b_size,
+ jbd_alloc(jh2bh(jh)->b_size,
GFP_NOFS);
if (!frozen_buffer) {
printk(KERN_EMERG
@@ -728,7 +728,7 @@ done:

out:
if (unlikely(frozen_buffer)) /* It's usually NULL */
- jbd_slab_free(frozen_buffer, bh->b_size);
+ jbd_free(frozen_buffer, bh->b_size);

JBUFFER_TRACE(jh, "exit");
return error;
@@ -881,7 +881,7 @@ int journal_get_undo_access(handle_t *ha

repeat:
if (!jh->b_committed_data) {
- committed_data = jbd_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+ committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
if (!committed_data) {
printk(KERN_EMERG "%s: No memory for committed data\n",
__FUNCTION__);
@@ -908,7 +908,7 @@ repeat:
out:
journal_put_journal_head(jh);
if (unlikely(committed_data))
- jbd_slab_free(committed_data, bh->b_size);
+ jbd_free(committed_data, bh->b_size);
return err;
}

Index: linux-2.6.23-rc5/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd2/transaction.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd2/transaction.c 2007-09-13 13:59:20.000000000 -0700
@@ -96,7 +96,7 @@ static int start_this_handle(journal_t *

alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = jbd_kmalloc(sizeof(*new_transaction),
+ new_transaction = jbd2_kmalloc(sizeof(*new_transaction),
GFP_NOFS);
if (!new_transaction) {
ret = -ENOMEM;
@@ -229,14 +229,14 @@ repeat_locked:
spin_unlock(&journal->j_state_lock);
out:
if (unlikely(new_transaction)) /* It's usually NULL */
- kfree(new_transaction);
+ jbd2_kfree(new_transaction);
return ret;
}

/* Allocate a new handle. This should probably be in a slab... */
static handle_t *new_handle(int nblocks)
{
- handle_t *handle = jbd_alloc_handle(GFP_NOFS);
+ handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
if (!handle)
return NULL;
memset(handle, 0, sizeof(*handle));
@@ -282,7 +282,7 @@ handle_t *jbd2_journal_start(journal_t *

err = start_this_handle(journal, handle);
if (err < 0) {
- jbd_free_handle(handle);
+ jbd2_free_handle(handle);
current->journal_info = NULL;
handle = ERR_PTR(err);
}
@@ -668,7 +668,7 @@ repeat:
JBUFFER_TRACE(jh, "allocate memory for buffer");
jbd_unlock_bh_state(bh);
frozen_buffer =
- jbd2_slab_alloc(jh2bh(jh)->b_size,
+ jbd2_alloc(jh2bh(jh)->b_size,
GFP_NOFS);
if (!frozen_buffer) {
printk(KERN_EMERG
@@ -728,7 +728,7 @@ done:

out:
if (unlikely(frozen_buffer)) /* It's usually NULL */
- jbd2_slab_free(frozen_buffer, bh->b_size);
+ jbd2_free(frozen_buffer, bh->b_size);

JBUFFER_TRACE(jh, "exit");
return error;
@@ -881,7 +881,7 @@ int jbd2_journal_get_undo_access(handle_

repeat:
if (!jh->b_committed_data) {
- committed_data = jbd2_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+ committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
if (!committed_data) {
printk(KERN_EMERG "%s: No memory for committed data\n",
__FUNCTION__);
@@ -908,7 +908,7 @@ repeat:
out:
jbd2_journal_put_journal_head(jh);
if (unlikely(committed_data))
- jbd2_slab_free(committed_data, bh->b_size);
+ jbd2_free(committed_data, bh->b_size);
return err;
}

@@ -1411,7 +1411,7 @@ int jbd2_journal_stop(handle_t *handle)
spin_unlock(&journal->j_state_lock);
}

- jbd_free_handle(handle);
+ jbd2_free_handle(handle);
return err;
}

Index: linux-2.6.23-rc5/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/checkpoint.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/checkpoint.c 2007-09-14 09:57:21.000000000 -0700
@@ -693,5 +693,5 @@ void __journal_drop_transaction(journal_
J_ASSERT(journal->j_running_transaction != transaction);

jbd_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
- kfree(transaction);
+ jbd_kfree(transaction);
}
Index: linux-2.6.23-rc5/fs/jbd2/checkpoint.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd2/checkpoint.c 2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd2/checkpoint.c 2007-09-14 09:57:03.000000000 -0700
@@ -693,5 +693,5 @@ void __jbd2_journal_drop_transaction(jou
J_ASSERT(journal->j_running_transaction != transaction);

jbd_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
- kfree(transaction);
+ jbd2_kfree(transaction);
}


2007-09-14 18:58:25

by Christoph Lameter

Subject: Re: [PATCH] JBD slab cleanups

Thanks, Mingming.


2007-09-17 19:30:13

by Mingming Cao

Subject: Re: [PATCH] JBD slab cleanups

On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> jbd/jbd2: Replace slab allocations with page cache allocations
>
> From: Christoph Lameter <[email protected]>
>
> JBD should not pass slab pages down to the block layer.
> Use page allocator pages instead. This will also prepare
> JBD for the large blocksize patchset.
>

Currently, memory allocation for committed_data (and frozen_buffer) attached to
a buffer head is done through JBD's private slab management. As Christoph
Hellwig pointed out, this is broken: JBD should not pass slab pages down to the
I/O layer. He suggested using get_free_pages() directly instead.

The problem with this patch, as Andreas Dilger pointed out today on the ext4
interlock call, is that for 1k and 2k block size ext2/3/4, get_free_pages()
rounds every buffer up to a full page and so wastes 1/2-3/4 of each page.

What was the original intention behind setting up slabs for committed_data (and
frozen_buffer) in JBD? Why not use kmalloc?

Mingming

> Tested on 2.6.23-rc6 with fsx runs fine.
>
> Signed-off-by: Christoph Lameter <[email protected]>
> Signed-off-by: Mingming Cao <[email protected]>
> ---
> fs/jbd/checkpoint.c | 2
> fs/jbd/commit.c | 6 +-
> fs/jbd/journal.c | 107 ++++---------------------------------------------
> fs/jbd/transaction.c | 10 ++--
> fs/jbd2/checkpoint.c | 2
> fs/jbd2/commit.c | 6 +-
> fs/jbd2/journal.c | 109 ++++----------------------------------------------
> fs/jbd2/transaction.c | 18 ++++----
> include/linux/jbd.h | 23 +++++++++-
> include/linux/jbd2.h | 28 ++++++++++--
> 10 files changed, 83 insertions(+), 228 deletions(-)
>
> Index: linux-2.6.23-rc5/fs/jbd/journal.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd/journal.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd/journal.c 2007-09-13 13:45:39.000000000 -0700
> @@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
>
> static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
> static void __journal_abort_soft (journal_t *journal, int errno);
> -static int journal_create_jbd_slab(size_t slab_size);
>
> /*
> * Helper function used to manage commit timeouts
> @@ -334,10 +333,10 @@ repeat:
> char *tmp;
>
> jbd_unlock_bh_state(bh_in);
> - tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
> + tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
> jbd_lock_bh_state(bh_in);
> if (jh_in->b_frozen_data) {
> - jbd_slab_free(tmp, bh_in->b_size);
> + jbd_free(tmp, bh_in->b_size);
> goto repeat;
> }
>
> @@ -679,7 +678,7 @@ static journal_t * journal_init_common (
> /* Set up a default-sized revoke table for the new mount. */
> err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
> if (err) {
> - kfree(journal);
> + jbd_kfree(journal);
> goto fail;
> }
> return journal;
> @@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc
> if (!journal->j_wbuf) {
> printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> __FUNCTION__);
> - kfree(journal);
> + jbd_kfree(journal);
> journal = NULL;
> goto out;
> }
> @@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i
> if (!journal->j_wbuf) {
> printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> __FUNCTION__);
> - kfree(journal);
> + jbd_kfree(journal);
> return NULL;
> }
>
> @@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i
> if (err) {
> printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
> __FUNCTION__);
> - kfree(journal);
> + jbd_kfree(journal);
> return NULL;
> }
>
> @@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
> }
> }
>
> - /*
> - * Create a slab for this blocksize
> - */
> - err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
> - if (err)
> - return err;
> -
> /* Let the recovery code check whether it needs to recover any
> * data from the journal. */
> if (journal_recover(journal))
> @@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal)
> if (journal->j_revoke)
> journal_destroy_revoke(journal);
> kfree(journal->j_wbuf);
> - kfree(journal);
> + jbd_kfree(journal);
> }
>
>
> @@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
> }
>
> /*
> - * Simple support for retrying memory allocations. Introduced to help to
> - * debug different VM deadlock avoidance strategies.
> - */
> -void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
> -{
> - return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
> -}
> -
> -/*
> - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
> - * and allocate frozen and commit buffers from these slabs.
> - *
> - * Reason for doing this is to avoid, SLAB_DEBUG - since it could
> - * cause bh to cross page boundary.
> - */
> -
> -#define JBD_MAX_SLABS 5
> -#define JBD_SLAB_INDEX(size) (size >> 11)
> -
> -static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
> -static const char *jbd_slab_names[JBD_MAX_SLABS] = {
> - "jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
> -};
> -
> -static void journal_destroy_jbd_slabs(void)
> -{
> - int i;
> -
> - for (i = 0; i < JBD_MAX_SLABS; i++) {
> - if (jbd_slab[i])
> - kmem_cache_destroy(jbd_slab[i]);
> - jbd_slab[i] = NULL;
> - }
> -}
> -
> -static int journal_create_jbd_slab(size_t slab_size)
> -{
> - int i = JBD_SLAB_INDEX(slab_size);
> -
> - BUG_ON(i >= JBD_MAX_SLABS);
> -
> - /*
> - * Check if we already have a slab created for this size
> - */
> - if (jbd_slab[i])
> - return 0;
> -
> - /*
> - * Create a slab and force alignment to be same as slabsize -
> - * this will make sure that allocations won't cross the page
> - * boundary.
> - */
> - jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
> - slab_size, slab_size, 0, NULL);
> - if (!jbd_slab[i]) {
> - printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
> - return -ENOMEM;
> - }
> - return 0;
> -}
> -
> -void * jbd_slab_alloc(size_t size, gfp_t flags)
> -{
> - int idx;
> -
> - idx = JBD_SLAB_INDEX(size);
> - BUG_ON(jbd_slab[idx] == NULL);
> - return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
> -}
> -
> -void jbd_slab_free(void *ptr, size_t size)
> -{
> - int idx;
> -
> - idx = JBD_SLAB_INDEX(size);
> - BUG_ON(jbd_slab[idx] == NULL);
> - kmem_cache_free(jbd_slab[idx], ptr);
> -}
> -
> -/*
> * Journal_head storage management
> */
> static struct kmem_cache *journal_head_cache;
> @@ -1881,13 +1793,13 @@ static void __journal_remove_journal_hea
> printk(KERN_WARNING "%s: freeing "
> "b_frozen_data\n",
> __FUNCTION__);
> - jbd_slab_free(jh->b_frozen_data, bh->b_size);
> + jbd_free(jh->b_frozen_data, bh->b_size);
> }
> if (jh->b_committed_data) {
> printk(KERN_WARNING "%s: freeing "
> "b_committed_data\n",
> __FUNCTION__);
> - jbd_slab_free(jh->b_committed_data, bh->b_size);
> + jbd_free(jh->b_committed_data, bh->b_size);
> }
> bh->b_private = NULL;
> jh->b_bh = NULL; /* debug, really */
> @@ -2042,7 +1954,6 @@ static void journal_destroy_caches(void)
> journal_destroy_revoke_caches();
> journal_destroy_journal_head_cache();
> journal_destroy_handle_cache();
> - journal_destroy_jbd_slabs();
> }
>
> static int __init journal_init(void)
> Index: linux-2.6.23-rc5/include/linux/jbd.h
> ===================================================================
> --- linux-2.6.23-rc5.orig/include/linux/jbd.h 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/include/linux/jbd.h 2007-09-13 13:42:27.000000000 -0700
> @@ -71,9 +71,26 @@ extern int journal_enable_debug;
> #define jbd_debug(f, a...) /**/
> #endif
>
> -extern void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
> -extern void * jbd_slab_alloc(size_t size, gfp_t flags);
> -extern void jbd_slab_free(void *ptr, size_t size);
> +static inline void *__jbd_kmalloc(const char *where, size_t size,
> + gfp_t flags, int retry)
> +{
> + return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
> +}
> +
> +static inline void jbd_kfree(void *ptr)
> +{
> + return kfree(ptr);
> +}
> +
> +static inline void *jbd_alloc(size_t size, gfp_t flags)
> +{
> + return (void *)__get_free_pages(flags, get_order(size));
> +}
> +
> +static inline void jbd_free(void *ptr, size_t size)
> +{
> + free_pages((unsigned long)ptr, get_order(size));
> +};
>
> #define jbd_kmalloc(size, flags) \
> __jbd_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
> Index: linux-2.6.23-rc5/include/linux/jbd2.h
> ===================================================================
> --- linux-2.6.23-rc5.orig/include/linux/jbd2.h 2007-09-13 13:37:58.000000000 -0700
> +++ linux-2.6.23-rc5/include/linux/jbd2.h 2007-09-13 13:51:49.000000000 -0700
> @@ -71,11 +71,27 @@ extern u8 jbd2_journal_enable_debug;
> #define jbd_debug(f, a...) /**/
> #endif
>
> -extern void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
> -extern void * jbd2_slab_alloc(size_t size, gfp_t flags);
> -extern void jbd2_slab_free(void *ptr, size_t size);
> +static inline void *__jbd2_kmalloc(const char *where, size_t size,
> + gfp_t flags, int retry)
> +{
> + return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
> +}
> +static inline void jbd2_kfree(void *ptr)
> +{
> + return kfree(ptr);
> +}
> +
> +static inline void *jbd2_alloc(size_t size, gfp_t flags)
> +{
> + return (void *)__get_free_pages(flags, get_order(size));
> +}
> +
> +static inline void jbd2_free(void *ptr, size_t size)
> +{
> + free_pages((unsigned long)ptr, get_order(size));
> +};
>
> -#define jbd_kmalloc(size, flags) \
> +#define jbd2_kmalloc(size, flags) \
> __jbd2_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
> #define jbd_rep_kmalloc(size, flags) \
> __jbd2_kmalloc(__FUNCTION__, (size), (flags), 1)
> @@ -959,12 +975,12 @@ void jbd2_journal_put_journal_head(struc
> */
> extern struct kmem_cache *jbd2_handle_cache;
>
> -static inline handle_t *jbd_alloc_handle(gfp_t gfp_flags)
> +static inline handle_t *jbd2_alloc_handle(gfp_t gfp_flags)
> {
> return kmem_cache_alloc(jbd2_handle_cache, gfp_flags);
> }
>
> -static inline void jbd_free_handle(handle_t *handle)
> +static inline void jbd2_free_handle(handle_t *handle)
> {
> kmem_cache_free(jbd2_handle_cache, handle);
> }
> Index: linux-2.6.23-rc5/fs/jbd2/journal.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd2/journal.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd2/journal.c 2007-09-13 14:00:17.000000000 -0700
> @@ -84,7 +84,6 @@ EXPORT_SYMBOL(jbd2_journal_force_commit)
>
> static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
> static void __journal_abort_soft (journal_t *journal, int errno);
> -static int jbd2_journal_create_jbd_slab(size_t slab_size);
>
> /*
> * Helper function used to manage commit timeouts
> @@ -335,10 +334,10 @@ repeat:
> char *tmp;
>
> jbd_unlock_bh_state(bh_in);
> - tmp = jbd2_slab_alloc(bh_in->b_size, GFP_NOFS);
> + tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
> jbd_lock_bh_state(bh_in);
> if (jh_in->b_frozen_data) {
> - jbd2_slab_free(tmp, bh_in->b_size);
> + jbd2_free(tmp, bh_in->b_size);
> goto repeat;
> }
>
> @@ -655,7 +654,7 @@ static journal_t * journal_init_common (
> journal_t *journal;
> int err;
>
> - journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
> + journal = jbd2_kmalloc(sizeof(*journal), GFP_KERNEL);
> if (!journal)
> goto fail;
> memset(journal, 0, sizeof(*journal));
> @@ -680,7 +679,7 @@ static journal_t * journal_init_common (
> /* Set up a default-sized revoke table for the new mount. */
> err = jbd2_journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
> if (err) {
> - kfree(journal);
> + jbd2_kfree(journal);
> goto fail;
> }
> return journal;
> @@ -729,7 +728,7 @@ journal_t * jbd2_journal_init_dev(struct
> if (!journal->j_wbuf) {
> printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> __FUNCTION__);
> - kfree(journal);
> + jbd2_kfree(journal);
> journal = NULL;
> goto out;
> }
> @@ -783,7 +782,7 @@ journal_t * jbd2_journal_init_inode (str
> if (!journal->j_wbuf) {
> printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> __FUNCTION__);
> - kfree(journal);
> + jbd2_kfree(journal);
> return NULL;
> }
>
> @@ -792,7 +791,7 @@ journal_t * jbd2_journal_init_inode (str
> if (err) {
> printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
> __FUNCTION__);
> - kfree(journal);
> + jbd2_kfree(journal);
> return NULL;
> }
>
> @@ -1096,13 +1095,6 @@ int jbd2_journal_load(journal_t *journal
> }
> }
>
> - /*
> - * Create a slab for this blocksize
> - */
> - err = jbd2_journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
> - if (err)
> - return err;
> -
> /* Let the recovery code check whether it needs to recover any
> * data from the journal. */
> if (jbd2_journal_recover(journal))
> @@ -1167,7 +1159,7 @@ void jbd2_journal_destroy(journal_t *jou
> if (journal->j_revoke)
> jbd2_journal_destroy_revoke(journal);
> kfree(journal->j_wbuf);
> - kfree(journal);
> + jbd2_kfree(journal);
> }
>
>
> @@ -1627,86 +1619,6 @@ size_t journal_tag_bytes(journal_t *jour
> }
>
> /*
> - * Simple support for retrying memory allocations. Introduced to help to
> - * debug different VM deadlock avoidance strategies.
> - */
> -void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
> -{
> - return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
> -}
> -
> -/*
> - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
> - * and allocate frozen and commit buffers from these slabs.
> - *
> - * Reason for doing this is to avoid, SLAB_DEBUG - since it could
> - * cause bh to cross page boundary.
> - */
> -
> -#define JBD_MAX_SLABS 5
> -#define JBD_SLAB_INDEX(size) (size >> 11)
> -
> -static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
> -static const char *jbd_slab_names[JBD_MAX_SLABS] = {
> - "jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
> -};
> -
> -static void jbd2_journal_destroy_jbd_slabs(void)
> -{
> - int i;
> -
> - for (i = 0; i < JBD_MAX_SLABS; i++) {
> - if (jbd_slab[i])
> - kmem_cache_destroy(jbd_slab[i]);
> - jbd_slab[i] = NULL;
> - }
> -}
> -
> -static int jbd2_journal_create_jbd_slab(size_t slab_size)
> -{
> - int i = JBD_SLAB_INDEX(slab_size);
> -
> - BUG_ON(i >= JBD_MAX_SLABS);
> -
> - /*
> - * Check if we already have a slab created for this size
> - */
> - if (jbd_slab[i])
> - return 0;
> -
> - /*
> - * Create a slab and force alignment to be same as slabsize -
> - * this will make sure that allocations won't cross the page
> - * boundary.
> - */
> - jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
> - slab_size, slab_size, 0, NULL);
> - if (!jbd_slab[i]) {
> - printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
> - return -ENOMEM;
> - }
> - return 0;
> -}
> -
> -void * jbd2_slab_alloc(size_t size, gfp_t flags)
> -{
> - int idx;
> -
> - idx = JBD_SLAB_INDEX(size);
> - BUG_ON(jbd_slab[idx] == NULL);
> - return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
> -}
> -
> -void jbd2_slab_free(void *ptr, size_t size)
> -{
> - int idx;
> -
> - idx = JBD_SLAB_INDEX(size);
> - BUG_ON(jbd_slab[idx] == NULL);
> - kmem_cache_free(jbd_slab[idx], ptr);
> -}
> -
> -/*
> * Journal_head storage management
> */
> static struct kmem_cache *jbd2_journal_head_cache;
> @@ -1893,13 +1805,13 @@ static void __journal_remove_journal_hea
> printk(KERN_WARNING "%s: freeing "
> "b_frozen_data\n",
> __FUNCTION__);
> - jbd2_slab_free(jh->b_frozen_data, bh->b_size);
> + jbd2_free(jh->b_frozen_data, bh->b_size);
> }
> if (jh->b_committed_data) {
> printk(KERN_WARNING "%s: freeing "
> "b_committed_data\n",
> __FUNCTION__);
> - jbd2_slab_free(jh->b_committed_data, bh->b_size);
> + jbd2_free(jh->b_committed_data, bh->b_size);
> }
> bh->b_private = NULL;
> jh->b_bh = NULL; /* debug, really */
> @@ -2040,7 +1952,6 @@ static void jbd2_journal_destroy_caches(
> jbd2_journal_destroy_revoke_caches();
> jbd2_journal_destroy_jbd2_journal_head_cache();
> jbd2_journal_destroy_handle_cache();
> - jbd2_journal_destroy_jbd_slabs();
> }
>
> static int __init journal_init(void)
> Index: linux-2.6.23-rc5/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd/commit.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd/commit.c 2007-09-13 13:40:03.000000000 -0700
> @@ -375,7 +375,7 @@ void journal_commit_transaction(journal_
> struct buffer_head *bh = jh2bh(jh);
>
> jbd_lock_bh_state(bh);
> - jbd_slab_free(jh->b_committed_data, bh->b_size);
> + jbd_free(jh->b_committed_data, bh->b_size);
> jh->b_committed_data = NULL;
> jbd_unlock_bh_state(bh);
> }
> @@ -792,14 +792,14 @@ restart_loop:
> * Otherwise, we can just throw away the frozen data now.
> */
> if (jh->b_committed_data) {
> - jbd_slab_free(jh->b_committed_data, bh->b_size);
> + jbd_free(jh->b_committed_data, bh->b_size);
> jh->b_committed_data = NULL;
> if (jh->b_frozen_data) {
> jh->b_committed_data = jh->b_frozen_data;
> jh->b_frozen_data = NULL;
> }
> } else if (jh->b_frozen_data) {
> - jbd_slab_free(jh->b_frozen_data, bh->b_size);
> + jbd_free(jh->b_frozen_data, bh->b_size);
> jh->b_frozen_data = NULL;
> }
>
> Index: linux-2.6.23-rc5/fs/jbd2/commit.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd2/commit.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd2/commit.c 2007-09-13 13:40:03.000000000 -0700
> @@ -384,7 +384,7 @@ void jbd2_journal_commit_transaction(jou
> struct buffer_head *bh = jh2bh(jh);
>
> jbd_lock_bh_state(bh);
> - jbd2_slab_free(jh->b_committed_data, bh->b_size);
> + jbd2_free(jh->b_committed_data, bh->b_size);
> jh->b_committed_data = NULL;
> jbd_unlock_bh_state(bh);
> }
> @@ -801,14 +801,14 @@ restart_loop:
> * Otherwise, we can just throw away the frozen data now.
> */
> if (jh->b_committed_data) {
> - jbd2_slab_free(jh->b_committed_data, bh->b_size);
> + jbd2_free(jh->b_committed_data, bh->b_size);
> jh->b_committed_data = NULL;
> if (jh->b_frozen_data) {
> jh->b_committed_data = jh->b_frozen_data;
> jh->b_frozen_data = NULL;
> }
> } else if (jh->b_frozen_data) {
> - jbd2_slab_free(jh->b_frozen_data, bh->b_size);
> + jbd2_free(jh->b_frozen_data, bh->b_size);
> jh->b_frozen_data = NULL;
> }
>
> Index: linux-2.6.23-rc5/fs/jbd/transaction.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd/transaction.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd/transaction.c 2007-09-13 13:46:23.000000000 -0700
> @@ -229,7 +229,7 @@ repeat_locked:
> spin_unlock(&journal->j_state_lock);
> out:
> if (unlikely(new_transaction)) /* It's usually NULL */
> - kfree(new_transaction);
> + jbd_kfree(new_transaction);
> return ret;
> }
>
> @@ -668,7 +668,7 @@ repeat:
> JBUFFER_TRACE(jh, "allocate memory for buffer");
> jbd_unlock_bh_state(bh);
> frozen_buffer =
> - jbd_slab_alloc(jh2bh(jh)->b_size,
> + jbd_alloc(jh2bh(jh)->b_size,
> GFP_NOFS);
> if (!frozen_buffer) {
> printk(KERN_EMERG
> @@ -728,7 +728,7 @@ done:
>
> out:
> if (unlikely(frozen_buffer)) /* It's usually NULL */
> - jbd_slab_free(frozen_buffer, bh->b_size);
> + jbd_free(frozen_buffer, bh->b_size);
>
> JBUFFER_TRACE(jh, "exit");
> return error;
> @@ -881,7 +881,7 @@ int journal_get_undo_access(handle_t *ha
>
> repeat:
> if (!jh->b_committed_data) {
> - committed_data = jbd_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> + committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> if (!committed_data) {
> printk(KERN_EMERG "%s: No memory for committed data\n",
> __FUNCTION__);
> @@ -908,7 +908,7 @@ repeat:
> out:
> journal_put_journal_head(jh);
> if (unlikely(committed_data))
> - jbd_slab_free(committed_data, bh->b_size);
> + jbd_free(committed_data, bh->b_size);
> return err;
> }
>
> Index: linux-2.6.23-rc5/fs/jbd2/transaction.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd2/transaction.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd2/transaction.c 2007-09-13 13:59:20.000000000 -0700
> @@ -96,7 +96,7 @@ static int start_this_handle(journal_t *
>
> alloc_transaction:
> if (!journal->j_running_transaction) {
> - new_transaction = jbd_kmalloc(sizeof(*new_transaction),
> + new_transaction = jbd2_kmalloc(sizeof(*new_transaction),
> GFP_NOFS);
> if (!new_transaction) {
> ret = -ENOMEM;
> @@ -229,14 +229,14 @@ repeat_locked:
> spin_unlock(&journal->j_state_lock);
> out:
> if (unlikely(new_transaction)) /* It's usually NULL */
> - kfree(new_transaction);
> + jbd2_kfree(new_transaction);
> return ret;
> }
>
> /* Allocate a new handle. This should probably be in a slab... */
> static handle_t *new_handle(int nblocks)
> {
> - handle_t *handle = jbd_alloc_handle(GFP_NOFS);
> + handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
> if (!handle)
> return NULL;
> memset(handle, 0, sizeof(*handle));
> @@ -282,7 +282,7 @@ handle_t *jbd2_journal_start(journal_t *
>
> err = start_this_handle(journal, handle);
> if (err < 0) {
> - jbd_free_handle(handle);
> + jbd2_free_handle(handle);
> current->journal_info = NULL;
> handle = ERR_PTR(err);
> }
> @@ -668,7 +668,7 @@ repeat:
> JBUFFER_TRACE(jh, "allocate memory for buffer");
> jbd_unlock_bh_state(bh);
> frozen_buffer =
> - jbd2_slab_alloc(jh2bh(jh)->b_size,
> + jbd2_alloc(jh2bh(jh)->b_size,
> GFP_NOFS);
> if (!frozen_buffer) {
> printk(KERN_EMERG
> @@ -728,7 +728,7 @@ done:
>
> out:
> if (unlikely(frozen_buffer)) /* It's usually NULL */
> - jbd2_slab_free(frozen_buffer, bh->b_size);
> + jbd2_free(frozen_buffer, bh->b_size);
>
> JBUFFER_TRACE(jh, "exit");
> return error;
> @@ -881,7 +881,7 @@ int jbd2_journal_get_undo_access(handle_
>
> repeat:
> if (!jh->b_committed_data) {
> - committed_data = jbd2_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> + committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> if (!committed_data) {
> printk(KERN_EMERG "%s: No memory for committed data\n",
> __FUNCTION__);
> @@ -908,7 +908,7 @@ repeat:
> out:
> jbd2_journal_put_journal_head(jh);
> if (unlikely(committed_data))
> - jbd2_slab_free(committed_data, bh->b_size);
> + jbd2_free(committed_data, bh->b_size);
> return err;
> }
>
> @@ -1411,7 +1411,7 @@ int jbd2_journal_stop(handle_t *handle)
> spin_unlock(&journal->j_state_lock);
> }
>
> - jbd_free_handle(handle);
> + jbd2_free_handle(handle);
> return err;
> }
>
> Index: linux-2.6.23-rc5/fs/jbd/checkpoint.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd/checkpoint.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd/checkpoint.c 2007-09-14 09:57:21.000000000 -0700
> @@ -693,5 +693,5 @@ void __journal_drop_transaction(journal_
> J_ASSERT(journal->j_running_transaction != transaction);
>
> jbd_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
> - kfree(transaction);
> + jbd_kfree(transaction);
> }
> Index: linux-2.6.23-rc5/fs/jbd2/checkpoint.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd2/checkpoint.c 2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd2/checkpoint.c 2007-09-14 09:57:03.000000000 -0700
> @@ -693,5 +693,5 @@ void __jbd2_journal_drop_transaction(jou
> J_ASSERT(journal->j_running_transaction != transaction);
>
> jbd_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
> - kfree(transaction);
> + jbd2_kfree(transaction);
> }
>
>

2007-09-17 19:34:59

by Christoph Hellwig

Subject: Re: [PATCH] JBD slab cleanups

On Mon, Sep 17, 2007 at 12:29:51PM -0700, Mingming Cao wrote:
> The problem with this patch, as Andreas Dilger pointed today in ext4
> interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste
> 1/3-1/2 page space.
>
> What was the originally intention to set up slabs for committed_data(and
> frozen_buffer) in JBD? Why not using kmalloc?

kmalloc is using slabs :)

The intent was to avoid the wasted memory, but as we've repeated a gazillion
times, wasted memory on a rather rare codepath doesn't really matter when the
alternative is crashing random storage drivers.

2007-09-17 21:58:33

by Badari Pulavarty

Subject: Re: [PATCH] JBD slab cleanups

On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote:
> On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> > jbd/jbd2: Replace slab allocations with page cache allocations
> >
> > From: Christoph Lameter <[email protected]>
> >
> > JBD should not pass slab pages down to the block layer.
> > Use page allocator pages instead. This will also prepare
> > JBD for the large blocksize patchset.
> >
>
> Currently memory allocation for committed_data(and frozen_buffer) for
> bufferhead is done through jbd slab management, as Christoph Hellwig
> pointed out that this is broken as jbd should not pass slab pages down
> to IO layer. and suggested to use get_free_pages() directly.
>
> The problem with this patch, as Andreas Dilger pointed today in ext4
> interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste
> 1/3-1/2 page space.
>
> What was the originally intention to set up slabs for committed_data(and
> frozen_buffer) in JBD? Why not using kmalloc?
>
> Mingming

Looks good. A small suggestion: get rid of the remaining kmalloc() usages and
consistently use jbd_kmalloc() or jbd2_kmalloc().

Thanks,
Badari

2007-09-17 22:57:44

by Mingming Cao

Subject: Re: [PATCH] JBD slab cleanups

On Mon, 2007-09-17 at 15:01 -0700, Badari Pulavarty wrote:
> On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote:
> > On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> > > jbd/jbd2: Replace slab allocations with page cache allocations
> > >
> > > From: Christoph Lameter <[email protected]>
> > >
> > > JBD should not pass slab pages down to the block layer.
> > > Use page allocator pages instead. This will also prepare
> > > JBD for the large blocksize patchset.
> > >
> >
> > Currently memory allocation for committed_data(and frozen_buffer) for
> > bufferhead is done through jbd slab management, as Christoph Hellwig
> > pointed out that this is broken as jbd should not pass slab pages down
> > to IO layer. and suggested to use get_free_pages() directly.
> >
> > The problem with this patch, as Andreas Dilger pointed today in ext4
> > interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste
> > 1/3-1/2 page space.
> >
> > What was the originally intention to set up slabs for committed_data(and
> > frozen_buffer) in JBD? Why not using kmalloc?
> >
> > Mingming
>
> Looks good. Small suggestion is to get rid of all kmalloc() usages and
> consistently use jbd_kmalloc() or jbd2_kmalloc().
>
> Thanks,
> Badari
>

Here is the incremental small cleanup patch.

Remove the remaining kmalloc() usages in jbd/jbd2 and consistently use jbd_kmalloc()/jbd2_kmalloc().


Signed-off-by: Mingming Cao <[email protected]>
---
fs/jbd/journal.c | 8 +++++---
fs/jbd/revoke.c | 12 ++++++------
fs/jbd2/journal.c | 8 +++++---
fs/jbd2/revoke.c | 12 ++++++------
4 files changed, 22 insertions(+), 18 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-17 14:32:16.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-17 14:33:59.000000000 -0700
@@ -723,7 +723,8 @@ journal_t * journal_init_dev(struct bloc
journal->j_blocksize = blocksize;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+ GFP_KERNEL);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -777,7 +778,8 @@ journal_t * journal_init_inode (struct i
/* journal descriptor can store up to n blocks -bzzz */
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+ GFP_KERNEL);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -1157,7 +1159,7 @@ void journal_destroy(journal_t *journal)
iput(journal->j_inode);
if (journal->j_revoke)
journal_destroy_revoke(journal);
- kfree(journal->j_wbuf);
+ jbd_kfree(journal->j_wbuf);
jbd_kfree(journal);
}

Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-17 14:32:22.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c 2007-09-17 14:35:13.000000000 -0700
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
journal->j_revoke->hash_shift = shift;

journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
if (!journal->j_revoke->hash_table) {
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
journal->j_revoke = NULL;
@@ -231,7 +231,7 @@ int journal_init_revoke(journal_t *journ

journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
if (!journal->j_revoke_table[1]) {
- kfree(journal->j_revoke_table[0]->hash_table);
+ jbd_kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
return -ENOMEM;
}
@@ -246,9 +246,9 @@ int journal_init_revoke(journal_t *journ
journal->j_revoke->hash_shift = shift;

journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
if (!journal->j_revoke->hash_table) {
- kfree(journal->j_revoke_table[0]->hash_table);
+ jbd_kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[1]);
journal->j_revoke = NULL;
@@ -280,7 +280,7 @@ void journal_destroy_revoke(journal_t *j
J_ASSERT (list_empty(hash_list));
}

- kfree(table->hash_table);
+ jbd_kfree(table->hash_table);
kmem_cache_free(revoke_table_cache, table);
journal->j_revoke = NULL;

@@ -293,7 +293,7 @@ void journal_destroy_revoke(journal_t *j
J_ASSERT (list_empty(hash_list));
}

- kfree(table->hash_table);
+ jbd_kfree(table->hash_table);
kmem_cache_free(revoke_table_cache, table);
journal->j_revoke = NULL;
}
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-17 14:32:39.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-17 14:53:15.000000000 -0700
@@ -724,7 +724,8 @@ journal_t * jbd2_journal_init_dev(struct
journal->j_blocksize = blocksize;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = jbd2_kmalloc(n * sizeof(struct buffer_head*),
+ GFP_KERNEL);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -778,7 +779,8 @@ journal_t * jbd2_journal_init_inode (str
/* journal descriptor can store up to n blocks -bzzz */
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = jbd2_kmalloc(n * sizeof(struct buffer_head*),
+ GFP_KERNEL);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -1158,7 +1160,7 @@ void jbd2_journal_destroy(journal_t *jou
iput(journal->j_inode);
if (journal->j_revoke)
jbd2_journal_destroy_revoke(journal);
- kfree(journal->j_wbuf);
+ jbd2_kfree(journal->j_wbuf);
jbd2_kfree(journal);
}

Index: linux-2.6.23-rc6/fs/jbd2/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/revoke.c 2007-09-17 14:32:34.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/revoke.c 2007-09-17 14:55:35.000000000 -0700
@@ -220,7 +220,7 @@ int jbd2_journal_init_revoke(journal_t *
journal->j_revoke->hash_shift = shift;

journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ jbd2_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
if (!journal->j_revoke->hash_table) {
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
journal->j_revoke = NULL;
@@ -232,7 +232,7 @@ int jbd2_journal_init_revoke(journal_t *

journal->j_revoke_table[1] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_KERNEL);
if (!journal->j_revoke_table[1]) {
- kfree(journal->j_revoke_table[0]->hash_table);
+ jbd2_kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
return -ENOMEM;
}
@@ -247,9 +247,9 @@ int jbd2_journal_init_revoke(journal_t *
journal->j_revoke->hash_shift = shift;

journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ jbd2_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
if (!journal->j_revoke->hash_table) {
- kfree(journal->j_revoke_table[0]->hash_table);
+ jbd2_kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[1]);
journal->j_revoke = NULL;
@@ -281,7 +281,7 @@ void jbd2_journal_destroy_revoke(journal
J_ASSERT (list_empty(hash_list));
}

- kfree(table->hash_table);
+ jbd2_kfree(table->hash_table);
kmem_cache_free(jbd2_revoke_table_cache, table);
journal->j_revoke = NULL;

@@ -294,7 +294,7 @@ void jbd2_journal_destroy_revoke(journal
J_ASSERT (list_empty(hash_list));
}

- kfree(table->hash_table);
+ jbd2_kfree(table->hash_table);
kmem_cache_free(jbd2_revoke_table_cache, table);
journal->j_revoke = NULL;
}

2007-09-18 09:04:24

by Christoph Hellwig

Subject: Re: [PATCH] JBD slab cleanups

On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> Here is the incremental small cleanup patch.
>
> Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.

Shouldn't we kill jbd_kmalloc instead?

2007-09-18 16:36:04

by Mingming Cao

Subject: Re: [PATCH] JBD slab cleanups

On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
> On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> > Here is the incremental small cleanup patch.
> >
> > Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.
>
> Shouldn't we kill jbd_kmalloc instead?
>

It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
places to handle memory (de)allocation (smaller than page size) via
kmalloc/kfree, so in the future, if we need to change memory allocation
in jbd (e.g. not using kmalloc, or using a different flag), we don't need
to touch every place in the jbd code that calls jbd_kmalloc.

Regards,
Mingming

2007-09-18 18:05:19

by Dave Kleikamp

Subject: Re: [PATCH] JBD slab cleanups

On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote:
> On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
> > On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> > > Here is the incremental small cleanup patch.
> > >
> > > Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.
> >
> > Shouldn't we kill jbd_kmalloc instead?
> >
>
> It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
> places to handle memory (de)allocation (smaller than page size) via
> kmalloc/kfree, so in the future, if we need to change memory allocation
> in jbd (e.g. not using kmalloc, or using a different flag), we don't need
> to touch every place in the jbd code that calls jbd_kmalloc.

I disagree. Why would jbd need to globally change the way it allocates
memory? It currently uses kmalloc (and jbd_kmalloc) for allocating a
variety of structures. Having to change one particular instance won't
necessarily mean we want to change all of them. Adding unnecessary
wrappers only obfuscates the code, making it harder to understand. You
wouldn't want every subsystem to have its own *_kmalloc() that took
different arguments. Besides, there aren't that many calls to kmalloc
and kfree in the jbd code, so there wouldn't be much pain in changing
GFP flags or whatever, if it ever needed to be done.

Shaggy
--
David Kleikamp
IBM Linux Technology Center

2007-09-19 01:00:25

by Mingming Cao

Subject: Re: [PATCH] JBD slab cleanups

On Tue, 2007-09-18 at 13:04 -0500, Dave Kleikamp wrote:
> On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote:
> > On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
> > > On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> > > > Here is the incremental small cleanup patch.
> > > >
> > > > Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.
> > >
> > > Shouldn't we kill jbd_kmalloc instead?
> > >
> >
> > It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
> > places to handle memory (de)allocation (smaller than page size) via
> > kmalloc/kfree, so in the future, if we need to change memory allocation
> > in jbd (e.g. not using kmalloc, or using a different flag), we don't need
> > to touch every place in the jbd code that calls jbd_kmalloc.
>
> I disagree. Why would jbd need to globally change the way it allocates
> memory? It currently uses kmalloc (and jbd_kmalloc) for allocating a
> variety of structures. Having to change one particular instance won't
> necessarily mean we want to change all of them. Adding unnecessary
> wrappers only obfuscates the code, making it harder to understand. You
> wouldn't want every subsystem to have its own *_kmalloc() that took
> different arguments. Besides, there aren't that many calls to kmalloc
> and kfree in the jbd code, so there wouldn't be much pain in changing
> GFP flags or whatever, if it ever needed to be done.
>
> Shaggy

Okay, points taken. Here is the updated patch to get rid of slab
management and jbd_kmalloc from jbd entirely. This patch is intended to
replace the patch in the mm tree. Andrew, could you pick up this one
instead?

Thanks,

Mingming


jbd/jbd2: JBD memory allocation cleanups

From: Christoph Lameter <[email protected]>

JBD: Replace slab allocations with page cache allocations

JBD allocates memory for committed_data and frozen_data from slab. However,
JBD should not pass slab pages down to the block layer. Use page allocator
pages instead. This also prepares JBD for the large blocksize patchset.

This patch also cleans up jbd_kmalloc and replaces it with kmalloc directly.

Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>

---
fs/jbd/commit.c | 6 +--
fs/jbd/journal.c | 99 ++------------------------------------------------
fs/jbd/transaction.c | 12 +++---
fs/jbd2/commit.c | 6 +--
fs/jbd2/journal.c | 99 ++------------------------------------------------
fs/jbd2/transaction.c | 18 ++++-----
include/linux/jbd.h | 18 +++++----
include/linux/jbd2.h | 21 +++++-----
8 files changed, 52 insertions(+), 227 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-18 17:51:21.000000000 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);

static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);

/*
* Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
char *tmp;

jbd_unlock_bh_state(bh_in);
- tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+ tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in->b_frozen_data) {
- jbd_slab_free(tmp, bh_in->b_size);
+ jbd_free(tmp, bh_in->b_size);
goto repeat;
}

@@ -654,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;

- journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+ journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
}
}

- /*
- * Create a slab for this blocksize
- */
- err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
- if (err)
- return err;
-
/* Let the recovery code check whether it needs to recover any
* data from the journal. */
if (journal_recover(journal))
@@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
}

/*
- * Simple support for retrying memory allocations. Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
- return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
- "jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
-};
-
-static void journal_destroy_jbd_slabs(void)
-{
- int i;
-
- for (i = 0; i < JBD_MAX_SLABS; i++) {
- if (jbd_slab[i])
- kmem_cache_destroy(jbd_slab[i]);
- jbd_slab[i] = NULL;
- }
-}
-
-static int journal_create_jbd_slab(size_t slab_size)
-{
- int i = JBD_SLAB_INDEX(slab_size);
-
- BUG_ON(i >= JBD_MAX_SLABS);
-
- /*
- * Check if we already have a slab created for this size
- */
- if (jbd_slab[i])
- return 0;
-
- /*
- * Create a slab and force alignment to be same as slabsize -
- * this will make sure that allocations won't cross the page
- * boundary.
- */
- jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
- slab_size, slab_size, 0, NULL);
- if (!jbd_slab[i]) {
- printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
- return -ENOMEM;
- }
- return 0;
-}
-
-void * jbd_slab_alloc(size_t size, gfp_t flags)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
-}
-
-void jbd_slab_free(void *ptr, size_t size)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- kmem_cache_free(jbd_slab[idx], ptr);
-}
-
-/*
* Journal_head storage management
*/
static struct kmem_cache *journal_head_cache;
@@ -1881,13 +1793,13 @@ static void __journal_remove_journal_hea
printk(KERN_WARNING "%s: freeing "
"b_frozen_data\n",
__FUNCTION__);
- jbd_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd_free(jh->b_frozen_data, bh->b_size);
}
if (jh->b_committed_data) {
printk(KERN_WARNING "%s: freeing "
"b_committed_data\n",
__FUNCTION__);
- jbd_slab_free(jh->b_committed_data, bh->b_size);
+ jbd_free(jh->b_committed_data, bh->b_size);
}
bh->b_private = NULL;
jh->b_bh = NULL; /* debug, really */
@@ -2042,7 +1954,6 @@ static void journal_destroy_caches(void)
journal_destroy_revoke_caches();
journal_destroy_journal_head_cache();
journal_destroy_handle_cache();
- journal_destroy_jbd_slabs();
}

static int __init journal_init(void)
Index: linux-2.6.23-rc6/include/linux/jbd.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/jbd.h 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/include/linux/jbd.h 2007-09-18 17:51:21.000000000 -0700
@@ -71,14 +71,16 @@ extern int journal_enable_debug;
#define jbd_debug(f, a...) /**/
#endif

-extern void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-extern void * jbd_slab_alloc(size_t size, gfp_t flags);
-extern void jbd_slab_free(void *ptr, size_t size);
-
-#define jbd_kmalloc(size, flags) \
- __jbd_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
-#define jbd_rep_kmalloc(size, flags) \
- __jbd_kmalloc(__FUNCTION__, (size), (flags), 1)
+
+static inline void *jbd_alloc(size_t size, gfp_t flags)
+{
+ return (void *)__get_free_pages(flags, get_order(size));
+}
+
+static inline void jbd_free(void *ptr, size_t size)
+{
+ free_pages((unsigned long)ptr, get_order(size));
+};

#define JFS_MIN_JOURNAL_BLOCKS 1024

Index: linux-2.6.23-rc6/include/linux/jbd2.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/jbd2.h 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/include/linux/jbd2.h 2007-09-18 17:51:21.000000000 -0700
@@ -71,14 +71,15 @@ extern u8 jbd2_journal_enable_debug;
#define jbd_debug(f, a...) /**/
#endif

-extern void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-extern void * jbd2_slab_alloc(size_t size, gfp_t flags);
-extern void jbd2_slab_free(void *ptr, size_t size);
-
-#define jbd_kmalloc(size, flags) \
- __jbd2_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
-#define jbd_rep_kmalloc(size, flags) \
- __jbd2_kmalloc(__FUNCTION__, (size), (flags), 1)
+static inline void *jbd2_alloc(size_t size, gfp_t flags)
+{
+ return (void *)__get_free_pages(flags, get_order(size));
+}
+
+static inline void jbd2_free(void *ptr, size_t size)
+{
+ free_pages((unsigned long)ptr, get_order(size));
+};

#define JBD2_MIN_JOURNAL_BLOCKS 1024

@@ -959,12 +960,12 @@ void jbd2_journal_put_journal_head(struc
*/
extern struct kmem_cache *jbd2_handle_cache;

-static inline handle_t *jbd_alloc_handle(gfp_t gfp_flags)
+static inline handle_t *jbd2_alloc_handle(gfp_t gfp_flags)
{
return kmem_cache_alloc(jbd2_handle_cache, gfp_flags);
}

-static inline void jbd_free_handle(handle_t *handle)
+static inline void jbd2_free_handle(handle_t *handle)
{
kmem_cache_free(jbd2_handle_cache, handle);
}
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-18 17:51:21.000000000 -0700
@@ -84,7 +84,6 @@ EXPORT_SYMBOL(jbd2_journal_force_commit)

static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
static void __journal_abort_soft (journal_t *journal, int errno);
-static int jbd2_journal_create_jbd_slab(size_t slab_size);

/*
* Helper function used to manage commit timeouts
@@ -335,10 +334,10 @@ repeat:
char *tmp;

jbd_unlock_bh_state(bh_in);
- tmp = jbd2_slab_alloc(bh_in->b_size, GFP_NOFS);
+ tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in->b_frozen_data) {
- jbd2_slab_free(tmp, bh_in->b_size);
+ jbd2_free(tmp, bh_in->b_size);
goto repeat;
}

@@ -655,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;

- journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+ journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -1096,13 +1095,6 @@ int jbd2_journal_load(journal_t *journal
}
}

- /*
- * Create a slab for this blocksize
- */
- err = jbd2_journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
- if (err)
- return err;
-
/* Let the recovery code check whether it needs to recover any
* data from the journal. */
if (jbd2_journal_recover(journal))
@@ -1627,86 +1619,6 @@ size_t journal_tag_bytes(journal_t *jour
}

/*
- * Simple support for retrying memory allocations. Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
- return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size) (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
- "jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
-};
-
-static void jbd2_journal_destroy_jbd_slabs(void)
-{
- int i;
-
- for (i = 0; i < JBD_MAX_SLABS; i++) {
- if (jbd_slab[i])
- kmem_cache_destroy(jbd_slab[i]);
- jbd_slab[i] = NULL;
- }
-}
-
-static int jbd2_journal_create_jbd_slab(size_t slab_size)
-{
- int i = JBD_SLAB_INDEX(slab_size);
-
- BUG_ON(i >= JBD_MAX_SLABS);
-
- /*
- * Check if we already have a slab created for this size
- */
- if (jbd_slab[i])
- return 0;
-
- /*
- * Create a slab and force alignment to be same as slabsize -
- * this will make sure that allocations won't cross the page
- * boundary.
- */
- jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
- slab_size, slab_size, 0, NULL);
- if (!jbd_slab[i]) {
- printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
- return -ENOMEM;
- }
- return 0;
-}
-
-void * jbd2_slab_alloc(size_t size, gfp_t flags)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
-}
-
-void jbd2_slab_free(void *ptr, size_t size)
-{
- int idx;
-
- idx = JBD_SLAB_INDEX(size);
- BUG_ON(jbd_slab[idx] == NULL);
- kmem_cache_free(jbd_slab[idx], ptr);
-}
-
-/*
* Journal_head storage management
*/
static struct kmem_cache *jbd2_journal_head_cache;
@@ -1893,13 +1805,13 @@ static void __journal_remove_journal_hea
printk(KERN_WARNING "%s: freeing "
"b_frozen_data\n",
__FUNCTION__);
- jbd2_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd2_free(jh->b_frozen_data, bh->b_size);
}
if (jh->b_committed_data) {
printk(KERN_WARNING "%s: freeing "
"b_committed_data\n",
__FUNCTION__);
- jbd2_slab_free(jh->b_committed_data, bh->b_size);
+ jbd2_free(jh->b_committed_data, bh->b_size);
}
bh->b_private = NULL;
jh->b_bh = NULL; /* debug, really */
@@ -2040,7 +1952,6 @@ static void jbd2_journal_destroy_caches(
jbd2_journal_destroy_revoke_caches();
jbd2_journal_destroy_jbd2_journal_head_cache();
jbd2_journal_destroy_handle_cache();
- jbd2_journal_destroy_jbd_slabs();
}

static int __init journal_init(void)
Index: linux-2.6.23-rc6/fs/jbd/commit.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/commit.c 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/commit.c 2007-09-18 17:23:26.000000000 -0700
@@ -375,7 +375,7 @@ void journal_commit_transaction(journal_
struct buffer_head *bh = jh2bh(jh);

jbd_lock_bh_state(bh);
- jbd_slab_free(jh->b_committed_data, bh->b_size);
+ jbd_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
jbd_unlock_bh_state(bh);
}
@@ -792,14 +792,14 @@ restart_loop:
* Otherwise, we can just throw away the frozen data now.
*/
if (jh->b_committed_data) {
- jbd_slab_free(jh->b_committed_data, bh->b_size);
+ jbd_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
if (jh->b_frozen_data) {
jh->b_committed_data = jh->b_frozen_data;
jh->b_frozen_data = NULL;
}
} else if (jh->b_frozen_data) {
- jbd_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd_free(jh->b_frozen_data, bh->b_size);
jh->b_frozen_data = NULL;
}

Index: linux-2.6.23-rc6/fs/jbd2/commit.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/commit.c 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/commit.c 2007-09-18 17:23:26.000000000 -0700
@@ -384,7 +384,7 @@ void jbd2_journal_commit_transaction(jou
struct buffer_head *bh = jh2bh(jh);

jbd_lock_bh_state(bh);
- jbd2_slab_free(jh->b_committed_data, bh->b_size);
+ jbd2_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
jbd_unlock_bh_state(bh);
}
@@ -801,14 +801,14 @@ restart_loop:
* Otherwise, we can just throw away the frozen data now.
*/
if (jh->b_committed_data) {
- jbd2_slab_free(jh->b_committed_data, bh->b_size);
+ jbd2_free(jh->b_committed_data, bh->b_size);
jh->b_committed_data = NULL;
if (jh->b_frozen_data) {
jh->b_committed_data = jh->b_frozen_data;
jh->b_frozen_data = NULL;
}
} else if (jh->b_frozen_data) {
- jbd2_slab_free(jh->b_frozen_data, bh->b_size);
+ jbd2_free(jh->b_frozen_data, bh->b_size);
jh->b_frozen_data = NULL;
}

Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c 2007-09-18 17:51:21.000000000 -0700
@@ -96,8 +96,8 @@ static int start_this_handle(journal_t *

alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = jbd_kmalloc(sizeof(*new_transaction),
- GFP_NOFS);
+ new_transaction = kmalloc(sizeof(*new_transaction),
+ GFP_NOFS|__GFP_NOFAIL);
if (!new_transaction) {
ret = -ENOMEM;
goto out;
@@ -668,7 +668,7 @@ repeat:
JBUFFER_TRACE(jh, "allocate memory for buffer");
jbd_unlock_bh_state(bh);
frozen_buffer =
- jbd_slab_alloc(jh2bh(jh)->b_size,
+ jbd_alloc(jh2bh(jh)->b_size,
GFP_NOFS);
if (!frozen_buffer) {
printk(KERN_EMERG
@@ -728,7 +728,7 @@ done:

out:
if (unlikely(frozen_buffer)) /* It's usually NULL */
- jbd_slab_free(frozen_buffer, bh->b_size);
+ jbd_free(frozen_buffer, bh->b_size);

JBUFFER_TRACE(jh, "exit");
return error;
@@ -881,7 +881,7 @@ int journal_get_undo_access(handle_t *ha

repeat:
if (!jh->b_committed_data) {
- committed_data = jbd_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+ committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
if (!committed_data) {
printk(KERN_EMERG "%s: No memory for committed data\n",
__FUNCTION__);
@@ -908,7 +908,7 @@ repeat:
out:
journal_put_journal_head(jh);
if (unlikely(committed_data))
- jbd_slab_free(committed_data, bh->b_size);
+ jbd_free(committed_data, bh->b_size);
return err;
}

Index: linux-2.6.23-rc6/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/transaction.c 2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/transaction.c 2007-09-18 17:51:21.000000000 -0700
@@ -96,8 +96,8 @@ static int start_this_handle(journal_t *

alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = jbd_kmalloc(sizeof(*new_transaction),
- GFP_NOFS);
+ new_transaction = kmalloc(sizeof(*new_transaction),
+ GFP_NOFS|__GFP_NOFAIL);
if (!new_transaction) {
ret = -ENOMEM;
goto out;
@@ -236,7 +236,7 @@ out:
/* Allocate a new handle. This should probably be in a slab... */
static handle_t *new_handle(int nblocks)
{
- handle_t *handle = jbd_alloc_handle(GFP_NOFS);
+ handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
if (!handle)
return NULL;
memset(handle, 0, sizeof(*handle));
@@ -282,7 +282,7 @@ handle_t *jbd2_journal_start(journal_t *

err = start_this_handle(journal, handle);
if (err < 0) {
- jbd_free_handle(handle);
+ jbd2_free_handle(handle);
current->journal_info = NULL;
handle = ERR_PTR(err);
}
@@ -668,7 +668,7 @@ repeat:
JBUFFER_TRACE(jh, "allocate memory for buffer");
jbd_unlock_bh_state(bh);
frozen_buffer =
- jbd2_slab_alloc(jh2bh(jh)->b_size,
+ jbd2_alloc(jh2bh(jh)->b_size,
GFP_NOFS);
if (!frozen_buffer) {
printk(KERN_EMERG
@@ -728,7 +728,7 @@ done:

out:
if (unlikely(frozen_buffer)) /* It's usually NULL */
- jbd2_slab_free(frozen_buffer, bh->b_size);
+ jbd2_free(frozen_buffer, bh->b_size);

JBUFFER_TRACE(jh, "exit");
return error;
@@ -881,7 +881,7 @@ int jbd2_journal_get_undo_access(handle_

repeat:
if (!jh->b_committed_data) {
- committed_data = jbd2_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+ committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
if (!committed_data) {
printk(KERN_EMERG "%s: No memory for committed data\n",
__FUNCTION__);
@@ -908,7 +908,7 @@ repeat:
out:
jbd2_journal_put_journal_head(jh);
if (unlikely(committed_data))
- jbd2_slab_free(committed_data, bh->b_size);
+ jbd2_free(committed_data, bh->b_size);
return err;
}

@@ -1411,7 +1411,7 @@ int jbd2_journal_stop(handle_t *handle)
spin_unlock(&journal->j_state_lock);
}

- jbd_free_handle(handle);
+ jbd2_free_handle(handle);
return err;
}



2007-09-19 02:20:01

by Andrew Morton

Subject: Re: [PATCH] JBD slab cleanups

On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao <[email protected]> wrote:

> JBD: Replace slab allocations with page cache allocations
>
> JBD allocates memory for committed_data and frozen_data from slab. However,
> JBD should not pass slab pages down to the block layer. Use page allocator
> pages instead. This also prepares JBD for the large blocksize patchset.
>
> This patch also cleans up jbd_kmalloc and replaces it with kmalloc directly.

__GFP_NOFAIL should only be used when we have no way of recovering
from failure. The allocation in journal_init_common() (at least)
_can_ recover and hence really shouldn't be using __GFP_NOFAIL.

(Actually, nothing in the kernel should be using __GFP_NOFAIL. It is
there as a marker which says "we really shouldn't be doing this but
we don't know how to fix it").

So sometime it'd be good if you could review all the __GFP_NOFAILs in
there and see if we can remove some, thanks.

2007-09-19 19:16:15

by Mingming Cao

Subject: Re: [PATCH] JBD slab cleanups

On Tue, 2007-09-18 at 19:19 -0700, Andrew Morton wrote:
> On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao <[email protected]> wrote:
>
> > JBD: Replace slab allocations with page cache allocations
> >
> > JBD allocates memory for committed_data and frozen_data from slab. However,
> > JBD should not pass slab pages down to the block layer. Use page allocator
> > pages instead. This also prepares JBD for the large blocksize patchset.
> >
> > This patch also cleans up jbd_kmalloc and replaces it with kmalloc directly.
>
> __GFP_NOFAIL should only be used when we have no way of recovering
> from failure. The allocation in journal_init_common() (at least)
> _can_ recover and hence really shouldn't be using __GFP_NOFAIL.
>
> (Actually, nothing in the kernel should be using __GFP_NOFAIL. It is
> there as a marker which says "we really shouldn't be doing this but
> we don't know how to fix it").
>
> So sometime it'd be good if you could review all the __GFP_NOFAILs in
> there and see if we can remove some, thanks.

Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2. In all
cases except one, the caller handles memory allocation failure, so I got
rid of those __GFP_NOFAIL flags.

Also, shouldn't we use GFP_NOFS instead of the GFP_KERNEL flag for
kmalloc in jbd/jbd2? I will send a separate patch to clean that up.

Signed-off-by: Mingming Cao <[email protected]>
---
fs/jbd/journal.c | 2 +-
fs/jbd/transaction.c | 3 +--
fs/jbd2/journal.c | 2 +-
fs/jbd2/transaction.c | 3 +--
4 files changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:47:58.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:48:40.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;

- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+ journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c 2007-09-19 11:48:05.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c 2007-09-19 11:49:10.000000000 -0700
@@ -96,8 +96,7 @@ static int start_this_handle(journal_t *

alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = kmalloc(sizeof(*new_transaction),
- GFP_NOFS|__GFP_NOFAIL);
+ new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
if (!new_transaction) {
ret = -ENOMEM;
goto out;
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:48:14.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 11:49:46.000000000 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;

- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+ journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/transaction.c 2007-09-19 11:48:08.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/transaction.c 2007-09-19 11:50:12.000000000 -0700
@@ -96,8 +96,7 @@ static int start_this_handle(journal_t *

alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = kmalloc(sizeof(*new_transaction),
- GFP_NOFS|__GFP_NOFAIL);
+ new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
if (!new_transaction) {
ret = -ENOMEM;
goto out;

2007-09-19 19:22:27

by Mingming Cao

Subject: [PATCH] JBD: use GFP_NOFS in kmalloc

Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
with the rest of the kmalloc flags used in the JBD/JBD2 layer.

Signed-off-by: Mingming Cao <[email protected]>

---
fs/jbd/journal.c | 6 +++---
fs/jbd/revoke.c | 8 ++++----
fs/jbd2/journal.c | 6 +++---
fs/jbd2/revoke.c | 8 ++++----
4 files changed, 14 insertions(+), 14 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:51:10.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;

- journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+ journal = kmalloc(sizeof(*journal), GFP_NOFS);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
journal->j_blocksize = blocksize;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
/* journal descriptor can store up to n blocks -bzzz */
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c 2007-09-19 11:52:34.000000000 -0700
@@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
while((tmp >>= 1UL) != 0UL)
shift++;

- journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+ journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
if (!journal->j_revoke_table[0])
return -ENOMEM;
journal->j_revoke = journal->j_revoke_table[0];
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
journal->j_revoke->hash_shift = shift;

journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
if (!journal->j_revoke->hash_table) {
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
journal->j_revoke = NULL;
@@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
for (tmp = 0; tmp < hash_size; tmp++)
INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);

- journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+ journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
if (!journal->j_revoke_table[1]) {
kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
@@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
journal->j_revoke->hash_shift = shift;

journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
if (!journal->j_revoke->hash_table) {
kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:52:48.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 11:53:12.000000000 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;

- journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+ journal = kmalloc(sizeof(*journal), GFP_NOFS);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -724,7 +724,7 @@ journal_t * jbd2_journal_init_dev(struct
journal->j_blocksize = blocksize;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
@@ -778,7 +778,7 @@ journal_t * jbd2_journal_init_inode (str
/* journal descriptor can store up to n blocks -bzzz */
n = journal->j_blocksize / sizeof(journal_block_tag_t);
journal->j_wbufsize = n;
- journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+ journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal->j_wbuf) {
printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
__FUNCTION__);
Index: linux-2.6.23-rc6/fs/jbd2/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/revoke.c 2007-09-19 11:52:53.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/revoke.c 2007-09-19 11:53:32.000000000 -0700
@@ -207,7 +207,7 @@ int jbd2_journal_init_revoke(journal_t *
while((tmp >>= 1UL) != 0UL)
shift++;

- journal->j_revoke_table[0] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_KERNEL);
+ journal->j_revoke_table[0] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_NOFS);
if (!journal->j_revoke_table[0])
return -ENOMEM;
journal->j_revoke = journal->j_revoke_table[0];
@@ -220,7 +220,7 @@ int jbd2_journal_init_revoke(journal_t *
journal->j_revoke->hash_shift = shift;

journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
if (!journal->j_revoke->hash_table) {
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
journal->j_revoke = NULL;
@@ -230,7 +230,7 @@ int jbd2_journal_init_revoke(journal_t *
for (tmp = 0; tmp < hash_size; tmp++)
INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);

- journal->j_revoke_table[1] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_KERNEL);
+ journal->j_revoke_table[1] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_NOFS);
if (!journal->j_revoke_table[1]) {
kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
@@ -247,7 +247,7 @@ int jbd2_journal_init_revoke(journal_t *
journal->j_revoke->hash_shift = shift;

journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+ kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
if (!journal->j_revoke->hash_table) {
kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);


2007-09-19 19:26:53

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH] JBD slab cleanups

On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:

> Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2. In all
> cases except one the caller handles memory allocation failure, so I got
> rid of those __GFP_NOFAIL flags.
>
> Also, shouldn't we use GFP_KERNEL instead of the GFP_NOFS flag for kmalloc
> in jbd/jbd2? I will send a separate patch to clean that up.

No. GFP_NOFS avoids deadlock. It prevents the allocation from making
recursive calls back into the file system that could end up blocking on
jbd code.

Shaggy
--
David Kleikamp
IBM Linux Technology Center

2007-09-19 19:29:13

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH] JBD slab cleanups

On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote:
> On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:
>
> > Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2. In all
> > cases except one the caller handles memory allocation failure, so I got
> > rid of those __GFP_NOFAIL flags.
> >
> > Also, shouldn't we use GFP_KERNEL instead of the GFP_NOFS flag for kmalloc
> > in jbd/jbd2? I will send a separate patch to clean that up.
>
> No. GFP_NOFS avoids deadlock. It prevents the allocation from making
> recursive calls back into the file system that could end up blocking on
> jbd code.

Oh, I see your patch now. You mean use GFP_NOFS instead of
GFP_KERNEL. :-) OK then.

> Shaggy
--
David Kleikamp
IBM Linux Technology Center

2007-09-19 19:48:00

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] JBD slab cleanups

On Sep 19, 2007 12:15 -0700, Mingming Cao wrote:
> @@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
>
> alloc_transaction:
> if (!journal->j_running_transaction) {
> - new_transaction = kmalloc(sizeof(*new_transaction),
> - GFP_NOFS|__GFP_NOFAIL);
> + new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);

This should probably be a __GFP_NOFAIL if we are trying to start a new
handle in truncate, as there is no way to propagate an error to the caller.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-09-19 20:47:34

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH] JBD slab cleanups

On Wed, 2007-09-19 at 19:28 +0000, Dave Kleikamp wrote:
> On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote:
> > On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:
> >
> > > Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2. In all
> > > cases except one the caller handles memory allocation failure, so I got
> > > rid of those __GFP_NOFAIL flags.
> > >
> > > Also, shouldn't we use GFP_KERNEL instead of the GFP_NOFS flag for kmalloc
> > > in jbd/jbd2? I will send a separate patch to clean that up.
> >
> > No. GFP_NOFS avoids deadlock. It prevents the allocation from making
> > recursive calls back into the file system that could end up blocking on
> > jbd code.
>
> Oh, I see your patch now. You mean use GFP_NOFS instead of
> GFP_KERNEL. :-) OK then.
>

Oops, I did mean what you said here. :-)

> > Shaggy

2007-09-19 21:35:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] JBD: use GFP_NOFS in kmalloc

On Wed, 19 Sep 2007 12:22:09 -0700
Mingming Cao <[email protected]> wrote:

> Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> with the rest of kmalloc flag used in the JBD/JBD2 layer.
>
> Signed-off-by: Mingming Cao <[email protected]>
>
> ---
> fs/jbd/journal.c | 6 +++---
> fs/jbd/revoke.c | 8 ++++----
> fs/jbd2/journal.c | 6 +++---
> fs/jbd2/revoke.c | 8 ++++----
> 4 files changed, 14 insertions(+), 14 deletions(-)
>
> Index: linux-2.6.23-rc6/fs/jbd/journal.c
> ===================================================================
> --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:51:10.000000000 -0700
> +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.000000000 -0700
> @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
> journal_t *journal;
> int err;
>
> - journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> + journal = kmalloc(sizeof(*journal), GFP_NOFS);
> if (!journal)
> goto fail;
> memset(journal, 0, sizeof(*journal));
> @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
> journal->j_blocksize = blocksize;
> n = journal->j_blocksize / sizeof(journal_block_tag_t);
> journal->j_wbufsize = n;
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> if (!journal->j_wbuf) {
> printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> __FUNCTION__);
> @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
> /* journal descriptor can store up to n blocks -bzzz */
> n = journal->j_blocksize / sizeof(journal_block_tag_t);
> journal->j_wbufsize = n;
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> if (!journal->j_wbuf) {
> printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> __FUNCTION__);
> Index: linux-2.6.23-rc6/fs/jbd/revoke.c
> ===================================================================
> --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.000000000 -0700
> +++ linux-2.6.23-rc6/fs/jbd/revoke.c 2007-09-19 11:52:34.000000000 -0700
> @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
> while((tmp >>= 1UL) != 0UL)
> shift++;
>
> - journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> + journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> if (!journal->j_revoke_table[0])
> return -ENOMEM;
> journal->j_revoke = journal->j_revoke_table[0];
> @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
> journal->j_revoke->hash_shift = shift;
>
> journal->j_revoke->hash_table =
> - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> if (!journal->j_revoke->hash_table) {
> kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> journal->j_revoke = NULL;
> @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
> for (tmp = 0; tmp < hash_size; tmp++)
> INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
>
> - journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> + journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> if (!journal->j_revoke_table[1]) {
> kfree(journal->j_revoke_table[0]->hash_table);
> kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
> journal->j_revoke->hash_shift = shift;
>
> journal->j_revoke->hash_table =
> - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> if (!journal->j_revoke->hash_table) {
> kfree(journal->j_revoke_table[0]->hash_table);
> kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);

These were all OK using GFP_KERNEL.

GFP_NOFS should only be used when the caller is holding some fs locks which
might cause a deadlock if that caller reentered the fs in ->writepage (and
maybe put_inode and such). That isn't the case in any of the above code,
which is all mount time stuff (I think).

ext3/4 should be using GFP_NOFS when the caller has a transaction open, has
a page locked, is holding i_mutex, etc.

2007-09-19 21:55:21

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH] JBD: use GFP_NOFS in kmalloc

On Wed, 2007-09-19 at 14:34 -0700, Andrew Morton wrote:
> On Wed, 19 Sep 2007 12:22:09 -0700
> Mingming Cao <[email protected]> wrote:
>
> > Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> > with the rest of kmalloc flag used in the JBD/JBD2 layer.
> >
> > Signed-off-by: Mingming Cao <[email protected]>
> >
> > ---
> > fs/jbd/journal.c | 6 +++---
> > fs/jbd/revoke.c | 8 ++++----
> > fs/jbd2/journal.c | 6 +++---
> > fs/jbd2/revoke.c | 8 ++++----
> > 4 files changed, 14 insertions(+), 14 deletions(-)
> >
> > Index: linux-2.6.23-rc6/fs/jbd/journal.c
> > ===================================================================
> > --- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:51:10.000000000 -0700
> > +++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 11:51:57.000000000 -0700
> > @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
> > journal_t *journal;
> > int err;
> >
> > - journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> > + journal = kmalloc(sizeof(*journal), GFP_NOFS);
> > if (!journal)
> > goto fail;
> > memset(journal, 0, sizeof(*journal));
> > @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
> > journal->j_blocksize = blocksize;
> > n = journal->j_blocksize / sizeof(journal_block_tag_t);
> > journal->j_wbufsize = n;
> > - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> > + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> > if (!journal->j_wbuf) {
> > printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> > __FUNCTION__);
> > @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
> > /* journal descriptor can store up to n blocks -bzzz */
> > n = journal->j_blocksize / sizeof(journal_block_tag_t);
> > journal->j_wbufsize = n;
> > - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> > + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> > if (!journal->j_wbuf) {
> > printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> > __FUNCTION__);
> > Index: linux-2.6.23-rc6/fs/jbd/revoke.c
> > ===================================================================
> > --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c 2007-09-19 11:51:30.000000000 -0700
> > +++ linux-2.6.23-rc6/fs/jbd/revoke.c 2007-09-19 11:52:34.000000000 -0700
> > @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
> > while((tmp >>= 1UL) != 0UL)
> > shift++;
> >
> > - journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> > + journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> > if (!journal->j_revoke_table[0])
> > return -ENOMEM;
> > journal->j_revoke = journal->j_revoke_table[0];
> > @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
> > journal->j_revoke->hash_shift = shift;
> >
> > journal->j_revoke->hash_table =
> > - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> > + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> > if (!journal->j_revoke->hash_table) {
> > kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> > journal->j_revoke = NULL;
> > @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
> > for (tmp = 0; tmp < hash_size; tmp++)
> > INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
> >
> > - journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> > + journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> > if (!journal->j_revoke_table[1]) {
> > kfree(journal->j_revoke_table[0]->hash_table);
> > kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> > @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
> > journal->j_revoke->hash_shift = shift;
> >
> > journal->j_revoke->hash_table =
> > - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> > + kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> > if (!journal->j_revoke->hash_table) {
> > kfree(journal->j_revoke_table[0]->hash_table);
> > kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
>
> These were all OK using GFP_KERNEL.
>
> GFP_NOFS should only be used when the caller is holding some fs locks which
> might cause a deadlock if that caller reentered the fs in ->writepage (and
> maybe put_inode and such). That isn't the case in any of the above code,
> which is all mount time stuff (I think).
>

You are right, they all occur at initialization time.

> ext3/4 should be using GFP_NOFS when the caller has a transaction open, has
> a page locked, is holding i_mutex, etc.
>

Thanks for your feedback.

Mingming

2007-09-19 22:04:15

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH] JBD slab cleanups

On Wed, 2007-09-19 at 13:48 -0600, Andreas Dilger wrote:
> On Sep 19, 2007 12:15 -0700, Mingming Cao wrote:
> > @@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
> >
> > alloc_transaction:
> > if (!journal->j_running_transaction) {
> > - new_transaction = kmalloc(sizeof(*new_transaction),
> > - GFP_NOFS|__GFP_NOFAIL);
> > + new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
>
> This should probably be a __GFP_NOFAIL if we are trying to start a new
> handle in truncate, as there is no way to propagate an error to the caller.
>

Thanks, updated version.

Here is the patch to clean up the __GFP_NOFAIL flags in jbd/jbd2; in most
cases they are not needed.

Signed-off-by: Mingming Cao <[email protected]>
---
fs/jbd/journal.c | 2 +-
fs/jbd2/journal.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-19 11:47:58.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-19 14:23:45.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;

- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+ journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:48:14.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-19 14:23:45.000000000 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;

- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+ journal = kmalloc(sizeof(*journal), GFP_KERNEL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));


2007-09-20 04:25:12

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] JBD: use GFP_NOFS in kmalloc

On Sep 19, 2007 12:22 -0700, Mingming Cao wrote:
> Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> with the rest of kmalloc flag used in the JBD/JBD2 layer.
>
> @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
> - journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> + journal = kmalloc(sizeof(*journal), GFP_NOFS);
> @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
> - journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> + journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);

Is there a reason for this change except "it's in a filesystem, so it
should be GFP_NOFS"? We are only doing journal setup during mount so
there shouldn't be any problem using GFP_KERNEL. I don't think it will
inject any defect into the code, but I don't think it is needed either.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-09-21 23:14:17

by Mingming Cao

[permalink] [raw]
Subject: [PATCH] JBD/ext34 cleanups: convert to kzalloc

Convert kmalloc to kzalloc() and get rid of the memset().

Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext3/xattr.c | 3 +--
fs/ext4/xattr.c | 3 +--
fs/jbd/journal.c | 3 +--
fs/jbd/transaction.c | 2 +-
fs/jbd2/journal.c | 3 +--
fs/jbd2/transaction.c | 2 +-
6 files changed, 6 insertions(+), 10 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c 2007-09-21 09:08:02.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c 2007-09-21 09:10:37.000000000 -0700
@@ -653,10 +653,9 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;

- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+ journal = kzalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
if (!journal)
goto fail;
- memset(journal, 0, sizeof(*journal));

init_waitqueue_head(&journal->j_wait_transaction_locked);
init_waitqueue_head(&journal->j_wait_logspace);
Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c 2007-09-21 09:13:11.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c 2007-09-21 09:13:24.000000000 -0700
@@ -96,7 +96,7 @@ static int start_this_handle(journal_t *

alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = kmalloc(sizeof(*new_transaction),
+ new_transaction = kzalloc(sizeof(*new_transaction),
GFP_NOFS|__GFP_NOFAIL);
if (!new_transaction) {
ret = -ENOMEM;
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-21 09:10:53.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-21 09:11:13.000000000 -0700
@@ -654,10 +654,9 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;

- journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+ journal = kzalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
if (!journal)
goto fail;
- memset(journal, 0, sizeof(*journal));

init_waitqueue_head(&journal->j_wait_transaction_locked);
init_waitqueue_head(&journal->j_wait_logspace);
Index: linux-2.6.23-rc6/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/transaction.c 2007-09-21 09:12:46.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/transaction.c 2007-09-21 09:12:59.000000000 -0700
@@ -96,7 +96,7 @@ static int start_this_handle(journal_t *

alloc_transaction:
if (!journal->j_running_transaction) {
- new_transaction = kmalloc(sizeof(*new_transaction),
+ new_transaction = kzalloc(sizeof(*new_transaction),
GFP_NOFS|__GFP_NOFAIL);
if (!new_transaction) {
ret = -ENOMEM;
Index: linux-2.6.23-rc6/fs/ext3/xattr.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext3/xattr.c 2007-09-21 10:22:24.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext3/xattr.c 2007-09-21 10:24:19.000000000 -0700
@@ -741,12 +741,11 @@ ext3_xattr_block_set(handle_t *handle, s
}
} else {
/* Allocate a buffer where we construct the new block. */
- s->base = kmalloc(sb->s_blocksize, GFP_KERNEL);
+ s->base = kzalloc(sb->s_blocksize, GFP_KERNEL);
/* assert(header == s->base) */
error = -ENOMEM;
if (s->base == NULL)
goto cleanup;
- memset(s->base, 0, sb->s_blocksize);
header(s->base)->h_magic = cpu_to_le32(EXT3_XATTR_MAGIC);
header(s->base)->h_blocks = cpu_to_le32(1);
header(s->base)->h_refcount = cpu_to_le32(1);
Index: linux-2.6.23-rc6/fs/ext4/xattr.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext4/xattr.c 2007-09-21 10:20:21.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext4/xattr.c 2007-09-21 10:21:00.000000000 -0700
@@ -750,12 +750,11 @@ ext4_xattr_block_set(handle_t *handle, s
}
} else {
/* Allocate a buffer where we construct the new block. */
- s->base = kmalloc(sb->s_blocksize, GFP_KERNEL);
+ s->base = kzalloc(sb->s_blocksize, GFP_KERNEL);
/* assert(header == s->base) */
error = -ENOMEM;
if (s->base == NULL)
goto cleanup;
- memset(s->base, 0, sb->s_blocksize);
header(s->base)->h_magic = cpu_to_le32(EXT4_XATTR_MAGIC);
header(s->base)->h_blocks = cpu_to_le32(1);
header(s->base)->h_refcount = cpu_to_le32(1);


2007-09-21 23:32:30

by Mingming Cao

[permalink] [raw]
Subject: [PATCH] JBD2/ext4 naming cleanup

JBD2 naming cleanup

From: Mingming Cao <[email protected]>

Change macro names from JBD_XXX to JBD2_XXX in JBD2/ext4.

Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext4/extents.c | 2 +-
fs/ext4/super.c | 2 +-
fs/jbd2/commit.c | 2 +-
fs/jbd2/journal.c | 8 ++++----
fs/jbd2/recovery.c | 2 +-
fs/jbd2/revoke.c | 4 ++--
include/linux/ext4_jbd2.h | 6 +++---
include/linux/jbd2.h | 30 +++++++++++++++---------------
8 files changed, 28 insertions(+), 28 deletions(-)

Index: linux-2.6.23-rc6/fs/ext4/super.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext4/super.c 2007-09-21 16:27:31.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext4/super.c 2007-09-21 16:27:46.000000000 -0700
@@ -966,7 +966,7 @@ static int parse_options (char *options,
if (option < 0)
return 0;
if (option == 0)
- option = JBD_DEFAULT_MAX_COMMIT_AGE;
+ option = JBD2_DEFAULT_MAX_COMMIT_AGE;
sbi->s_commit_interval = HZ * option;
break;
case Opt_data_journal:
Index: linux-2.6.23-rc6/include/linux/ext4_jbd2.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/ext4_jbd2.h 2007-09-10 19:50:29.000000000 -0700
+++ linux-2.6.23-rc6/include/linux/ext4_jbd2.h 2007-09-21 16:27:46.000000000 -0700
@@ -12,8 +12,8 @@
* Ext4-specific journaling extensions.
*/

-#ifndef _LINUX_EXT4_JBD_H
-#define _LINUX_EXT4_JBD_H
+#ifndef _LINUX_EXT4_JBD2_H
+#define _LINUX_EXT4_JBD2_H

#include <linux/fs.h>
#include <linux/jbd2.h>
@@ -228,4 +228,4 @@ static inline int ext4_should_writeback_
return 0;
}

-#endif /* _LINUX_EXT4_JBD_H */
+#endif /* _LINUX_EXT4_JBD2_H */
Index: linux-2.6.23-rc6/include/linux/jbd2.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/jbd2.h 2007-09-21 09:07:09.000000000 -0700
+++ linux-2.6.23-rc6/include/linux/jbd2.h 2007-09-21 16:27:46.000000000 -0700
@@ -13,8 +13,8 @@
* filesystem journaling support.
*/

-#ifndef _LINUX_JBD_H
-#define _LINUX_JBD_H
+#ifndef _LINUX_JBD2_H
+#define _LINUX_JBD2_H

/* Allow this file to be included directly into e2fsprogs */
#ifndef __KERNEL__
@@ -37,26 +37,26 @@
#define journal_oom_retry 1

/*
- * Define JBD_PARANIOD_IOFAIL to cause a kernel BUG() if ext3 finds
+ * Define JBD2_PARANIOD_IOFAIL to cause a kernel BUG() if ext4 finds
* certain classes of error which can occur due to failed IOs. Under
- * normal use we want ext3 to continue after such errors, because
+ * normal use we want ext4 to continue after such errors, because
* hardware _can_ fail, but for debugging purposes when running tests on
* known-good hardware we may want to trap these errors.
*/
-#undef JBD_PARANOID_IOFAIL
+#undef JBD2_PARANOID_IOFAIL

/*
* The default maximum commit age, in seconds.
*/
-#define JBD_DEFAULT_MAX_COMMIT_AGE 5
+#define JBD2_DEFAULT_MAX_COMMIT_AGE 5

#ifdef CONFIG_JBD2_DEBUG
/*
- * Define JBD_EXPENSIVE_CHECKING to enable more expensive internal
+ * Define JBD2_EXPENSIVE_CHECKING to enable more expensive internal
* consistency checks. By default we don't do this unless
* CONFIG_JBD2_DEBUG is on.
*/
-#define JBD_EXPENSIVE_CHECKING
+#define JBD2_EXPENSIVE_CHECKING
extern u8 jbd2_journal_enable_debug;

#define jbd_debug(n, f, a...) \
@@ -163,8 +163,8 @@ typedef struct journal_block_tag_s
__be32 t_blocknr_high; /* most-significant high 32bits. */
} journal_block_tag_t;

-#define JBD_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
-#define JBD_TAG_SIZE64 (sizeof(journal_block_tag_t))
+#define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
+#define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t))

/*
* The revoke descriptor: used on disk to describe a series of blocks to
@@ -256,8 +256,8 @@ typedef struct journal_superblock_s
#include <linux/fs.h>
#include <linux/sched.h>

-#define JBD_ASSERTIONS
-#ifdef JBD_ASSERTIONS
+#define JBD2_ASSERTIONS
+#ifdef JBD2_ASSERTIONS
#define J_ASSERT(assert) \
do { \
if (!(assert)) { \
@@ -284,9 +284,9 @@ void buffer_assertion_failure(struct buf

#else
#define J_ASSERT(assert) do { } while (0)
-#endif /* JBD_ASSERTIONS */
+#endif /* JBD2_ASSERTIONS */

-#if defined(JBD_PARANOID_IOFAIL)
+#if defined(JBD2_PARANOID_IOFAIL)
#define J_EXPECT(expr, why...) J_ASSERT(expr)
#define J_EXPECT_BH(bh, expr, why...) J_ASSERT_BH(bh, expr)
#define J_EXPECT_JH(jh, expr, why...) J_ASSERT_JH(jh, expr)
@@ -1104,4 +1104,4 @@ extern int jbd_blocks_per_page(struct in

#endif /* __KERNEL__ */

-#endif /* _LINUX_JBD_H */
+#endif /* _LINUX_JBD2_H */
Index: linux-2.6.23-rc6/fs/jbd2/commit.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/commit.c 2007-09-21 09:07:09.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/commit.c 2007-09-21 16:27:46.000000000 -0700
@@ -278,7 +278,7 @@ static inline void write_tag_block(int t
unsigned long long block)
{
tag->t_blocknr = cpu_to_be32(block & (u32)~0);
- if (tag_bytes > JBD_TAG_SIZE32)
+ if (tag_bytes > JBD2_TAG_SIZE32)
tag->t_blocknr_high = cpu_to_be32((block >> 31) >> 1);
}

Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-21 16:25:46.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c 2007-09-21 16:27:46.000000000 -0700
@@ -670,7 +670,7 @@ static journal_t * journal_init_common (
spin_lock_init(&journal->j_list_lock);
spin_lock_init(&journal->j_state_lock);

- journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE);
+ journal->j_commit_interval = (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE);

/* The journal is marked for error until we succeed with recovery! */
journal->j_flags = JBD2_ABORT;
@@ -1612,9 +1612,9 @@ int jbd2_journal_blocks_per_page(struct
size_t journal_tag_bytes(journal_t *journal)
{
if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT))
- return JBD_TAG_SIZE64;
+ return JBD2_TAG_SIZE64;
else
- return JBD_TAG_SIZE32;
+ return JBD2_TAG_SIZE32;
}

/*
@@ -1681,7 +1681,7 @@ static void journal_free_journal_head(st
{
#ifdef CONFIG_JBD2_DEBUG
atomic_dec(&nr_journal_heads);
- memset(jh, JBD_POISON_FREE, sizeof(*jh));
+ memset(jh, JBD2_POISON_FREE, sizeof(*jh));
#endif
kmem_cache_free(jbd2_journal_head_cache, jh);
}
Index: linux-2.6.23-rc6/fs/jbd2/recovery.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/recovery.c 2007-09-21 09:07:05.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/recovery.c 2007-09-21 16:27:46.000000000 -0700
@@ -311,7 +311,7 @@ int jbd2_journal_skip_recovery(journal_t
static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag_t *tag)
{
unsigned long long block = be32_to_cpu(tag->t_blocknr);
- if (tag_bytes > JBD_TAG_SIZE32)
+ if (tag_bytes > JBD2_TAG_SIZE32)
block |= (u64)be32_to_cpu(tag->t_blocknr_high) << 32;
return block;
}
Index: linux-2.6.23-rc6/fs/jbd2/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/revoke.c 2007-09-19 14:23:45.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/revoke.c 2007-09-21 16:27:46.000000000 -0700
@@ -352,7 +352,7 @@ int jbd2_journal_revoke(handle_t *handle
if (bh)
BUFFER_TRACE(bh, "found on hash");
}
-#ifdef JBD_EXPENSIVE_CHECKING
+#ifdef JBD2_EXPENSIVE_CHECKING
else {
struct buffer_head *bh2;

@@ -453,7 +453,7 @@ int jbd2_journal_cancel_revoke(handle_t
}
}

-#ifdef JBD_EXPENSIVE_CHECKING
+#ifdef JBD2_EXPENSIVE_CHECKING
/* There better not be one left behind by now! */
record = find_revoke_record(journal, bh->b_blocknr);
J_ASSERT_JH(jh, record == NULL);
Index: linux-2.6.23-rc6/fs/ext4/extents.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext4/extents.c 2007-09-21 09:07:04.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext4/extents.c 2007-09-21 16:27:46.000000000 -0700
@@ -33,7 +33,7 @@
#include <linux/fs.h>
#include <linux/time.h>
#include <linux/ext4_jbd2.h>
-#include <linux/jbd.h>
+#include <linux/jbd2.h>
#include <linux/highuid.h>
#include <linux/pagemap.h>
#include <linux/quotaops.h>


2007-09-26 19:54:54

by Andrew Morton

Subject: Re: [PATCH] JBD/ext34 cleanups: convert to kzalloc

On Fri, 21 Sep 2007 16:13:56 -0700
Mingming Cao <[email protected]> wrote:

> Convert kmalloc to kzalloc() and get rid of the memset().

I split this into separate ext3/jbd and ext4/jbd2 patches. It's generally
better to raise separate patches, please - the ext3 patches I'll merge
directly but the ext4 patches should go through (and be against) the ext4
devel tree.

I fixed lots of rejects against the already-pending changes to these
filesystems.

You forgot to remove the memsets in both start_this_handle()s.

2007-09-26 21:06:38

by Mingming Cao

Subject: Re: [PATCH] JBD/ext34 cleanups: convert to kzalloc

On Wed, 2007-09-26 at 12:54 -0700, Andrew Morton wrote:
> On Fri, 21 Sep 2007 16:13:56 -0700
> Mingming Cao <[email protected]> wrote:
>
> > Convert kmalloc to kzalloc() and get rid of the memset().
>
> I split this into separate ext3/jbd and ext4/jbd2 patches. It's generally
> better to raise separate patches, please - the ext3 patches I'll merge
> directly but the ext4 patches should go through (and be against) the ext4
> devel tree.
>
Sure. The patches(including ext3/jbd and ext4/jbd2) were merged into
ext4 devel tree already, I will remove the ext3/jbd part out of the ext4
devel tree.

> I fixed lots of rejects against the already-pending changes to these
> filesystems.
>
> You forgot to remove the memsets in both start_this_handle()s.
>
Thanks for catching this.

Mingming

2007-10-02 00:35:19

by Mingming Cao

Subject: [PATCH 1/2] ext4: Support large blocksize up to PAGESIZE

Support large blocksize up to PAGESIZE (max 64KB) for ext4.

From: Takashi Sato <[email protected]>

This patch set supports large block sizes (>4k, <=64k) in ext4,
just enlarging the block size limit. But it is NOT possible to have a 64kB
blocksize on ext4 without some changes to the directory handling
code. The reason is that an empty 64kB directory block would have
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem. The proposed solution is to represent a 64k rec_len on disk
with an otherwise impossible value, rec_len = 0xffff.

The Patch-set consists of the following 2 patches.
[1/2] ext4: enlarge blocksize
- Allow blocksize up to pagesize

[2/2] ext4: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize

With this patch set, a ppc64 box with 64k pages can create a 64k
block size ext4dev filesystem and handle an empty directory block.
Please consider this for merging into 2.6.24-rc1.

Signed-off-by: Takashi Sato <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---

fs/ext4/super.c | 5 +++++
include/linux/ext4_fs.h | 4 ++--
2 files changed, 7 insertions(+), 2 deletions(-)


diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 619db84..d8bb279 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1548,6 +1548,11 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
goto out_fail;
}

+ if (!sb_set_blocksize(sb, blocksize)) {
+ printk(KERN_ERR "EXT4-fs: bad blocksize %d.\n", blocksize);
+ goto out_fail;
+ }
+
/*
* The ext4 superblock will not be buffer aligned for other than 1kB
* block sizes. We need to calculate the offset from buffer start.
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index f9881b6..d15a15e 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -77,8 +77,8 @@
* Macro-instructions used to manage several block sizes
*/
#define EXT4_MIN_BLOCK_SIZE 1024
-#define EXT4_MAX_BLOCK_SIZE 4096
-#define EXT4_MIN_BLOCK_LOG_SIZE 10
+#define EXT4_MAX_BLOCK_SIZE 65536
+#define EXT4_MIN_BLOCK_LOG_SIZE 10
#ifdef __KERNEL__
# define EXT4_BLOCK_SIZE(s) ((s)->s_blocksize)
#else


2007-10-02 00:41:46

by Mingming Cao

Subject: [PATCH 2/2] ext4: Avoid rec_len overflow with 64KB block size

ext4: Avoid rec_len overflow with 64KB block size

From: Jan Kara <[email protected]>

With a 64KB blocksize, a directory entry can have size 64KB, which does not
fit into the 16 bits we have for the entry length. So we store 0xffff instead
and convert the value when it is read from / written to disk. The patch also
converts some places to use ext4_next_entry() since we are changing them anyway.

Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---

fs/ext4/dir.c | 12 ++++---
fs/ext4/namei.c | 76 ++++++++++++++++++++++-------------------------
include/linux/ext4_fs.h | 20 ++++++++++++
3 files changed, 62 insertions(+), 46 deletions(-)


diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 3ab01c0..20b1e28 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -69,7 +69,7 @@ int ext4_check_dir_entry (const char * function, struct inode * dir,
unsigned long offset)
{
const char * error_msg = NULL;
- const int rlen = le16_to_cpu(de->rec_len);
+ const int rlen = ext4_rec_len_from_disk(de->rec_len);

if (rlen < EXT4_DIR_REC_LEN(1))
error_msg = "rec_len is smaller than minimal";
@@ -176,10 +176,10 @@ revalidate:
* least that it is non-zero. A
* failure will be detected in the
* dirent test below. */
- if (le16_to_cpu(de->rec_len) <
- EXT4_DIR_REC_LEN(1))
+ if (ext4_rec_len_from_disk(de->rec_len)
+ < EXT4_DIR_REC_LEN(1))
break;
- i += le16_to_cpu(de->rec_len);
+ i += ext4_rec_len_from_disk(de->rec_len);
}
offset = i;
filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
@@ -201,7 +201,7 @@ revalidate:
ret = stored;
goto out;
}
- offset += le16_to_cpu(de->rec_len);
+ offset += ext4_rec_len_from_disk(de->rec_len);
if (le32_to_cpu(de->inode)) {
/* We might block in the next section
* if the data destination is
@@ -223,7 +223,7 @@ revalidate:
goto revalidate;
stored ++;
}
- filp->f_pos += le16_to_cpu(de->rec_len);
+ filp->f_pos += ext4_rec_len_from_disk(de->rec_len);
}
offset = 0;
brelse (bh);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 5fdb862..96e8a85 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -281,7 +281,7 @@ static struct stats dx_show_leaf(struct dx_hash_info *hinfo, struct ext4_dir_ent
space += EXT4_DIR_REC_LEN(de->name_len);
names++;
}
- de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+ de = ext4_next_entry(de);
}
printk("(%i)\n", names);
return (struct stats) { names, space, 1 };
@@ -552,7 +552,8 @@ static int ext4_htree_next_block(struct inode *dir, __u32 hash,
*/
static inline struct ext4_dir_entry_2 *ext4_next_entry(struct ext4_dir_entry_2 *p)
{
- return (struct ext4_dir_entry_2 *)((char*)p + le16_to_cpu(p->rec_len));
+ return (struct ext4_dir_entry_2 *)((char*)p +
+ ext4_rec_len_from_disk(p->rec_len));
}

/*
@@ -721,7 +722,7 @@ static int dx_make_map (struct ext4_dir_entry_2 *de, int size,
cond_resched();
}
/* XXX: do we need to check rec_len == 0 case? -Chris */
- de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+ de = ext4_next_entry(de);
}
return count;
}
@@ -823,7 +824,7 @@ static inline int search_dirblock(struct buffer_head * bh,
return 1;
}
/* prevent looping on a bad block */
- de_len = le16_to_cpu(de->rec_len);
+ de_len = ext4_rec_len_from_disk(de->rec_len);
if (de_len <= 0)
return -1;
offset += de_len;
@@ -1136,7 +1137,7 @@ dx_move_dirents(char *from, char *to, struct dx_map_entry *map, int count)
rec_len = EXT4_DIR_REC_LEN(de->name_len);
memcpy (to, de, rec_len);
((struct ext4_dir_entry_2 *) to)->rec_len =
- cpu_to_le16(rec_len);
+ ext4_rec_len_to_disk(rec_len);
de->inode = 0;
map++;
to += rec_len;
@@ -1155,13 +1156,12 @@ static struct ext4_dir_entry_2* dx_pack_dirents(char *base, int size)

prev = to = de;
while ((char*)de < base + size) {
- next = (struct ext4_dir_entry_2 *) ((char *) de +
- le16_to_cpu(de->rec_len));
+ next = ext4_next_entry(de);
if (de->inode && de->name_len) {
rec_len = EXT4_DIR_REC_LEN(de->name_len);
if (de > to)
memmove(to, de, rec_len);
- to->rec_len = cpu_to_le16(rec_len);
+ to->rec_len = ext4_rec_len_to_disk(rec_len);
prev = to;
to = (struct ext4_dir_entry_2 *) (((char *) to) + rec_len);
}
@@ -1235,8 +1235,8 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
/* Fancy dance to stay within two buffers */
de2 = dx_move_dirents(data1, data2, map + split, count - split);
de = dx_pack_dirents(data1,blocksize);
- de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
- de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+ de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de);
+ de2->rec_len = ext4_rec_len_to_disk(data2 + blocksize - (char *) de2);
dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data1, blocksize, 1));
dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data2, blocksize, 1));

@@ -1307,7 +1307,7 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
return -EEXIST;
}
nlen = EXT4_DIR_REC_LEN(de->name_len);
- rlen = le16_to_cpu(de->rec_len);
+ rlen = ext4_rec_len_from_disk(de->rec_len);
if ((de->inode? rlen - nlen: rlen) >= reclen)
break;
de = (struct ext4_dir_entry_2 *)((char *)de + rlen);
@@ -1326,11 +1326,11 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,

/* By now the buffer is marked for journaling */
nlen = EXT4_DIR_REC_LEN(de->name_len);
- rlen = le16_to_cpu(de->rec_len);
+ rlen = ext4_rec_len_from_disk(de->rec_len);
if (de->inode) {
struct ext4_dir_entry_2 *de1 = (struct ext4_dir_entry_2 *)((char *)de + nlen);
- de1->rec_len = cpu_to_le16(rlen - nlen);
- de->rec_len = cpu_to_le16(nlen);
+ de1->rec_len = ext4_rec_len_to_disk(rlen - nlen);
+ de->rec_len = ext4_rec_len_to_disk(nlen);
de = de1;
}
de->file_type = EXT4_FT_UNKNOWN;
@@ -1408,17 +1408,18 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,

/* The 0th block becomes the root, move the dirents out */
fde = &root->dotdot;
- de = (struct ext4_dir_entry_2 *)((char *)fde + le16_to_cpu(fde->rec_len));
+ de = (struct ext4_dir_entry_2 *)((char *)fde +
+ ext4_rec_len_from_disk(fde->rec_len));
len = ((char *) root) + blocksize - (char *) de;
memcpy (data1, de, len);
de = (struct ext4_dir_entry_2 *) data1;
top = data1 + len;
- while ((char *)(de2=(void*)de+le16_to_cpu(de->rec_len)) < top)
+ while ((char *)(de2 = ext4_next_entry(de)) < top)
de = de2;
- de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+ de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de);
/* Initialize the root; the dot dirents already exist */
de = (struct ext4_dir_entry_2 *) (&root->dotdot);
- de->rec_len = cpu_to_le16(blocksize - EXT4_DIR_REC_LEN(2));
+ de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(2));
memset (&root->info, 0, sizeof(root->info));
root->info.info_length = sizeof(root->info);
root->info.hash_version = EXT4_SB(dir->i_sb)->s_def_hash_version;
@@ -1505,7 +1506,7 @@ static int ext4_add_entry (handle_t *handle, struct dentry *dentry,
return retval;
de = (struct ext4_dir_entry_2 *) bh->b_data;
de->inode = 0;
- de->rec_len = cpu_to_le16(blocksize);
+ de->rec_len = ext4_rec_len_to_disk(blocksize);
return add_dirent_to_buf(handle, dentry, inode, de, bh);
}

@@ -1569,7 +1570,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
goto cleanup;
node2 = (struct dx_node *)(bh2->b_data);
entries2 = node2->entries;
- node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
+ node2->fake.rec_len = ext4_rec_len_to_disk(sb->s_blocksize);
node2->fake.inode = 0;
BUFFER_TRACE(frame->bh, "get_write_access");
err = ext4_journal_get_write_access(handle, frame->bh);
@@ -1668,9 +1669,9 @@ static int ext4_delete_entry (handle_t *handle,
BUFFER_TRACE(bh, "get_write_access");
ext4_journal_get_write_access(handle, bh);
if (pde)
- pde->rec_len =
- cpu_to_le16(le16_to_cpu(pde->rec_len) +
- le16_to_cpu(de->rec_len));
+ pde->rec_len = ext4_rec_len_to_disk(
+ ext4_rec_len_from_disk(pde->rec_len) +
+ ext4_rec_len_from_disk(de->rec_len));
else
de->inode = 0;
dir->i_version++;
@@ -1678,10 +1679,9 @@ static int ext4_delete_entry (handle_t *handle,
ext4_journal_dirty_metadata(handle, bh);
return 0;
}
- i += le16_to_cpu(de->rec_len);
+ i += ext4_rec_len_from_disk(de->rec_len);
pde = de;
- de = (struct ext4_dir_entry_2 *)
- ((char *) de + le16_to_cpu(de->rec_len));
+ de = ext4_next_entry(de);
}
return -ENOENT;
}
@@ -1844,13 +1844,12 @@ retry:
de = (struct ext4_dir_entry_2 *) dir_block->b_data;
de->inode = cpu_to_le32(inode->i_ino);
de->name_len = 1;
- de->rec_len = cpu_to_le16(EXT4_DIR_REC_LEN(de->name_len));
+ de->rec_len = ext4_rec_len_to_disk(EXT4_DIR_REC_LEN(de->name_len));
strcpy (de->name, ".");
ext4_set_de_type(dir->i_sb, de, S_IFDIR);
- de = (struct ext4_dir_entry_2 *)
- ((char *) de + le16_to_cpu(de->rec_len));
+ de = ext4_next_entry(de);
de->inode = cpu_to_le32(dir->i_ino);
- de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize-EXT4_DIR_REC_LEN(1));
+ de->rec_len = ext4_rec_len_to_disk(inode->i_sb->s_blocksize-EXT4_DIR_REC_LEN(1));
de->name_len = 2;
strcpy (de->name, "..");
ext4_set_de_type(dir->i_sb, de, S_IFDIR);
@@ -1902,8 +1901,7 @@ static int empty_dir (struct inode * inode)
return 1;
}
de = (struct ext4_dir_entry_2 *) bh->b_data;
- de1 = (struct ext4_dir_entry_2 *)
- ((char *) de + le16_to_cpu(de->rec_len));
+ de1 = ext4_next_entry(de);
if (le32_to_cpu(de->inode) != inode->i_ino ||
!le32_to_cpu(de1->inode) ||
strcmp (".", de->name) ||
@@ -1914,9 +1912,9 @@ static int empty_dir (struct inode * inode)
brelse (bh);
return 1;
}
- offset = le16_to_cpu(de->rec_len) + le16_to_cpu(de1->rec_len);
- de = (struct ext4_dir_entry_2 *)
- ((char *) de1 + le16_to_cpu(de1->rec_len));
+ offset = ext4_rec_len_from_disk(de->rec_len) +
+ ext4_rec_len_from_disk(de1->rec_len);
+ de = ext4_next_entry(de1);
while (offset < inode->i_size ) {
if (!bh ||
(void *) de >= (void *) (bh->b_data+sb->s_blocksize)) {
@@ -1945,9 +1943,8 @@ static int empty_dir (struct inode * inode)
brelse (bh);
return 0;
}
- offset += le16_to_cpu(de->rec_len);
- de = (struct ext4_dir_entry_2 *)
- ((char *) de + le16_to_cpu(de->rec_len));
+ offset += ext4_rec_len_from_disk(de->rec_len);
+ de = ext4_next_entry(de);
}
brelse (bh);
return 1;
@@ -2302,8 +2299,7 @@ retry:
}

#define PARENT_INO(buffer) \
- ((struct ext4_dir_entry_2 *) ((char *) buffer + \
- le16_to_cpu(((struct ext4_dir_entry_2 *) buffer)->rec_len)))->inode
+ (ext4_next_entry((struct ext4_dir_entry_2 *)(buffer))->inode)

/*
* Anybody can rename anything with this: the permission checks are left to the
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index d15a15e..e1caf0a 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -771,6 +771,26 @@ struct ext4_dir_entry_2 {
#define EXT4_DIR_ROUND (EXT4_DIR_PAD - 1)
#define EXT4_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT4_DIR_ROUND) & \
~EXT4_DIR_ROUND)
+#define EXT4_MAX_REC_LEN ((1<<16)-1)
+
+static inline unsigned ext4_rec_len_from_disk(__le16 dlen)
+{
+ unsigned len = le16_to_cpu(dlen);
+
+ if (len == EXT4_MAX_REC_LEN)
+ return 1 << 16;
+ return len;
+}
+
+static inline __le16 ext4_rec_len_to_disk(unsigned len)
+{
+ if (len == (1 << 16))
+ return cpu_to_le16(EXT4_MAX_REC_LEN);
+ else if (len > (1 << 16))
+ BUG();
+ return cpu_to_le16(len);
+}
+
/*
* Hash Tree Directory indexing
* (c) Daniel Phillips, 2001


2007-10-02 00:47:30

by Mingming Cao

Subject: [PATCH 1/2] ext2: Support large blocksize up to PAGESIZE

Support large blocksize up to PAGESIZE (max 64KB) for ext2

From: Takashi Sato <[email protected]>

This patch set supports large block sizes (>4k, <=64k) in ext2,
just enlarging the block size limit. But it is NOT possible to have a 64kB
blocksize on ext2 without some changes to the directory handling
code. The reason is that an empty 64kB directory block would have
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem. The proposed solution is to represent a 64k rec_len on disk
with an otherwise impossible value, rec_len = 0xffff.

The Patch-set consists of the following 2 patches.
[1/2] ext2: enlarge blocksize
- Allow blocksize up to pagesize

[2/2] ext2: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize

With this patch set, a ppc64 box with 64k pages can create a 64k
block size ext2 filesystem and handle an empty directory block.

Please consider including this in the mm tree.

Signed-off-by: Takashi Sato <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---

fs/ext2/super.c | 3 ++-
include/linux/ext2_fs.h | 4 ++--
2 files changed, 4 insertions(+), 3 deletions(-)


diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 639a32c..765c805 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -775,7 +775,8 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
brelse(bh);

if (!sb_set_blocksize(sb, blocksize)) {
- printk(KERN_ERR "EXT2-fs: blocksize too small for device.\n");
+ printk(KERN_ERR "EXT2-fs: bad blocksize %d.\n",
+ blocksize);
goto failed_sbi;
}

diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index 153d755..910a705 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -86,8 +86,8 @@ static inline struct ext2_sb_info *EXT2_SB(struct super_block *sb)
* Macro-instructions used to manage several block sizes
*/
#define EXT2_MIN_BLOCK_SIZE 1024
-#define EXT2_MAX_BLOCK_SIZE 4096
-#define EXT2_MIN_BLOCK_LOG_SIZE 10
+#define EXT2_MAX_BLOCK_SIZE 65536
+#define EXT2_MIN_BLOCK_LOG_SIZE 10
#ifdef __KERNEL__
# define EXT2_BLOCK_SIZE(s) ((s)->s_blocksize)
#else


2007-10-02 00:50:40

by Mingming Cao

Subject: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

ext2: Avoid rec_len overflow with 64KB block size

From: Jan Kara <[email protected]>

With a 64KB blocksize, a directory entry can have size 64KB, which does not
fit into the 16 bits we have for the entry length. So we store 0xffff instead
and convert the value when it is read from / written to disk.

Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---

fs/ext2/dir.c | 43 +++++++++++++++++++++++++++++++------------
include/linux/ext2_fs.h | 1 +
2 files changed, 32 insertions(+), 12 deletions(-)


diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 2bf49d7..1329bdb 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -26,6 +26,24 @@

typedef struct ext2_dir_entry_2 ext2_dirent;

+static inline unsigned ext2_rec_len_from_disk(__le16 dlen)
+{
+ unsigned len = le16_to_cpu(dlen);
+
+ if (len == EXT2_MAX_REC_LEN)
+ return 1 << 16;
+ return len;
+}
+
+static inline __le16 ext2_rec_len_to_disk(unsigned len)
+{
+ if (len == (1 << 16))
+ return cpu_to_le16(EXT2_MAX_REC_LEN);
+ else if (len > (1 << 16))
+ BUG();
+ return cpu_to_le16(len);
+}
+
/*
* ext2 uses block-sized chunks. Arguably, sector-sized ones would be
* more robust, but we have what we have
@@ -95,7 +113,7 @@ static void ext2_check_page(struct page *page)
}
for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
p = (ext2_dirent *)(kaddr + offs);
- rec_len = le16_to_cpu(p->rec_len);
+ rec_len = ext2_rec_len_from_disk(p->rec_len);

if (rec_len < EXT2_DIR_REC_LEN(1))
goto Eshort;
@@ -193,7 +211,8 @@ static inline int ext2_match (int len, const char * const name,
*/
static inline ext2_dirent *ext2_next_entry(ext2_dirent *p)
{
- return (ext2_dirent *)((char*)p + le16_to_cpu(p->rec_len));
+ return (ext2_dirent *)((char*)p +
+ ext2_rec_len_from_disk(p->rec_len));
}

static inline unsigned
@@ -305,7 +324,7 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
return 0;
}
}
- filp->f_pos += le16_to_cpu(de->rec_len);
+ filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
}
ext2_put_page(page);
}
@@ -413,7 +432,7 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
struct page *page, struct inode *inode)
{
unsigned from = (char *) de - (char *) page_address(page);
- unsigned to = from + le16_to_cpu(de->rec_len);
+ unsigned to = from + ext2_rec_len_from_disk(de->rec_len);
int err;

lock_page(page);
@@ -469,7 +488,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
/* We hit i_size */
name_len = 0;
rec_len = chunk_size;
- de->rec_len = cpu_to_le16(chunk_size);
+ de->rec_len = ext2_rec_len_to_disk(chunk_size);
de->inode = 0;
goto got_it;
}
@@ -483,7 +502,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
if (ext2_match (namelen, name, de))
goto out_unlock;
name_len = EXT2_DIR_REC_LEN(de->name_len);
- rec_len = le16_to_cpu(de->rec_len);
+ rec_len = ext2_rec_len_from_disk(de->rec_len);
if (!de->inode && rec_len >= reclen)
goto got_it;
if (rec_len >= name_len + reclen)
@@ -504,8 +523,8 @@ got_it:
goto out_unlock;
if (de->inode) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
- de1->rec_len = cpu_to_le16(rec_len - name_len);
- de->rec_len = cpu_to_le16(name_len);
+ de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+ de->rec_len = ext2_rec_len_to_disk(name_len);
de = de1;
}
de->name_len = namelen;
@@ -536,7 +555,7 @@ int ext2_delete_entry (struct ext2_dir_entry_2 * dir, struct page * page )
struct inode *inode = mapping->host;
char *kaddr = page_address(page);
unsigned from = ((char*)dir - kaddr) & ~(ext2_chunk_size(inode)-1);
- unsigned to = ((char*)dir - kaddr) + le16_to_cpu(dir->rec_len);
+ unsigned to = ((char*)dir - kaddr) + ext2_rec_len_from_disk(dir->rec_len);
ext2_dirent * pde = NULL;
ext2_dirent * de = (ext2_dirent *) (kaddr + from);
int err;
@@ -557,7 +576,7 @@ int ext2_delete_entry (struct ext2_dir_entry_2 * dir, struct page * page )
err = mapping->a_ops->prepare_write(NULL, page, from, to);
BUG_ON(err);
if (pde)
- pde->rec_len = cpu_to_le16(to-from);
+ pde->rec_len = ext2_rec_len_to_disk(to-from);
dir->inode = 0;
err = ext2_commit_chunk(page, from, to);
inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC;
@@ -591,14 +610,14 @@ int ext2_make_empty(struct inode *inode, struct inode *parent)
memset(kaddr, 0, chunk_size);
de = (struct ext2_dir_entry_2 *)kaddr;
de->name_len = 1;
- de->rec_len = cpu_to_le16(EXT2_DIR_REC_LEN(1));
+ de->rec_len = ext2_rec_len_to_disk(EXT2_DIR_REC_LEN(1));
memcpy (de->name, ".\0\0", 4);
de->inode = cpu_to_le32(inode->i_ino);
ext2_set_de_type (de, inode);

de = (struct ext2_dir_entry_2 *)(kaddr + EXT2_DIR_REC_LEN(1));
de->name_len = 2;
- de->rec_len = cpu_to_le16(chunk_size - EXT2_DIR_REC_LEN(1));
+ de->rec_len = ext2_rec_len_to_disk(chunk_size - EXT2_DIR_REC_LEN(1));
de->inode = cpu_to_le32(parent->i_ino);
memcpy (de->name, "..\0", 4);
ext2_set_de_type (de, inode);
diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index 910a705..41063d5 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -557,5 +557,6 @@ enum {
#define EXT2_DIR_ROUND (EXT2_DIR_PAD - 1)
#define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND) & \
~EXT2_DIR_ROUND)
+#define EXT2_MAX_REC_LEN ((1<<16)-1)

#endif /* _LINUX_EXT2_FS_H */


2007-10-02 00:53:55

by Mingming Cao

Subject: [PATCH 1/2] ext3: Support large blocksize up to PAGESIZE

Support large blocksize up to PAGESIZE (max 64KB) for ext3

From: Takashi Sato <[email protected]>

This patch set supports large block sizes (>4k, <=64k) in ext3,
just enlarging the block size limit. But it is NOT possible to have a 64kB
blocksize on ext3 without some changes to the directory handling
code. The reason is that an empty 64kB directory block would have
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem. The proposed solution is to represent a 64k rec_len on disk
with an otherwise impossible value, rec_len = 0xffff.

The Patch-set consists of the following 2 patches.
[1/2] ext3: enlarge blocksize
- Allow blocksize up to pagesize

[2/2] ext3: fix rec_len overflow
- prevent rec_len from overflow with 64KB blocksize

With this patch set, a ppc64 box with 64k pages can create a 64k
block size ext3 filesystem and handle an empty directory block.

Signed-off-by: Takashi Sato <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---

fs/ext3/super.c | 6 +++++-
include/linux/ext3_fs.h | 4 ++--
2 files changed, 7 insertions(+), 3 deletions(-)


diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 9537316..b4bfd36 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1549,7 +1549,11 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
}

brelse (bh);
- sb_set_blocksize(sb, blocksize);
+ if (!sb_set_blocksize(sb, blocksize)) {
+ printk(KERN_ERR "EXT3-fs: bad blocksize %d.\n",
+ blocksize);
+ goto out_fail;
+ }
logic_sb_block = (sb_block * EXT3_MIN_BLOCK_SIZE) / blocksize;
offset = (sb_block * EXT3_MIN_BLOCK_SIZE) % blocksize;
bh = sb_bread(sb, logic_sb_block);
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index ece49a8..7aa5556 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -76,8 +76,8 @@
* Macro-instructions used to manage several block sizes
*/
#define EXT3_MIN_BLOCK_SIZE 1024
-#define EXT3_MAX_BLOCK_SIZE 4096
-#define EXT3_MIN_BLOCK_LOG_SIZE 10
+#define EXT3_MAX_BLOCK_SIZE 65536
+#define EXT3_MIN_BLOCK_LOG_SIZE 10
#ifdef __KERNEL__
# define EXT3_BLOCK_SIZE(s) ((s)->s_blocksize)
#else


2007-10-02 00:54:14

by Mingming Cao

Subject: [PATCH 2/2] ext3: Avoid rec_len overflow with 64KB block size

ext3: Avoid rec_len overflow with 64KB block size

From: Jan Kara <[email protected]>

With a 64KB blocksize, a directory entry can have size 64KB, which does not
fit into the 16 bits we have for the entry length. So we store 0xffff instead
and convert the value when it is read from / written to disk. The patch also
converts some places to use ext3_next_entry() since we are changing them anyway.

Signed-off-by: Jan Kara <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---

fs/ext3/dir.c | 10 +++--
fs/ext3/namei.c | 90 ++++++++++++++++++++++-------------------------
include/linux/ext3_fs.h | 20 ++++++++++
3 files changed, 68 insertions(+), 52 deletions(-)


diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index c00723a..3c4c43a 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -69,7 +69,7 @@ int ext3_check_dir_entry (const char * function, struct inode * dir,
unsigned long offset)
{
const char * error_msg = NULL;
- const int rlen = le16_to_cpu(de->rec_len);
+ const int rlen = ext3_rec_len_from_disk(de->rec_len);

if (rlen < EXT3_DIR_REC_LEN(1))
error_msg = "rec_len is smaller than minimal";
@@ -177,10 +177,10 @@ revalidate:
* least that it is non-zero. A
* failure will be detected in the
* dirent test below. */
- if (le16_to_cpu(de->rec_len) <
+ if (ext3_rec_len_from_disk(de->rec_len) <
EXT3_DIR_REC_LEN(1))
break;
- i += le16_to_cpu(de->rec_len);
+ i += ext3_rec_len_from_disk(de->rec_len);
}
offset = i;
filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
@@ -201,7 +201,7 @@ revalidate:
ret = stored;
goto out;
}
- offset += le16_to_cpu(de->rec_len);
+ offset += ext3_rec_len_from_disk(de->rec_len);
if (le32_to_cpu(de->inode)) {
/* We might block in the next section
* if the data destination is
@@ -223,7 +223,7 @@ revalidate:
goto revalidate;
stored ++;
}
- filp->f_pos += le16_to_cpu(de->rec_len);
+ filp->f_pos += ext3_rec_len_from_disk(de->rec_len);
}
offset = 0;
brelse (bh);
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index c1fa190..2c38eb6 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -144,6 +144,15 @@ struct dx_map_entry
u16 size;
};

+/*
+ * p is at least 6 bytes before the end of page
+ */
+static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p)
+{
+ return (struct ext3_dir_entry_2 *)((char*)p +
+ ext3_rec_len_from_disk(p->rec_len));
+}
+
#ifdef CONFIG_EXT3_INDEX
static inline unsigned dx_get_block (struct dx_entry *entry);
static void dx_set_block (struct dx_entry *entry, unsigned value);
@@ -281,7 +290,7 @@ static struct stats dx_show_leaf(struct dx_hash_info *hinfo, struct ext3_dir_ent
space += EXT3_DIR_REC_LEN(de->name_len);
names++;
}
- de = (struct ext3_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+ de = ext3_next_entry(de);
}
printk("(%i)\n", names);
return (struct stats) { names, space, 1 };
@@ -548,14 +557,6 @@ static int ext3_htree_next_block(struct inode *dir, __u32 hash,


/*
- * p is at least 6 bytes before the end of page
- */
-static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p)
-{
- return (struct ext3_dir_entry_2 *)((char*)p + le16_to_cpu(p->rec_len));
-}
-
-/*
* This function fills a red-black tree with information from a
* directory block. It returns the number directory entries loaded
* into the tree. If there is an error it is returned in err.
@@ -721,7 +722,7 @@ static int dx_make_map (struct ext3_dir_entry_2 *de, int size,
cond_resched();
}
/* XXX: do we need to check rec_len == 0 case? -Chris */
- de = (struct ext3_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+ de = ext3_next_entry(de);
}
return count;
}
@@ -825,7 +826,7 @@ static inline int search_dirblock(struct buffer_head * bh,
return 1;
}
/* prevent looping on a bad block */
- de_len = le16_to_cpu(de->rec_len);
+ de_len = ext3_rec_len_from_disk(de->rec_len);
if (de_len <= 0)
return -1;
offset += de_len;
@@ -1138,7 +1139,7 @@ dx_move_dirents(char *from, char *to, struct dx_map_entry *map, int count)
rec_len = EXT3_DIR_REC_LEN(de->name_len);
memcpy (to, de, rec_len);
((struct ext3_dir_entry_2 *) to)->rec_len =
- cpu_to_le16(rec_len);
+ ext3_rec_len_to_disk(rec_len);
de->inode = 0;
map++;
to += rec_len;
@@ -1157,13 +1158,12 @@ static struct ext3_dir_entry_2* dx_pack_dirents(char *base, int size)

prev = to = de;
while ((char*)de < base + size) {
- next = (struct ext3_dir_entry_2 *) ((char *) de +
- le16_to_cpu(de->rec_len));
+ next = ext3_next_entry(de);
if (de->inode && de->name_len) {
rec_len = EXT3_DIR_REC_LEN(de->name_len);
if (de > to)
memmove(to, de, rec_len);
- to->rec_len = cpu_to_le16(rec_len);
+ to->rec_len = ext3_rec_len_to_disk(rec_len);
prev = to;
to = (struct ext3_dir_entry_2 *) (((char *) to) + rec_len);
}
@@ -1237,8 +1237,8 @@ static struct ext3_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
/* Fancy dance to stay within two buffers */
de2 = dx_move_dirents(data1, data2, map + split, count - split);
de = dx_pack_dirents(data1,blocksize);
- de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
- de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+ de->rec_len = ext3_rec_len_to_disk(data1 + blocksize - (char *) de);
+ de2->rec_len = ext3_rec_len_to_disk(data2 + blocksize - (char *) de2);
dxtrace(dx_show_leaf (hinfo, (struct ext3_dir_entry_2 *) data1, blocksize, 1));
dxtrace(dx_show_leaf (hinfo, (struct ext3_dir_entry_2 *) data2, blocksize, 1));

@@ -1309,7 +1309,7 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
return -EEXIST;
}
nlen = EXT3_DIR_REC_LEN(de->name_len);
- rlen = le16_to_cpu(de->rec_len);
+ rlen = ext3_rec_len_from_disk(de->rec_len);
if ((de->inode? rlen - nlen: rlen) >= reclen)
break;
de = (struct ext3_dir_entry_2 *)((char *)de + rlen);
@@ -1328,11 +1328,11 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,

/* By now the buffer is marked for journaling */
nlen = EXT3_DIR_REC_LEN(de->name_len);
- rlen = le16_to_cpu(de->rec_len);
+ rlen = ext3_rec_len_from_disk(de->rec_len);
if (de->inode) {
struct ext3_dir_entry_2 *de1 = (struct ext3_dir_entry_2 *)((char *)de + nlen);
- de1->rec_len = cpu_to_le16(rlen - nlen);
- de->rec_len = cpu_to_le16(nlen);
+ de1->rec_len = ext3_rec_len_to_disk(rlen - nlen);
+ de->rec_len = ext3_rec_len_to_disk(nlen);
de = de1;
}
de->file_type = EXT3_FT_UNKNOWN;
@@ -1410,17 +1410,18 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,

/* The 0th block becomes the root, move the dirents out */
fde = &root->dotdot;
- de = (struct ext3_dir_entry_2 *)((char *)fde + le16_to_cpu(fde->rec_len));
+ de = (struct ext3_dir_entry_2 *)((char *)fde +
+ ext3_rec_len_from_disk(fde->rec_len));
len = ((char *) root) + blocksize - (char *) de;
memcpy (data1, de, len);
de = (struct ext3_dir_entry_2 *) data1;
top = data1 + len;
- while ((char *)(de2=(void*)de+le16_to_cpu(de->rec_len)) < top)
+ while ((char *)(de2 = ext3_next_entry(de)) < top)
de = de2;
- de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+ de->rec_len = ext3_rec_len_to_disk(data1 + blocksize - (char *) de);
/* Initialize the root; the dot dirents already exist */
de = (struct ext3_dir_entry_2 *) (&root->dotdot);
- de->rec_len = cpu_to_le16(blocksize - EXT3_DIR_REC_LEN(2));
+ de->rec_len = ext3_rec_len_to_disk(blocksize - EXT3_DIR_REC_LEN(2));
memset (&root->info, 0, sizeof(root->info));
root->info.info_length = sizeof(root->info);
root->info.hash_version = EXT3_SB(dir->i_sb)->s_def_hash_version;
@@ -1507,7 +1508,7 @@ static int ext3_add_entry (handle_t *handle, struct dentry *dentry,
return retval;
de = (struct ext3_dir_entry_2 *) bh->b_data;
de->inode = 0;
- de->rec_len = cpu_to_le16(blocksize);
+ de->rec_len = ext3_rec_len_to_disk(blocksize);
return add_dirent_to_buf(handle, dentry, inode, de, bh);
}

@@ -1571,7 +1572,7 @@ static int ext3_dx_add_entry(handle_t *handle, struct dentry *dentry,
goto cleanup;
node2 = (struct dx_node *)(bh2->b_data);
entries2 = node2->entries;
- node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
+ node2->fake.rec_len = ext3_rec_len_to_disk(sb->s_blocksize);
node2->fake.inode = 0;
BUFFER_TRACE(frame->bh, "get_write_access");
err = ext3_journal_get_write_access(handle, frame->bh);
@@ -1670,9 +1671,9 @@ static int ext3_delete_entry (handle_t *handle,
BUFFER_TRACE(bh, "get_write_access");
ext3_journal_get_write_access(handle, bh);
if (pde)
- pde->rec_len =
- cpu_to_le16(le16_to_cpu(pde->rec_len) +
- le16_to_cpu(de->rec_len));
+ pde->rec_len = ext3_rec_len_to_disk(
+ ext3_rec_len_from_disk(pde->rec_len) +
+ ext3_rec_len_from_disk(de->rec_len));
else
de->inode = 0;
dir->i_version++;
@@ -1680,10 +1681,9 @@ static int ext3_delete_entry (handle_t *handle,
ext3_journal_dirty_metadata(handle, bh);
return 0;
}
- i += le16_to_cpu(de->rec_len);
+ i += ext3_rec_len_from_disk(de->rec_len);
pde = de;
- de = (struct ext3_dir_entry_2 *)
- ((char *) de + le16_to_cpu(de->rec_len));
+ de = ext3_next_entry(de);
}
return -ENOENT;
}
@@ -1817,13 +1817,12 @@ retry:
de = (struct ext3_dir_entry_2 *) dir_block->b_data;
de->inode = cpu_to_le32(inode->i_ino);
de->name_len = 1;
- de->rec_len = cpu_to_le16(EXT3_DIR_REC_LEN(de->name_len));
+ de->rec_len = ext3_rec_len_to_disk(EXT3_DIR_REC_LEN(de->name_len));
strcpy (de->name, ".");
ext3_set_de_type(dir->i_sb, de, S_IFDIR);
- de = (struct ext3_dir_entry_2 *)
- ((char *) de + le16_to_cpu(de->rec_len));
+ de = ext3_next_entry(de);
de->inode = cpu_to_le32(dir->i_ino);
- de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize-EXT3_DIR_REC_LEN(1));
+ de->rec_len = ext3_rec_len_to_disk(inode->i_sb->s_blocksize-EXT3_DIR_REC_LEN(1));
de->name_len = 2;
strcpy (de->name, "..");
ext3_set_de_type(dir->i_sb, de, S_IFDIR);
@@ -1875,8 +1874,7 @@ static int empty_dir (struct inode * inode)
return 1;
}
de = (struct ext3_dir_entry_2 *) bh->b_data;
- de1 = (struct ext3_dir_entry_2 *)
- ((char *) de + le16_to_cpu(de->rec_len));
+ de1 = ext3_next_entry(de);
if (le32_to_cpu(de->inode) != inode->i_ino ||
!le32_to_cpu(de1->inode) ||
strcmp (".", de->name) ||
@@ -1887,9 +1885,9 @@ static int empty_dir (struct inode * inode)
brelse (bh);
return 1;
}
- offset = le16_to_cpu(de->rec_len) + le16_to_cpu(de1->rec_len);
- de = (struct ext3_dir_entry_2 *)
- ((char *) de1 + le16_to_cpu(de1->rec_len));
+ offset = ext3_rec_len_from_disk(de->rec_len) +
+ ext3_rec_len_from_disk(de1->rec_len);
+ de = ext3_next_entry(de1);
while (offset < inode->i_size ) {
if (!bh ||
(void *) de >= (void *) (bh->b_data+sb->s_blocksize)) {
@@ -1918,9 +1916,8 @@ static int empty_dir (struct inode * inode)
brelse (bh);
return 0;
}
- offset += le16_to_cpu(de->rec_len);
- de = (struct ext3_dir_entry_2 *)
- ((char *) de + le16_to_cpu(de->rec_len));
+ offset += ext3_rec_len_from_disk(de->rec_len);
+ de = ext3_next_entry(de);
}
brelse (bh);
return 1;
@@ -2274,8 +2271,7 @@ retry:
}

#define PARENT_INO(buffer) \
- ((struct ext3_dir_entry_2 *) ((char *) buffer + \
- le16_to_cpu(((struct ext3_dir_entry_2 *) buffer)->rec_len)))->inode
+ (ext3_next_entry((struct ext3_dir_entry_2 *)(buffer))->inode)

/*
* Anybody can rename anything with this: the permission checks are left to the
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 7aa5556..d9e378d 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -660,6 +660,26 @@ struct ext3_dir_entry_2 {
#define EXT3_DIR_ROUND (EXT3_DIR_PAD - 1)
#define EXT3_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT3_DIR_ROUND) & \
~EXT3_DIR_ROUND)
+#define EXT3_MAX_REC_LEN ((1<<16)-1)
+
+static inline unsigned ext3_rec_len_from_disk(__le16 dlen)
+{
+ unsigned len = le16_to_cpu(dlen);
+
+ if (len == EXT3_MAX_REC_LEN)
+ return 1 << 16;
+ return len;
+}
+
+static inline __le16 ext3_rec_len_to_disk(unsigned len)
+{
+ if (len == (1 << 16))
+ return cpu_to_le16(EXT3_MAX_REC_LEN);
+ else if (len > (1 << 16))
+ BUG();
+ return cpu_to_le16(len);
+}
+
/*
* Hash Tree Directory indexing
* (c) Daniel Phillips, 2001


2007-10-04 20:12:47

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

On Mon, 01 Oct 2007 17:35:46 -0700
Mingming Cao <[email protected]> wrote:

> ext2: Avoid rec_len overflow with 64KB block size
>
> From: Jan Kara <[email protected]>
>
> With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry length. So we store 0xffff instead and convert
> value when read from / written to disk.

This patch clashes in non-trivial ways with
ext2-convert-to-new-aops-fix.patch and perhaps other things which are
already queued for 2.6.24 inclusion, so I'll need to ask for an updated
patch, please.

Also, I'm planning on merging the ext2 reservations code into 2.6.24, so if
we're aiming for complete support of 64k blocksize in 2.6.24's ext2,
additional testing and checking will be needed.

2007-10-04 22:40:56

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

On Oct 04, 2007 13:12 -0700, Andrew Morton wrote:
> On Mon, 01 Oct 2007 17:35:46 -0700
> > ext2: Avoid rec_len overflow with 64KB block size
> >
> > into 16 bits we have for entry length. So we store 0xffff instead and
> > convert value when read from / written to disk.
>
> This patch clashes in non-trivial ways with
> ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> patch, please.

If the rec_len overflow patch isn't going to make it, then we also need
to revert the EXT*_MAX_BLOCK_SIZE change to 65536. It would be possible
to allow this to be up to 32768 w/o the rec_len overflow fix however.

Yes, this does imply that those patches were in the wrong order in the
patch series, and I apologize for that, even if it isn't my fault.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-10-04 23:12:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

On Thu, 4 Oct 2007 16:40:44 -0600
Andreas Dilger <[email protected]> wrote:

> On Oct 04, 2007 13:12 -0700, Andrew Morton wrote:
> > On Mon, 01 Oct 2007 17:35:46 -0700
> > > ext2: Avoid rec_len overflow with 64KB block size
> > >
> > > into 16 bits we have for entry length. So we store 0xffff instead and
> > > convert value when read from / written to disk.
> >
> > This patch clashes in non-trivial ways with
> > ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> > already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> > patch, please.
>
> If the rec_len overflow patch isn't going to make it, then we also need
> to revert the EXT*_MAX_BLOCK_SIZE change to 65536. It would be possible
> to allow this to be up to 32768 w/o the rec_len overflow fix however.
>

Ok, thanks, I dropped ext3-support-large-blocksize-up-to-pagesize.patch and
ext2-support-large-blocksize-up-to-pagesize.patch.

2007-10-08 12:40:31

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

On Thu 04-10-07 13:12:07, Andrew Morton wrote:
> On Mon, 01 Oct 2007 17:35:46 -0700
> Mingming Cao <[email protected]> wrote:
>
> > ext2: Avoid rec_len overflow with 64KB block size
> >
> > From: Jan Kara <[email protected]>
> >
> > With 64KB blocksize, a directory entry can have size 64KB which does not fit
> > into 16 bits we have for entry lenght. So we store 0xffff instead and convert
> > value when read from / written to disk.
>
> This patch clashes in non-trivial ways with
> ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> patch, please.
>
> Also, I'm planning on merging the ext2 reservations code into 2.6.24, so if
> we're aiming for complete support of 64k blocksize in 2.6.24's ext2,
> additional testing and checking will be needed.
OK, I'll fixup those rejects and send a new patch.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2007-10-11 10:07:31

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

On Thu 04-10-07 16:11:21, Andrew Morton wrote:
> On Thu, 4 Oct 2007 16:40:44 -0600
> Andreas Dilger <[email protected]> wrote:
>
> > On Oct 04, 2007 13:12 -0700, Andrew Morton wrote:
> > > On Mon, 01 Oct 2007 17:35:46 -0700
> > > > ext2: Avoid rec_len overflow with 64KB block size
> > > >
> > > > into 16 bits we have for entry length. So we store 0xffff instead and
> > > > convert value when read from / written to disk.
> > >
> > > This patch clashes in non-trivial ways with
> > > ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> > > already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> > > patch, please.
> >
> > If the rec_len overflow patch isn't going to make it, then we also need
> > to revert the EXT*_MAX_BLOCK_SIZE change to 65536. It would be possible
> > to allow this to be up to 32768 w/o the rec_len overflow fix however.
> >
>
> Ok, thanks, I dropped ext3-support-large-blocksize-up-to-pagesize.patch and
> ext2-support-large-blocksize-up-to-pagesize.patch.
Sorry for the delayed answer, but I had some urgent bugs to fix...
Why did you drop ext3-support-large-blocksize-up-to-pagesize.patch? As far
as I understand your previous email (and also as I've checked against
2.6.23-rc8-mm2), the patch fixing rec_len overflow clashes only for ext2...
I'll send you an updated patch for ext2 in a moment.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2007-10-11 10:14:56

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

On Thu, 11 Oct 2007 12:30:03 +0200 Jan Kara <[email protected]> wrote:

> On Thu 04-10-07 16:11:21, Andrew Morton wrote:
> > On Thu, 4 Oct 2007 16:40:44 -0600
> > Andreas Dilger <[email protected]> wrote:
> >
> > > On Oct 04, 2007 13:12 -0700, Andrew Morton wrote:
> > > > On Mon, 01 Oct 2007 17:35:46 -0700
> > > > > ext2: Avoid rec_len overflow with 64KB block size
> > > > >
> > > > > into 16 bits we have for entry length. So we store 0xffff instead and
> > > > > convert value when read from / written to disk.
> > > >
> > > > This patch clashes in non-trivial ways with
> > > > ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> > > > already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> > > > patch, please.
> > >
> > > If the rec_len overflow patch isn't going to make it, then we also need
> > > to revert the EXT*_MAX_BLOCK_SIZE change to 65536. It would be possible
> > > to allow this to be up to 32768 w/o the rec_len overflow fix however.
> > >
> >
> > Ok, thanks, I dropped ext3-support-large-blocksize-up-to-pagesize.patch and
> > ext2-support-large-blocksize-up-to-pagesize.patch.
> Sorry for the delayed answer, but I had some urgent bugs to fix...

You exceeded my memory span.

> Why did you drop ext3-support-large-blocksize-up-to-pagesize.patch?

I forget. I'll bring it back and see what happens.

> As far
> as I understand your previous email (and also as I've checked against
> 2.6.23-rc8-mm2), the patch fixing rec_len overflow clashes only for ext2...
> I'll send you an updated patch for ext2 in a moment.

ok.. I'm basically not applying anything any more - the whole thing
is a teetering wreck. I need to go through the input queue delicately
adding things which look important or relatively non-injurious.

2007-10-11 10:56:19

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

On Thu 04-10-07 13:12:07, Andrew Morton wrote:
> On Mon, 01 Oct 2007 17:35:46 -0700
> Mingming Cao <[email protected]> wrote:
>
> > ext2: Avoid rec_len overflow with 64KB block size
> >
> > From: Jan Kara <[email protected]>
> >
> > With 64KB blocksize, a directory entry can have size 64KB which does not fit
> > into 16 bits we have for entry length. So we store 0xffff instead and convert
> > value when read from / written to disk.
>
> This patch clashes in non-trivial ways with
> ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> patch, please.
>
> Also, I'm planning on merging the ext2 reservations code into 2.6.24, so if
> we're aiming for complete support of 64k blocksize in 2.6.24's ext2,
> additional testing and checking will be needed.
OK, attached is a patch diffed against 2.6.23-rc9-mm2 - does that work
fine for you?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

------

With 64KB blocksize, a directory entry can have size 64KB which does not fit
into 16 bits we have for entry length. So we store 0xffff instead and convert
value when read from / written to disk.

Signed-off-by: Jan Kara <[email protected]>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-mm/fs/ext2/dir.c linux-2.6.23-mm-1-ext2_64k_rec_len/fs/ext2/dir.c
--- linux-2.6.23-mm/fs/ext2/dir.c 2007-10-11 12:08:16.000000000 +0200
+++ linux-2.6.23-mm-1-ext2_64k_rec_len/fs/ext2/dir.c 2007-10-11 12:14:24.000000000 +0200
@@ -28,6 +28,24 @@

typedef struct ext2_dir_entry_2 ext2_dirent;

+static inline unsigned ext2_rec_len_from_disk(__le16 dlen)
+{
+ unsigned len = le16_to_cpu(dlen);
+
+ if (len == EXT2_MAX_REC_LEN)
+ return 1 << 16;
+ return len;
+}
+
+static inline __le16 ext2_rec_len_to_disk(unsigned len)
+{
+ if (len == (1 << 16))
+ return cpu_to_le16(EXT2_MAX_REC_LEN);
+ else if (len > (1 << 16))
+ BUG();
+ return cpu_to_le16(len);
+}
+
/*
* ext2 uses block-sized chunks. Arguably, sector-sized ones would be
* more robust, but we have what we have
@@ -106,7 +124,7 @@ static void ext2_check_page(struct page
}
for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
p = (ext2_dirent *)(kaddr + offs);
- rec_len = le16_to_cpu(p->rec_len);
+ rec_len = ext2_rec_len_from_disk(p->rec_len);

if (rec_len < EXT2_DIR_REC_LEN(1))
goto Eshort;
@@ -204,7 +222,8 @@ static inline int ext2_match (int len, c
*/
static inline ext2_dirent *ext2_next_entry(ext2_dirent *p)
{
- return (ext2_dirent *)((char*)p + le16_to_cpu(p->rec_len));
+ return (ext2_dirent *)((char*)p +
+ ext2_rec_len_from_disk(p->rec_len));
}

static inline unsigned
@@ -316,7 +335,7 @@ ext2_readdir (struct file * filp, void *
return 0;
}
}
- filp->f_pos += le16_to_cpu(de->rec_len);
+ filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
}
ext2_put_page(page);
}
@@ -425,7 +444,7 @@ void ext2_set_link(struct inode *dir, st
{
loff_t pos = page_offset(page) +
(char *) de - (char *) page_address(page);
- unsigned len = le16_to_cpu(de->rec_len);
+ unsigned len = ext2_rec_len_from_disk(de->rec_len);
int err;

lock_page(page);
@@ -482,7 +501,7 @@ int ext2_add_link (struct dentry *dentry
/* We hit i_size */
name_len = 0;
rec_len = chunk_size;
- de->rec_len = cpu_to_le16(chunk_size);
+ de->rec_len = ext2_rec_len_to_disk(chunk_size);
de->inode = 0;
goto got_it;
}
@@ -496,7 +515,7 @@ int ext2_add_link (struct dentry *dentry
if (ext2_match (namelen, name, de))
goto out_unlock;
name_len = EXT2_DIR_REC_LEN(de->name_len);
- rec_len = le16_to_cpu(de->rec_len);
+ rec_len = ext2_rec_len_from_disk(de->rec_len);
if (!de->inode && rec_len >= reclen)
goto got_it;
if (rec_len >= name_len + reclen)
@@ -518,8 +537,8 @@ got_it:
goto out_unlock;
if (de->inode) {
ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
- de1->rec_len = cpu_to_le16(rec_len - name_len);
- de->rec_len = cpu_to_le16(name_len);
+ de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+ de->rec_len = ext2_rec_len_to_disk(name_len);
de = de1;
}
de->name_len = namelen;
@@ -550,7 +569,8 @@ int ext2_delete_entry (struct ext2_dir_e
struct inode *inode = mapping->host;
char *kaddr = page_address(page);
unsigned from = ((char*)dir - kaddr) & ~(ext2_chunk_size(inode)-1);
- unsigned to = ((char*)dir - kaddr) + le16_to_cpu(dir->rec_len);
+ unsigned to = ((char*)dir - kaddr) +
+ ext2_rec_len_from_disk(dir->rec_len);
loff_t pos;
ext2_dirent * pde = NULL;
ext2_dirent * de = (ext2_dirent *) (kaddr + from);
@@ -574,7 +594,7 @@ int ext2_delete_entry (struct ext2_dir_e
&page, NULL);
BUG_ON(err);
if (pde)
- pde->rec_len = cpu_to_le16(to - from);
+ pde->rec_len = ext2_rec_len_to_disk(to - from);
dir->inode = 0;
err = ext2_commit_chunk(page, pos, to - from);
inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC;
@@ -610,14 +630,14 @@ int ext2_make_empty(struct inode *inode,
memset(kaddr, 0, chunk_size);
de = (struct ext2_dir_entry_2 *)kaddr;
de->name_len = 1;
- de->rec_len = cpu_to_le16(EXT2_DIR_REC_LEN(1));
+ de->rec_len = ext2_rec_len_to_disk(EXT2_DIR_REC_LEN(1));
memcpy (de->name, ".\0\0", 4);
de->inode = cpu_to_le32(inode->i_ino);
ext2_set_de_type (de, inode);

de = (struct ext2_dir_entry_2 *)(kaddr + EXT2_DIR_REC_LEN(1));
de->name_len = 2;
- de->rec_len = cpu_to_le16(chunk_size - EXT2_DIR_REC_LEN(1));
+ de->rec_len = ext2_rec_len_to_disk(chunk_size - EXT2_DIR_REC_LEN(1));
de->inode = cpu_to_le32(parent->i_ino);
memcpy (de->name, "..\0", 4);
ext2_set_de_type (de, inode);
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-mm/include/linux/ext2_fs.h linux-2.6.23-mm-1-ext2_64k_rec_len/include/linux/ext2_fs.h
--- linux-2.6.23-mm/include/linux/ext2_fs.h 2007-10-11 12:08:34.000000000 +0200
+++ linux-2.6.23-mm-1-ext2_64k_rec_len/include/linux/ext2_fs.h 2007-10-11 12:11:22.000000000 +0200
@@ -561,6 +561,7 @@ enum {
#define EXT2_DIR_ROUND (EXT2_DIR_PAD - 1)
#define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND) & \
~EXT2_DIR_ROUND)
+#define EXT2_MAX_REC_LEN ((1<<16)-1)

static inline ext2_fsblk_t
ext2_group_first_block_no(struct super_block *sb, unsigned long group_no)

2007-10-18 04:07:35

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

On Thu, 11 Oct 2007 13:18:49 +0200 Jan Kara <[email protected]> wrote:

> +static inline __le16 ext2_rec_len_to_disk(unsigned len)
> +{
> + if (len == (1 << 16))
> + return cpu_to_le16(EXT2_MAX_REC_LEN);
> + else if (len > (1 << 16))
> + BUG();
> + return cpu_to_le16(len);
> +}

Of course, ext2 shouldn't be trying to write a bad record length into a
directory entry. But are we sure that there is no way in which this
situation could occur if the on-disk data was _already_ bad?

Because it is very bad for a filesystem to go BUG in response to unexpected
data on the disk.

2007-10-18 04:09:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

On Thu, 11 Oct 2007 13:18:49 +0200 Jan Kara <[email protected]> wrote:

> With 64KB blocksize, a directory entry can have size 64KB which does not fit
> into 16 bits we have for entry length. So we store 0xffff instead and convert
> value when read from / written to disk.

btw, this changes ext2's on-disk format.

a) is the ext2 format documented anywhere? If so, that document will
need updating.

b) what happens when an old ext2 driver tries to read and/or write this
directory entry? Do we need a compat flag for it?

c) what happens when old and new ext3 or ext4 try to read/write this
directory entry?


2007-10-18 09:03:52

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

On Wed, 17 Oct 2007, Andrew Morton wrote:

> b) what happens when an old ext2 driver tries to read and/or write this
> directory entry? Do we need a compat flag for it?

Old ext2 only supports up to 4k

include/linux/ext2_fs.h:

#define EXT2_MIN_BLOCK_SIZE 1024
#define EXT2_MAX_BLOCK_SIZE 4096
#define EXT2_MIN_BLOCK_LOG_SIZE 10

Should fail to mount the volume since the block size is too large.

2007-10-18 09:12:23

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

On Thu, 18 Oct 2007 02:03:39 -0700 (PDT) Christoph Lameter <[email protected]> wrote:

> On Wed, 17 Oct 2007, Andrew Morton wrote:
>
> > b) what happens when an old ext2 driver tries to read and/or write this
> > directory entry? Do we need a compat flag for it?
>
> Old ext2 only supports up to 4k
>
> include/linux/ext2_fs.h:
>
> #define EXT2_MIN_BLOCK_SIZE 1024
> #define EXT2_MAX_BLOCK_SIZE 4096
> #define EXT2_MIN_BLOCK_LOG_SIZE 10
>
> Should fail to mount the volume since the block size is too large.

should, but does it?

box:/usr/src/25> grep MAX_BLOCK_SIZE fs/ext2/*.[ch] include/linux/ext2*
include/linux/ext2_fs.h:#define EXT2_MAX_BLOCK_SIZE 4096
box:/usr/src/25>

2007-10-19 02:06:44

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size

On Wed, 2007-10-17 at 21:09 -0700, Andrew Morton wrote:
> On Thu, 11 Oct 2007 13:18:49 +0200 Jan Kara <[email protected]> wrote:
>
> > With 64KB blocksize, a directory entry can have size 64KB which does not fit
> > into 16 bits we have for entry length. So we store 0xffff instead and convert
> > value when read from / written to disk.
>
> btw, this changes ext2's on-disk format.
>
Just to clarify: this only changes the directory entry format on
ext2/3/4 filesystems with 64k block size. But currently, without kernel
changes, ext2/3/4 does not support 64k block size.

> a) is the ext2 format documented anywhere? If so, that document will
> need updating.
>

The e2fsprogs tools need to be changed to sync up with this change.

Ted wrote a paper a while back that describes the ext2 disk format:
http://web.mit.edu/tytso/www/linux/ext2intro.html

Documentation/filesystems/ext2.txt doesn't document the ext2 format.
That document is out-dated and needs to be reviewed and cleaned up.

> b) what happens when an old ext2 driver tries to read and/or write this
> directory entry? Do we need a compat flag for it?
>
> c) what happens when old and new ext3 or ext4 try to read/write this
> directory entry?
>

Without the first patch in this series, the ext2 large blocksize support
patch, mounting an ext2 filesystem with 64k block size fails:

[PATCH 1/2] ext2: Support large blocksize up to PAGESIZE
http://lkml.org/lkml/2007/10/1/361

So an old ext2/3/4 driver will never get to access directory entries in
the changed 64k block size format.


Regards,

Mingming
