Hello,
This patch set tries to solve the problem that the long-lasting page caches of
the ext4 superblock and of its superblock journaling disturb page migration.
I've been testing the CMA feature on my ARM-based platform
and found that two kinds of page cache cannot be migrated:
the page caches holding the ext4 superblock and its journaling data.
Currently ext4 reads the superblock with sb_bread(), which allocates the page
from the movable area. The problem is that ext4 holds that page until the
filesystem is unmounted, so if the root filesystem is ext4 the page can never
be migrated. The journaling data for the superblock cannot be migrated either.
I introduce a new API for allocating page cache from the non-movable area.
It is useful for ext4/ext3 and other filesystems that want to hold a page cache
for a long time.
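For illustration only, here is a minimal sketch of how a filesystem could use
the proposed helper instead of sb_bread(); sb_bread_nonmovable() is the name
added by patch 1/3, and the surrounding function is made up for this example
with error handling trimmed:

#include <linux/buffer_head.h>

/* Read the superblock into a buffer that stays referenced for the whole
 * mount.  sb_bread_nonmovable() allocates the backing page outside the
 * movable/CMA area, so holding it does not block page migration.
 */
static int example_read_sb(struct super_block *sb, sector_t sb_block)
{
	struct buffer_head *bh;

	bh = sb_bread_nonmovable(sb, sb_block);
	if (!bh)
		return -EIO;

	/* ... parse bh->b_data; a real filesystem keeps bh until unmount ... */
	brelse(bh);
	return 0;
}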
The series has 3 patches:
1. Patch 1/3: introduce a new API that creates page cache from the non-movable area
2. Patch 2/3: make ext4 use the new API to read the superblock
3. Patch 3/3: make jbd/jbd2 use the new API for superblock journaling
This patchset is based on linux-next-20140814.
Thanks a lot.
Gioh Kim (3):
fs/buffer.c: allocate buffer cache from non-movable area
ext4: allocate buffer-cache for superblock in non-movable area
jbd-jbd2-allocate-buffer-cache-for-superblock-inode-.patch
fs/buffer.c | 63 ++++++++++++++++++++++++++++++----
fs/ext4/super.c | 6 +--
fs/jbd/journal.c | 2 -
fs/jbd2/journal.c | 2 -
include/linux/buffer_head.h | 10 +++++
5 files changed, 71 insertions(+), 12 deletions(-)
A buffer cache is allocated from the movable area
because it is normally referenced for a while and released soon.
But some filesystems hold a buffer cache for a long time,
which can disturb page migration.
Introduce a new API to allocate buffer cache from the
non-movable area.
Signed-off-by: Gioh Kim <[email protected]>
---
fs/buffer.c | 63 ++++++++++++++++++++++++++++++++++++++-----
include/linux/buffer_head.h | 10 +++++++
2 files changed, 66 insertions(+), 7 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 8f05111..7ef658f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -993,7 +993,7 @@ init_page_buffers(struct page *page, struct block_device *bdev,
*/
static int
grow_dev_page(struct block_device *bdev, sector_t block,
- pgoff_t index, int size, int sizebits)
+ pgoff_t index, int size, int sizebits, gfp_t movable_mask)
{
struct inode *inode = bdev->bd_inode;
struct page *page;
@@ -1003,7 +1003,8 @@ grow_dev_page(struct block_device *bdev, sector_t block,
gfp_t gfp_mask;
gfp_mask = mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS;
- gfp_mask |= __GFP_MOVABLE;
+ if (movable_mask & __GFP_MOVABLE)
+ gfp_mask |= __GFP_MOVABLE;
/*
* XXX: __getblk_slow() can not really deal with failure and
* will endlessly loop on improvised global reclaim. Prefer
@@ -1058,7 +1059,8 @@ failed:
* that page was dirty, the buffers are set dirty also.
*/
static int
-grow_buffers(struct block_device *bdev, sector_t block, int size)
+grow_buffers(struct block_device *bdev, sector_t block,
+ int size, gfp_t movable_mask)
{
pgoff_t index;
int sizebits;
@@ -1085,11 +1087,12 @@ grow_buffers(struct block_device *bdev, sector_t block, int size)
}
/* Create a page with the proper size buffers.. */
- return grow_dev_page(bdev, block, index, size, sizebits);
+ return grow_dev_page(bdev, block, index, size, sizebits, movable_mask);
}
static struct buffer_head *
-__getblk_slow(struct block_device *bdev, sector_t block, int size)
+__getblk_slow(struct block_device *bdev, sector_t block,
+ int size, gfp_t movable_mask)
{
/* Size must be multiple of hard sectorsize */
if (unlikely(size & (bdev_logical_block_size(bdev)-1) ||
@@ -1111,7 +1114,7 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
if (bh)
return bh;
- ret = grow_buffers(bdev, block, size);
+ ret = grow_buffers(bdev, block, size, movable_mask);
if (ret < 0)
return NULL;
if (ret == 0)
@@ -1385,11 +1388,34 @@ __getblk(struct block_device *bdev, sector_t block, unsigned size)
might_sleep();
if (bh == NULL)
- bh = __getblk_slow(bdev, block, size);
+ bh = __getblk_slow(bdev, block, size, __GFP_MOVABLE);
return bh;
}
EXPORT_SYMBOL(__getblk);
+ /*
+ * __getblk_nonmovable will locate (and, if necessary, create) the buffer_head
+ * which corresponds to the passed block_device, block and size. The
+ * returned buffer has its reference count incremented.
+ *
+ * The page cache is allocated from non-movable area
+ * not to prevent page migration.
+ *
+ * __getblk()_nonmovable will lock up the machine
+ * if grow_dev_page's try_to_free_buffers() attempt is failing. FIXME, perhaps?
+ */
+struct buffer_head *
+__getblk_nonmovable(struct block_device *bdev, sector_t block, unsigned size)
+{
+ struct buffer_head *bh = __find_get_block(bdev, block, size);
+
+ might_sleep();
+ if (bh == NULL)
+ bh = __getblk_slow(bdev, block, size, 0);
+ return bh;
+}
+EXPORT_SYMBOL(__getblk_nonmovable);
+
/*
* Do async read-ahead on a buffer..
*/
@@ -1410,6 +1436,7 @@ EXPORT_SYMBOL(__breadahead);
* @size: size (in bytes) to read
*
* Reads a specified block, and returns buffer head that contains it.
+ * The page cache is allocated from movable area so that it can be migrated.
* It returns NULL if the block was unreadable.
*/
struct buffer_head *
@@ -1423,6 +1450,28 @@ __bread(struct block_device *bdev, sector_t block, unsigned size)
}
EXPORT_SYMBOL(__bread);
+/**
+ * __bread_nonmovable() - reads a specified block and returns the bh
+ * @bdev: the block_device to read from
+ * @block: number of block
+ * @size: size (in bytes) to read
+ *
+ * Reads a specified block, and returns buffer head that contains it.
+ * The page cache is allocated from non-movable area
+ * not to prevent page migration.
+ * It returns NULL if the block was unreadable.
+ */
+struct buffer_head *
+__bread_nonmovable(struct block_device *bdev, sector_t block, unsigned size)
+{
+ struct buffer_head *bh = __getblk_slow(bdev, block, size, 0);
+
+ if (likely(bh) && !buffer_uptodate(bh))
+ bh = __bread_slow(bh);
+ return bh;
+}
+EXPORT_SYMBOL(__bread_nonmovable);
+
/*
* invalidate_bh_lrus() is called rarely - but not only at unmount.
* This doesn't race because it runs in each cpu either in irq
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 324329c..3f52370 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -177,10 +177,14 @@ struct buffer_head *__find_get_block(struct block_device *bdev, sector_t block,
unsigned size);
struct buffer_head *__getblk(struct block_device *bdev, sector_t block,
unsigned size);
+struct buffer_head *__getblk_nonmovable(struct block_device *bdev,
+ sector_t block, unsigned size);
void __brelse(struct buffer_head *);
void __bforget(struct buffer_head *);
void __breadahead(struct block_device *, sector_t block, unsigned int size);
struct buffer_head *__bread(struct block_device *, sector_t block, unsigned size);
+struct buffer_head *__bread_nonmovable(struct block_device *,
+ sector_t block, unsigned size);
void invalidate_bh_lrus(void);
struct buffer_head *alloc_buffer_head(gfp_t gfp_flags);
void free_buffer_head(struct buffer_head * bh);
@@ -298,6 +302,12 @@ sb_bread(struct super_block *sb, sector_t block)
return __bread(sb->s_bdev, block, sb->s_blocksize);
}
+static inline struct buffer_head *
+sb_bread_nonmovable(struct super_block *sb, sector_t block)
+{
+ return __bread_nonmovable(sb->s_bdev, block, sb->s_blocksize);
+}
+
static inline void
sb_breadahead(struct super_block *sb, sector_t block)
{
--
1.7.9.5
The buffer cache for the superblock disturbs page migration
because it is not released until unmount.
It must be allocated from the non-movable area instead.
Signed-off-by: Gioh Kim <[email protected]>
---
fs/ext4/super.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 32b43ad..0630a88 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3435,7 +3435,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
logical_sb_block = sb_block;
}
- if (!(bh = sb_bread(sb, logical_sb_block))) {
+ if (!(bh = sb_bread_nonmovable(sb, logical_sb_block))) {
ext4_msg(sb, KERN_ERR, "unable to read superblock");
goto out_fail;
}
@@ -3645,7 +3645,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
brelse(bh);
logical_sb_block = sb_block * EXT4_MIN_BLOCK_SIZE;
offset = do_div(logical_sb_block, blocksize);
- bh = sb_bread(sb, logical_sb_block);
+ bh = sb_bread_nonmovable(sb, logical_sb_block);
if (!bh) {
ext4_msg(sb, KERN_ERR,
"Can't read superblock on 2nd try");
@@ -3867,7 +3867,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
for (i = 0; i < db_count; i++) {
block = descriptor_loc(sb, logical_sb_block, i);
- sbi->s_group_desc[i] = sb_bread(sb, block);
+ sbi->s_group_desc[i] = sb_bread_nonmovable(sb, block);
if (!sbi->s_group_desc[i]) {
ext4_msg(sb, KERN_ERR,
"can't read group descriptor %d", i);
--
1.7.9.5
A long-lasting buffer cache can disturb page migration, so
it must be allocated from the non-movable area.
journal_init_inode() creates a buffer cache for superblock journaling.
The superblock exists until system shutdown, so the buffer cache
for the superblock also exists for a long time
and can disturb page migration.
This patch makes that buffer cache be allocated from the non-movable area
so that it does not disturb page migration.
Signed-off-by: Gioh Kim <[email protected]>
---
fs/jbd/journal.c | 2 +-
fs/jbd2/journal.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 06fe11e..2ed8aa2 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -886,7 +886,7 @@ journal_t * journal_init_inode (struct inode *inode)
goto out_err;
}
- bh = __getblk(journal->j_dev, blocknr, journal->j_blocksize);
+ bh = __getblk_nonmovable(journal->j_dev, blocknr, journal->j_blocksize);
if (!bh) {
printk(KERN_ERR
"%s: Cannot get buffer for journal superblock\n",
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 67b8e30..a618e49 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1237,7 +1237,7 @@ journal_t * jbd2_journal_init_inode (struct inode *inode)
goto out_err;
}
- bh = __getblk(journal->j_dev, blocknr, journal->j_blocksize);
+ bh = __getblk_nonmovable(journal->j_dev, blocknr, journal->j_blocksize);
if (!bh) {
printk(KERN_ERR
"%s: Cannot get buffer for journal superblock\n",
--
1.7.9.5
I'm sorry for a typo in the title.
It is 0/3, not 0/2.
I'm sorry for a typo in the title.
It is 1/3, not 1/2.
On 2014-08-14 2:12 PM, Gioh Kim wrote:
> Hello,
>
> This patch try to solve problem that a long-lasting page caches of
> ext4 superblock and journaling of superblock disturb page migration.
>
> I've been testing CMA feature on my ARM-based platform
> and found that two page caches cannot be migrated.
> They are page caches of superblock of ext4 filesystem and its journaling data.
>
> Current ext4 reads superblock with sb_bread() that allocates page
> from movable area. But the problem is that ext4 hold the page until
> it is unmounted. If root filesystem is ext4 the page cannot be migrated forever.
> And also the journaling data for the superblock cannot be migreated.
>
> I introduce a new API for allocating page cache from non-movable area.
> It is useful for ext4/ext3 and others that want to hold page cache for a long time.
>
> I have 3 patchs:
>
> 1. Patch 1/3: introduce a new API that create page cache from non-movable area
> 2. Patch 2/3: have ext4 use the new API to read superblock
> 3. Patch 3/3: have jbd/jbd2 use the new API to make journaling of superblock
>
> This patchset is based on linux-next-20140814.
>
> Thanks a lot.
>
> Gioh Kim (3):
> fs/buffer.c: allocate buffer cache from non-movable area
> ext4: allocate buffer-cache for superblock in non-movable area
> jbd-jbd2-allocate-buffer-cache-for-superblock-inode-.patch
>
> fs/buffer.c | 63 ++++++++++++++++++++++++++++++----
> fs/ext4/super.c | 6 +--
> fs/jbd/journal.c | 2 -
> fs/jbd2/journal.c | 2 -
> include/linux/buffer_head.h | 10 +++++
> 5 files changed, 71 insertions(+), 12 deletions(-)
>
I'm sorry, I forgot to mention that this is the 2nd version of https://lkml.org/lkml/2014/7/22/34.
On Thu, 14 Aug 2014 14:15:40 +0900 Gioh Kim <[email protected]> wrote:
> A buffer cache is allocated from movable area
> because it is referred for a while and released soon.
> But some filesystems are taking buffer cache for a long time
> and it can disturb page migration.
>
> A new API should be introduced to allocate buffer cache from
> non-movable area.
I think the API could and should be more flexible than this.
Rather than making the API be "movable or not movable", let's permit
callers to specify the gfp_t and leave it at that. That way, if
someone later wants to allocate a buffer head with, I dunno,
__GFP_NOTRACK then they can do so.
So the word "movable" shouldn't appear in buffer.c at all, except in a
single place.
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -993,7 +993,7 @@ init_page_buffers(struct page *page, struct block_device *bdev,
> */
> static int
> grow_dev_page(struct block_device *bdev, sector_t block,
> - pgoff_t index, int size, int sizebits)
> + pgoff_t index, int size, int sizebits, gfp_t movable_mask)
s/movable_mask/gfp/
> {
> struct inode *inode = bdev->bd_inode;
> struct page *page;
> @@ -1003,7 +1003,8 @@ grow_dev_page(struct block_device *bdev, sector_t block,
> gfp_t gfp_mask;
>
> gfp_mask = mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS;
> - gfp_mask |= __GFP_MOVABLE;
> + if (movable_mask & __GFP_MOVABLE)
> + gfp_mask |= __GFP_MOVABLE;
This becomes
gfp_mask |= gfp;
> /*
> * XXX: __getblk_slow() can not really deal with failure and
> * will endlessly loop on improvised global reclaim. Prefer
> @@ -1058,7 +1059,8 @@ failed:
> * that page was dirty, the buffers are set dirty also.
> */
> static int
> -grow_buffers(struct block_device *bdev, sector_t block, int size)
> +grow_buffers(struct block_device *bdev, sector_t block,
> + int size, gfp_t movable_mask)
gfp
> {
> pgoff_t index;
> int sizebits;
> @@ -1085,11 +1087,12 @@ grow_buffers(struct block_device *bdev, sector_t block, int size)
> }
>
> /* Create a page with the proper size buffers.. */
> - return grow_dev_page(bdev, block, index, size, sizebits);
> + return grow_dev_page(bdev, block, index, size, sizebits, movable_mask);
> }
>
> static struct buffer_head *
> -__getblk_slow(struct block_device *bdev, sector_t block, int size)
> +__getblk_slow(struct block_device *bdev, sector_t block,
> + int size, gfp_t movable_mask)
gfp
> {
> /* Size must be multiple of hard sectorsize */
> if (unlikely(size & (bdev_logical_block_size(bdev)-1) ||
> @@ -1111,7 +1114,7 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
> if (bh)
> return bh;
>
> - ret = grow_buffers(bdev, block, size);
> + ret = grow_buffers(bdev, block, size, movable_mask);
gfp
> if (ret < 0)
> return NULL;
> if (ret == 0)
> @@ -1385,11 +1388,34 @@ __getblk(struct block_device *bdev, sector_t block, unsigned size)
>
> might_sleep();
> if (bh == NULL)
> - bh = __getblk_slow(bdev, block, size);
> + bh = __getblk_slow(bdev, block, size, __GFP_MOVABLE);
Here is the place where buffer.c mentions "movable".
> return bh;
> }
> EXPORT_SYMBOL(__getblk);
>
> + /*
> + * __getblk_nonmovable will locate (and, if necessary, create) the buffer_head
> + * which corresponds to the passed block_device, block and size. The
> + * returned buffer has its reference count incremented.
> + *
> + * The page cache is allocated from non-movable area
> + * not to prevent page migration.
> + *
> + * __getblk()_nonmovable will lock up the machine
> + * if grow_dev_page's try_to_free_buffers() attempt is failing. FIXME, perhaps?
> + */
> +struct buffer_head *
> +__getblk_nonmovable(struct block_device *bdev, sector_t block, unsigned size)
> +{
> + struct buffer_head *bh = __find_get_block(bdev, block, size);
> +
> + might_sleep();
> + if (bh == NULL)
> + bh = __getblk_slow(bdev, block, size, 0);
> + return bh;
> +}
> +EXPORT_SYMBOL(__getblk_nonmovable);
Suggest this be called __getblk_gfp(bdev, block, size, gfp) and then
__getblk() be changed to call __getblk_gfp(..., __GFP_MOVABLE).
We could then write a __getblk_nonmovable() which calls __getblk_gfp()
(a static inlined one-line function) or we can just call
__getblk_gfp(..., 0) directly from filesystems.
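For example, a rough sketch of that shape (hypothetical, only to illustrate
the suggestion; not final code) could be:

struct buffer_head *__getblk_gfp(struct block_device *bdev, sector_t block,
				 unsigned size, gfp_t gfp);

static inline struct buffer_head *
__getblk(struct block_device *bdev, sector_t block, unsigned size)
{
	/* today's default: movable page cache */
	return __getblk_gfp(bdev, block, size, __GFP_MOVABLE);
}

static inline struct buffer_head *
__getblk_nonmovable(struct block_device *bdev, sector_t block, unsigned size)
{
	/* long-lived buffers: no __GFP_MOVABLE, so CMA/migration is not blocked */
	return __getblk_gfp(bdev, block, size, 0);
}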
> @@ -1423,6 +1450,28 @@ __bread(struct block_device *bdev, sector_t block, unsigned size)
> }
> EXPORT_SYMBOL(__bread);
>
> +/**
> + * __bread_nonmovable() - reads a specified block and returns the bh
> + * @bdev: the block_device to read from
> + * @block: number of block
> + * @size: size (in bytes) to read
> + *
> + * Reads a specified block, and returns buffer head that contains it.
> + * The page cache is allocated from non-movable area
> + * not to prevent page migration.
> + * It returns NULL if the block was unreadable.
> + */
> +struct buffer_head *
> +__bread_nonmovable(struct block_device *bdev, sector_t block, unsigned size)
> +{
> + struct buffer_head *bh = __getblk_slow(bdev, block, size, 0);
> +
> + if (likely(bh) && !buffer_uptodate(bh))
> + bh = __bread_slow(bh);
> + return bh;
> +}
> +EXPORT_SYMBOL(__bread_nonmovable);
Treat this in the same fashion as __getblk_nonmovable().
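i.e. something along these lines (again only a sketch of the idea, not final
code):

struct buffer_head *__bread_gfp(struct block_device *bdev, sector_t block,
				unsigned size, gfp_t gfp);

static inline struct buffer_head *
__bread(struct block_device *bdev, sector_t block, unsigned size)
{
	return __bread_gfp(bdev, block, size, __GFP_MOVABLE);
}

static inline struct buffer_head *
sb_bread_nonmovable(struct super_block *sb, sector_t block)
{
	return __bread_gfp(sb->s_bdev, block, sb->s_blocksize, 0);
}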
On Thu, 14 Aug 2014 14:12:17 +0900 Gioh Kim <[email protected]> wrote:
> This patch try to solve problem that a long-lasting page caches of
> ext4 superblock and journaling of superblock disturb page migration.
>
> I've been testing CMA feature on my ARM-based platform
> and found that two page caches cannot be migrated.
> They are page caches of superblock of ext4 filesystem and its journaling data.
>
> Current ext4 reads superblock with sb_bread() that allocates page
> from movable area. But the problem is that ext4 hold the page until
> it is unmounted. If root filesystem is ext4 the page cannot be migrated forever.
> And also the journaling data for the superblock cannot be migreated.
>
> I introduce a new API for allocating page cache from non-movable area.
> It is useful for ext4/ext3 and others that want to hold page cache for a long time.
All seems reasonable to me. The additional overhead in buffer.c from
additional function arguments is regrettable but I don't see a
non-hacky alternative.
One vital question which the changelog doesn't really address (it
should): how important is this patch? Is your test system presently
"completely dead in the water utterly unusable" or "occasionally not
quite as good as it could be". Somewhere in between?
See, the patch adds costs. I'd like us to have a good understanding of
what benefits it brings.
On Thu 14-08-14 14:26:10, Andrew Morton wrote:
> On Thu, 14 Aug 2014 14:12:17 +0900 Gioh Kim <[email protected]> wrote:
>
> > This patch try to solve problem that a long-lasting page caches of
> > ext4 superblock and journaling of superblock disturb page migration.
> >
> > I've been testing CMA feature on my ARM-based platform
> > and found that two page caches cannot be migrated.
> > They are page caches of superblock of ext4 filesystem and its journaling data.
> >
> > Current ext4 reads superblock with sb_bread() that allocates page
> > from movable area. But the problem is that ext4 hold the page until
> > it is unmounted. If root filesystem is ext4 the page cannot be migrated forever.
> > And also the journaling data for the superblock cannot be migreated.
> >
> > I introduce a new API for allocating page cache from non-movable area.
> > It is useful for ext4/ext3 and others that want to hold page cache for a long time.
>
> All seems reasonable to me. The additional overhead in buffer.c from
> additional function arguments is regrettable but I don't see a
> non-hacky alternative.
>
> One vital question which the changelog doesn't really address (it
> should): how important is this patch? Is your test system presently
> "completely dead in the water utterly unusable" or "occasionally not
> quite as good as it could be". Somewhere in between?
I would also be interested in how much these patches make things better,
because I would expect all the metadata that is currently journalled to be
unmovable as well.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 2014-08-17 3:52 AM, Jan Kara wrote:
> On Thu 14-08-14 14:26:10, Andrew Morton wrote:
>> On Thu, 14 Aug 2014 14:12:17 +0900 Gioh Kim <[email protected]> wrote:
>>
>>> This patch try to solve problem that a long-lasting page caches of
>>> ext4 superblock and journaling of superblock disturb page migration.
>>>
>>> I've been testing CMA feature on my ARM-based platform
>>> and found that two page caches cannot be migrated.
>>> They are page caches of superblock of ext4 filesystem and its journaling data.
>>>
>>> Current ext4 reads superblock with sb_bread() that allocates page
>>> from movable area. But the problem is that ext4 hold the page until
>>> it is unmounted. If root filesystem is ext4 the page cannot be migrated forever.
>>> And also the journaling data for the superblock cannot be migreated.
>>>
>>> I introduce a new API for allocating page cache from non-movable area.
>>> It is useful for ext4/ext3 and others that want to hold page cache for a long time.
>>
>> All seems reasonable to me. The additional overhead in buffer.c from
>> additional function arguments is regrettable but I don't see a
>> non-hacky alternative.
>>
>> One vital question which the changelog doesn't really address (it
>> should): how important is this patch? Is your test system presently
>> "completely dead in the water utterly unusable" or "occasionally not
>> quite as good as it could be". Somewhere in between?
> I would be also interested in how much these patches make things better.
> Because I would expect all metadata that is currently journalled to be
> unmovable as well.
>
> Honza
>
I'm sorry for the lack of detail.

My test platform has 1GB of memory in total: 256MB for CMA and 768MB normal.
I applied Joonsoo's patch https://lkml.org/lkml/2014/5/28/64, so that
3/4 of allocations take place in the normal area and 1/4 in the CMA area.

My platform has 4 ext4 partitions, and each ext4 partition holds 2 page caches
for its superblock; those are what this patch set moves out of the CMA area.
So there are 8 page caches (8 pages) that can prevent page migration.

My test scenario tries to consume the whole CMA area by repeating 16MB
allocations until the CMA area is exhausted (a rough sketch of the loop is
below). In most cases 2 of those pages end up in the CMA area and one of the
16 allocation attempts fails; it is rare for every allocation to succeed.
With this patch applied, no page cache is allocated from the CMA area and
every allocation succeeds.
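The following is only an illustrative sketch of that loop, assuming a device
whose dma_alloc_coherent() allocations are served from the CMA region; the
function and symbol names here are made up for this example:

#include <linux/kernel.h>
#include <linux/device.h>
#include <linux/dma-mapping.h>

#define CHUNK_SZ	(16UL << 20)	/* 16MB per attempt */
#define NR_CHUNKS	16		/* 16 x 16MB = 256MB of CMA */

/* Try to take the whole CMA area in 16MB chunks and count failures. */
static void cma_exhaust_test(struct device *dev)
{
	void *buf[NR_CHUNKS];
	dma_addr_t handle[NR_CHUNKS];
	int i, fail = 0;

	for (i = 0; i < NR_CHUNKS; i++) {
		buf[i] = dma_alloc_coherent(dev, CHUNK_SZ, &handle[i],
					    GFP_KERNEL);
		if (!buf[i])
			fail++;
	}
	pr_info("CMA test: %d of %d 16MB allocations failed\n",
		fail, NR_CHUNKS);

	for (i = 0; i < NR_CHUNKS; i++)
		if (buf[i])
			dma_free_coherent(dev, CHUNK_SZ, buf[i], handle[i]);
}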
Please let me know if you need any more information.
On 2014-08-15 6:22 AM, Andrew Morton wrote:
> On Thu, 14 Aug 2014 14:15:40 +0900 Gioh Kim <[email protected]> wrote:
>
>> A buffer cache is allocated from movable area
>> because it is referred for a while and released soon.
>> But some filesystems are taking buffer cache for a long time
>> and it can disturb page migration.
>>
>> A new API should be introduced to allocate buffer cache from
>> non-movable area.
>
> I think the API could and should be more flexible than this.
>
> Rather than making the API be "movable or not movable", let's permit
> callers to specify the gfp_t and leave it at that. That way, if
> someone later wants to allocate a buffer head with, I dunno,
> __GFP_NOTRACK then they can do so.
>
> So the word "movable" shouldn't appear in buffer.c at all, except in a
> single place.
I absolutely agree with you.
If the filesystem developers agree with this patch, I will send a 2nd version
that applies your ideas.
Thank you for your advice.
>
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -993,7 +993,7 @@ init_page_buffers(struct page *page, struct block_device *bdev,
>> */
>> static int
>> grow_dev_page(struct block_device *bdev, sector_t block,
>> - pgoff_t index, int size, int sizebits)
>> + pgoff_t index, int size, int sizebits, gfp_t movable_mask)
>
> s/movable_mask/gfp/
I got it.
>
>> {
>> struct inode *inode = bdev->bd_inode;
>> struct page *page;
>> @@ -1003,7 +1003,8 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>> gfp_t gfp_mask;
>>
>> gfp_mask = mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS;
>> - gfp_mask |= __GFP_MOVABLE;
>> + if (movable_mask & __GFP_MOVABLE)
>> + gfp_mask |= __GFP_MOVABLE;
>
> This becomes
>
> gfp_mask |= gfp;
I got it.
>
>> /*
>> * XXX: __getblk_slow() can not really deal with failure and
>> * will endlessly loop on improvised global reclaim. Prefer
>> @@ -1058,7 +1059,8 @@ failed:
>> * that page was dirty, the buffers are set dirty also.
>> */
>> static int
>> -grow_buffers(struct block_device *bdev, sector_t block, int size)
>> +grow_buffers(struct block_device *bdev, sector_t block,
>> + int size, gfp_t movable_mask)
>
> gfp
>
>> {
>> pgoff_t index;
>> int sizebits;
>> @@ -1085,11 +1087,12 @@ grow_buffers(struct block_device *bdev, sector_t block, int size)
>> }
>>
>> /* Create a page with the proper size buffers.. */
>> - return grow_dev_page(bdev, block, index, size, sizebits);
>> + return grow_dev_page(bdev, block, index, size, sizebits, movable_mask);
>> }
>>
>> static struct buffer_head *
>> -__getblk_slow(struct block_device *bdev, sector_t block, int size)
>> +__getblk_slow(struct block_device *bdev, sector_t block,
>> + int size, gfp_t movable_mask)
>
> gfp
>
>> {
>> /* Size must be multiple of hard sectorsize */
>> if (unlikely(size & (bdev_logical_block_size(bdev)-1) ||
>> @@ -1111,7 +1114,7 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
>> if (bh)
>> return bh;
>>
>> - ret = grow_buffers(bdev, block, size);
>> + ret = grow_buffers(bdev, block, size, movable_mask);
>
> gfp
>
>> if (ret < 0)
>> return NULL;
>> if (ret == 0)
>> @@ -1385,11 +1388,34 @@ __getblk(struct block_device *bdev, sector_t block, unsigned size)
>>
>> might_sleep();
>> if (bh == NULL)
>> - bh = __getblk_slow(bdev, block, size);
>> + bh = __getblk_slow(bdev, block, size, __GFP_MOVABLE);
>
> Here is the place where buffer.c. mentions "movable".
I got it.
>
>> return bh;
>> }
>> EXPORT_SYMBOL(__getblk);
>>
>> + /*
>> + * __getblk_nonmovable will locate (and, if necessary, create) the buffer_head
>> + * which corresponds to the passed block_device, block and size. The
>> + * returned buffer has its reference count incremented.
>> + *
>> + * The page cache is allocated from non-movable area
>> + * not to prevent page migration.
>> + *
>> + * __getblk()_nonmovable will lock up the machine
>> + * if grow_dev_page's try_to_free_buffers() attempt is failing. FIXME, perhaps?
>> + */
>> +struct buffer_head *
>> +__getblk_nonmovable(struct block_device *bdev, sector_t block, unsigned size)
>> +{
>> + struct buffer_head *bh = __find_get_block(bdev, block, size);
>> +
>> + might_sleep();
>> + if (bh == NULL)
>> + bh = __getblk_slow(bdev, block, size, 0);
>> + return bh;
>> +}
>> +EXPORT_SYMBOL(__getblk_nonmovable);
>
> Suggest this be called __getblk_gfp(bdev, block, size, gfp) and then
> __getblk() be changed to call __getblk_gfp(..., __GFP_MOVABLE).
>
> We could then write a __getblk_nonmovable() which calls __getblk_gfp()
> (a static inlined one-line function) or we can just call
> __getblk_gfp(..., 0) directly from filesystems.
I got it.
>
>> @@ -1423,6 +1450,28 @@ __bread(struct block_device *bdev, sector_t block, unsigned size)
>> }
>> EXPORT_SYMBOL(__bread);
>>
>> +/**
>> + * __bread_nonmovable() - reads a specified block and returns the bh
>> + * @bdev: the block_device to read from
>> + * @block: number of block
>> + * @size: size (in bytes) to read
>> + *
>> + * Reads a specified block, and returns buffer head that contains it.
>> + * The page cache is allocated from non-movable area
>> + * not to prevent page migration.
>> + * It returns NULL if the block was unreadable.
>> + */
>> +struct buffer_head *
>> +__bread_nonmovable(struct block_device *bdev, sector_t block, unsigned size)
>> +{
>> + struct buffer_head *bh = __getblk_slow(bdev, block, size, 0);
>> +
>> + if (likely(bh) && !buffer_uptodate(bh))
>> + bh = __bread_slow(bh);
>> + return bh;
>> +}
>> +EXPORT_SYMBOL(__bread_nonmovable);
>
> Treat this in the same fashion as __getblk_nonmovable().
I got it.
>
>
>
On Mon, Aug 18, 2014 at 10:15:32AM +0900, Gioh Kim wrote:
>
> My test platform has totally 1GB memory, 256MB for CMA and 768MB for normal.
> I applied Joonsoo's patch: https://lkml.org/lkml/2014/5/28/64, so that
> 3/4 of allocation take place in normal area and 1/4 allocation take place in CMA area.
>
> And my platform has 4 ext4 partitions. Each ext4 partition has 2 page caches for superblock that
> are what this patch tries to move to out of CMA area.
> Therefore there are 8 page caches (8 pages size) that can prevent page migration.
Yes, but are you actually *using* the ext4 partitions for anything?
If this is a realistic real world use case, file systems are used to
store, well, files, and that means there will be inodes and dentry
cache entries that will also be allocated. Does your test scenario
reflect real world usage?
Cheers,
- Ted
On 2014-08-18 12:24 PM, Theodore Ts'o wrote:
> On Mon, Aug 18, 2014 at 10:15:32AM +0900, Gioh Kim wrote:
>>
>> My test platform has totally 1GB memory, 256MB for CMA and 768MB for normal.
>> I applied Joonsoo's patch: https://lkml.org/lkml/2014/5/28/64, so that
>> 3/4 of allocation take place in normal area and 1/4 allocation take place in CMA area.
>>
>> And my platform has 4 ext4 partitions. Each ext4 partition has 2 page caches for superblock that
>> are what this patch tries to move to out of CMA area.
>> Therefore there are 8 page caches (8 pages size) that can prevent page migration.
>
> Yes, but are you actually *using* the ext4 partitions for anything?
> If this is a realistic real world use case, file systems are used to
> store, well, files, and that means there will be inodes and dentry
> cache entries that will also be allocated. Does your test scenario
> reflect real world usage?
Yes. I'm working for LG Electronics.
My test platform is a product currently selling in the market,
and I also test my patch while the platform runs as if a real user were using it.
I think the page caches for inodes and dentries are held only for a short time;
I can see pairs of get_bh and put_bh in the inode/dentry handling.
I think inodes are allocated with kmem_cache_alloc() in ext4_alloc_inode(),
which is a non-movable allocation.
>
> Cheers,
>
> - Ted
>
On Mon 18-08-14 13:44:38, Gioh Kim wrote:
>
>
> On 2014-08-18 12:24 PM, Theodore Ts'o wrote:
> >On Mon, Aug 18, 2014 at 10:15:32AM +0900, Gioh Kim wrote:
> >>
> >>My test platform has totally 1GB memory, 256MB for CMA and 768MB for normal.
> >>I applied Joonsoo's patch: https://lkml.org/lkml/2014/5/28/64, so that
> >>3/4 of allocation take place in normal area and 1/4 allocation take place in CMA area.
> >>
> >>And my platform has 4 ext4 partitions. Each ext4 partition has 2 page caches for superblock that
> >>are what this patch tries to move to out of CMA area.
> >>Therefore there are 8 page caches (8 pages size) that can prevent page migration.
> >
> >Yes, but are you actually *using* the ext4 partitions for anything?
> >If this is a realistic real world use case, file systems are used to
> >store, well, files, and that means there will be inodes and dentry
> >cache entries that will also be allocated. Does your test scenario
> >reflect real world usage?
>
> Yes. I'm working for LG Electronics.
> My test platform is currently selling item in the market.
> And also I test my patch when my platform is working as if real user uses it.
Great, this is exactly what I was looking for. So if it really makes
your use case work, I don't have objections to your solution (after you
change what Andrew suggested).
> I think the page caches of the inodes and dentry are held for short time.
> I can see pairs of get_bh and put_bh in inodes/dentry handling.
>
> I think inodes is allocated by kmem_cache_alloc in ext4_alloc_inode().
> It is non-movable area allocation.
I guess the hold time depends on what storage you use in your product
(flash?) and how loaded it is.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, 18 Aug 2014 14:11:46 +0200 Jan Kara <[email protected]> wrote:
> > >Yes, but are you actually *using* the ext4 partitions for anything?
> > >If this is a realistic real world use case, file systems are used to
> > >store, well, files, and that means there will be inodes and dentry
> > >cache entries that will also be allocated. Does your test scenario
> > >reflect real world usage?
> >
> > Yes. I'm working for LG Electronics.
> > My test platform is currently selling item in the market.
> > And also I test my patch when my platform is working as if real user uses it.
> Great, this is exactly what I was looking for. So if it really makes
> your usecase work I don't have objections to your solution (after you
> change what Andrew suggested).
Yup. This sort of real-world usage information is terribly important,
and the more you can tell us the better. It helps us all to allocate
our time to the places where it is most needed...