There has been a lot of discussion recently about supporting devices and
filesystems with bs > ps. One of the main pieces of plumbing needed for
buffered IO is a minimum order when allocating folios in the page cache.
Hannes recently sent a series[1] in which he deduces the minimum folio
order from i_blkbits in struct inode. Based on the discussion in that
thread, this series takes a different approach: the minimum and maximum
folio order can be set individually per inode.
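For readers who have not seen the pagemap.h patch, the idea is that each
inode's mapping records a minimum and a maximum allowed folio order, and the
filemap/readahead allocation paths clamp the order they would otherwise pick
into that range. A minimal, self-contained sketch of that clamping in plain
userspace C follows; every name in it is illustrative, none of them are the
helpers added by the actual patch:

#include <stdio.h>

/*
 * Illustrative model only: the real patch keeps the orders in the
 * address_space itself, not in a separate struct like this.
 */
struct mapping_order_limits {
        unsigned int min_order; /* e.g. i_blkbits - PAGE_SHIFT when bs > ps */
        unsigned int max_order; /* e.g. the largest order the fs can handle */
};

/* Clamp the folio order the page cache would otherwise allocate. */
static unsigned int clamp_folio_order(const struct mapping_order_limits *l,
                                      unsigned int wanted)
{
        if (wanted < l->min_order)
                return l->min_order;
        if (wanted > l->max_order)
                return l->max_order;
        return wanted;
}

int main(void)
{
        /* 16k logical blocks on 4k pages -> minimum order 2 */
        struct mapping_order_limits l = { .min_order = 2, .max_order = 8 };

        printf("order-0 request -> order %u\n", clamp_folio_order(&l, 0)); /* 2 */
        printf("order-5 request -> order %u\n", clamp_folio_order(&l, 5)); /* 5 */
        return 0;
}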
This series is based on top of Christoph's patches that add iomap aops
for the block device cache[2]. I rebased his remaining patches onto
next-20230621; the whole tree can be found here[3].
With the tree compiled with CONFIG_BUFFER_HEAD=n, I am able to do buffered
IO on an NVMe drive with bs > ps in QEMU without any issues:
[root@archlinux ~]# cat /sys/block/nvme0n2/queue/logical_block_size
16384
[root@archlinux ~]# fio -bs=16k -iodepth=8 -rw=write -ioengine=io_uring -size=500M
-name=io_uring_1 -filename=/dev/nvme0n2 -verify=md5
io_uring_1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=io_uring, iodepth=8
fio-3.34
Starting 1 process
Jobs: 1 (f=1): [V(1)][100.0%][r=336MiB/s][r=21.5k IOPS][eta 00m:00s]
io_uring_1: (groupid=0, jobs=1): err= 0: pid=285: Wed Jun 21 07:58:29 2023
read: IOPS=27.3k, BW=426MiB/s (447MB/s)(500MiB/1174msec)
<snip>
Run status group 0 (all jobs):
READ: bw=426MiB/s (447MB/s), 426MiB/s-426MiB/s (447MB/s-447MB/s), io=500MiB (524MB), run=1174-1174msec
WRITE: bw=198MiB/s (207MB/s), 198MiB/s-198MiB/s (207MB/s-207MB/s), io=500MiB (524MB), run=2527-2527msec
Disk stats (read/write):
nvme0n2: ios=35614/4297, merge=0/0, ticks=11283/1441, in_queue=12725, util=96.27%
One of the main dependencies for working on a block device with bs > ps is
Christoph's work converting the block device aops to use iomap.
[1] https://lwn.net/Articles/934651/
[2] https://lwn.net/ml/linux-kernel/[email protected]/
[3] https://github.com/Panky-codes/linux/tree/next-20230523-filemap-order-generic-v1
Luis Chamberlain (1):
block: set mapping order for the block cache in set_init_blocksize
Matthew Wilcox (Oracle) (1):
fs: Allow fine-grained control of folio sizes
Pankaj Raghav (2):
filemap: use minimum order while allocating folios
nvme: enable logical block size > PAGE_SIZE
block/bdev.c | 9 ++++++++
drivers/nvme/host/core.c | 2 +-
include/linux/pagemap.h | 46 ++++++++++++++++++++++++++++++++++++----
mm/filemap.c | 9 +++++---
mm/readahead.c | 34 ++++++++++++++++++++---------
5 files changed, 82 insertions(+), 18 deletions(-)
--
2.39.2
From: Luis Chamberlain <[email protected]>
Automatically set the minimum mapping order for block devices in
set_init_blocksize(). The mapping order is set only when the block
device uses iomap-based aops.
Signed-off-by: Pankaj Raghav <[email protected]>
Signed-off-by: Luis Chamberlain <[email protected]>
---
block/bdev.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/block/bdev.c b/block/bdev.c
index 9bb54d9d02a6..db8cede8a320 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -126,6 +126,7 @@ static void set_init_blocksize(struct block_device *bdev)
{
unsigned int bsize = bdev_logical_block_size(bdev);
loff_t size = i_size_read(bdev->bd_inode);
+ int order, folio_order;
while (bsize < PAGE_SIZE) {
if (size & bsize)
@@ -133,6 +134,14 @@ static void set_init_blocksize(struct block_device *bdev)
bsize <<= 1;
}
bdev->bd_inode->i_blkbits = blksize_bits(bsize);
+ order = bdev->bd_inode->i_blkbits - PAGE_SHIFT;
+ folio_order = mapping_min_folio_order(bdev->bd_inode->i_mapping);
+
+ if (!IS_ENABLED(CONFIG_BUFFER_HEAD)) {
+ /* Do not allow changing the folio order after it is set */
+ WARN_ON_ONCE(folio_order && (folio_order != order));
+ mapping_set_folio_orders(bdev->bd_inode->i_mapping, order, 31);
+ }
}
int set_blocksize(struct block_device *bdev, int size)
--
2.39.2
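To make the order arithmetic in the hunk above concrete, here is a small
standalone illustration (plain userspace C, not kernel code): with 4 KiB
pages and the 16 KiB logical block size used in the fio run from the cover
letter, i_blkbits is 14 and the minimum folio order works out to 2.

#include <stdio.h>

int main(void)
{
        unsigned int page_shift = 12;   /* 4 KiB pages */
        unsigned int blkbits    = 14;   /* 16 KiB logical block size */
        unsigned int order      = blkbits - page_shift;

        /* order 2 -> folios of 4 pages = 16 KiB, matching the block size */
        printf("min folio order = %u (%u KiB folios)\n",
               order, (1u << (order + page_shift)) / 1024);
        return 0;
}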
On 6/21/23 10:38, Pankaj Raghav wrote:
> There has been a lot of discussion recently to support devices and fs for
> bs > ps. One of the main plumbing to support buffered IO is to have a minimum
> order while allocating folios in the page cache.
>
> Hannes sent recently a series[1] where he deduces the minimum folio
> order based on the i_blkbits in struct inode. This takes a different
> approach based on the discussion in that thread where the minimum and
> maximum folio order can be set individually per inode.
>
> This series is based on top of Christoph's patches to have iomap aops
> for the block cache[2]. I rebased his remaining patches to
> next-20230621. The whole tree can be found here[3].
>
> Compiling the tree with CONFIG_BUFFER_HEAD=n, I am able to do a buffered
> IO on a nvme drive with bs>ps in QEMU without any issues:
>
> [root@archlinux ~]# cat /sys/block/nvme0n2/queue/logical_block_size
> 16384
> [root@archlinux ~]# fio -bs=16k -iodepth=8 -rw=write -ioengine=io_uring -size=500M
> -name=io_uring_1 -filename=/dev/nvme0n2 -verify=md5
> io_uring_1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=io_uring, iodepth=8
> fio-3.34
> Starting 1 process
> Jobs: 1 (f=1): [V(1)][100.0%][r=336MiB/s][r=21.5k IOPS][eta 00m:00s]
> io_uring_1: (groupid=0, jobs=1): err= 0: pid=285: Wed Jun 21 07:58:29 2023
> read: IOPS=27.3k, BW=426MiB/s (447MB/s)(500MiB/1174msec)
> <snip>
> Run status group 0 (all jobs):
> READ: bw=426MiB/s (447MB/s), 426MiB/s-426MiB/s (447MB/s-447MB/s), io=500MiB (524MB), run=1174-1174msec
> WRITE: bw=198MiB/s (207MB/s), 198MiB/s-198MiB/s (207MB/s-207MB/s), io=500MiB (524MB), run=2527-2527msec
>
> Disk stats (read/write):
> nvme0n2: ios=35614/4297, merge=0/0, ticks=11283/1441, in_queue=12725, util=96.27%
>
> One of the main dependency to work on a block device with bs>ps is
> Christoph's work on converting block device aops to use iomap.
>
> [1] https://lwn.net/Articles/934651/
> [2] https://lwn.net/ml/linux-kernel/[email protected]/
> [3] https://github.com/Panky-codes/linux/tree/next-20230523-filemap-order-generic-v1
>
> Luis Chamberlain (1):
> block: set mapping order for the block cache in set_init_blocksize
>
> Matthew Wilcox (Oracle) (1):
> fs: Allow fine-grained control of folio sizes
>
> Pankaj Raghav (2):
> filemap: use minimum order while allocating folios
> nvme: enable logical block size > PAGE_SIZE
>
> block/bdev.c | 9 ++++++++
> drivers/nvme/host/core.c | 2 +-
> include/linux/pagemap.h | 46 ++++++++++++++++++++++++++++++++++++----
> mm/filemap.c | 9 +++++---
> mm/readahead.c | 34 ++++++++++++++++++++---------
> 5 files changed, 82 insertions(+), 18 deletions(-)
>
Hmm. Most unfortunate; I've just finished my own patchset (duplicating
much of this work) to get 'brd' running with large folios.
And it even works this time: 'fsx' from the xfstests suite runs happily
on that.
Guess we'll need to reconcile our patches.
Cheers,
Hannes
On 6/21/23 10:38, Pankaj Raghav wrote:
> From: Luis Chamberlain <[email protected]>
>
> Automatically set the minimum mapping order for block devices in
> set_init_blocksize(). The mapping order will be set only when the block
> device uses iomap based aops.
>
> Signed-off-by: Pankaj Raghav <[email protected]>
> Signed-off-by: Luis Chamberlain <[email protected]>
> ---
> block/bdev.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/block/bdev.c b/block/bdev.c
> index 9bb54d9d02a6..db8cede8a320 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -126,6 +126,7 @@ static void set_init_blocksize(struct block_device *bdev)
> {
> unsigned int bsize = bdev_logical_block_size(bdev);
> loff_t size = i_size_read(bdev->bd_inode);
> + int order, folio_order;
>
> while (bsize < PAGE_SIZE) {
> if (size & bsize)
> @@ -133,6 +134,14 @@ static void set_init_blocksize(struct block_device *bdev)
> bsize <<= 1;
> }
> bdev->bd_inode->i_blkbits = blksize_bits(bsize);
> + order = bdev->bd_inode->i_blkbits - PAGE_SHIFT;
> + folio_order = mapping_min_folio_order(bdev->bd_inode->i_mapping);
> +
> + if (!IS_ENABLED(CONFIG_BUFFER_HEAD)) {
> + /* Do not allow changing the folio order after it is set */
> + WARN_ON_ONCE(folio_order && (folio_order != order));
> + mapping_set_folio_orders(bdev->bd_inode->i_mapping, order, 31);
> + }
> }
>
> int set_blocksize(struct block_device *bdev, int size)
This really has nothing to do with buffer heads.
In fact, I've got a patchset to make it work _with_ buffer heads.
So please, don't make it conditional on CONFIG_BUFFER_HEAD.
And we should be calling into 'mapping_set_folio_order()' only if the
'order' argument is larger than PAGE_ORDER; otherwise we end up enabling
large folio support for _every_ block device, which I doubt we want.
Cheers,
Hannes
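For concreteness, here is a sketch (not taken from any posted patch) of how
the suggestion above might look if folded into the set_init_blocksize() hunk:
the identifiers reuse the names from the patch, and the order > 0 guard is
the illustrative change, so bs <= ps block devices keep their current
behaviour.

        order = bdev->bd_inode->i_blkbits - PAGE_SHIFT;
        folio_order = mapping_min_folio_order(bdev->bd_inode->i_mapping);

        /*
         * Only opt the bdev mapping into large folios when the logical
         * block size is actually bigger than PAGE_SIZE; plain 4k-block
         * devices stay on order-0 folios.
         */
        if (order > 0) {
                /* Do not allow changing the folio order after it is set */
                WARN_ON_ONCE(folio_order && (folio_order != order));
                mapping_set_folio_orders(bdev->bd_inode->i_mapping, order, 31);
        }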
On 6/21/23 12:42, Pankaj Raghav wrote:
>>> bdev->bd_inode->i_blkbits = blksize_bits(bsize);
>>> + order = bdev->bd_inode->i_blkbits - PAGE_SHIFT;
>>> + folio_order = mapping_min_folio_order(bdev->bd_inode->i_mapping);
>>> +
>>> + if (!IS_ENABLED(CONFIG_BUFFER_HEAD)) {
>>> + /* Do not allow changing the folio order after it is set */
>>> + WARN_ON_ONCE(folio_order && (folio_order != order));
>>> + mapping_set_folio_orders(bdev->bd_inode->i_mapping, order, 31);
>>> + }
>>> }
>>> int set_blocksize(struct block_device *bdev, int size)
>> This really has nothing to do with buffer heads.
>>
>> In fact, I've got a patchset to make it work _with_ buffer heads.
>>
>> So please, don't make it conditional on CONFIG_BUFFER_HEAD.
>>
>> And we should be calling into 'mapping_set_folio_order()' only if the 'order' argument is larger
>> than PAGE_ORDER, otherwise we end up enabling
>> large folio support for _every_ block device.
>> Which I doubt we want.
>>
>
> Hmm, which aops are you using for the block device? If you are using the old aops, then we will be
> using helpers from buffer.c and mpage.c which do not support large folios. I am getting a BUG_ON
> when I don't use iomap based aops for the block device:
>
I know. I haven't said that mpage.c / buffer.c support large folios
_now_. All I'm saying is that I have a patchset enabling it to support
large folios :-)
Cheers,
Hannes
>> bdev->bd_inode->i_blkbits = blksize_bits(bsize);
>> + order = bdev->bd_inode->i_blkbits - PAGE_SHIFT;
>> + folio_order = mapping_min_folio_order(bdev->bd_inode->i_mapping);
>> +
>> + if (!IS_ENABLED(CONFIG_BUFFER_HEAD)) {
>> + /* Do not allow changing the folio order after it is set */
>> + WARN_ON_ONCE(folio_order && (folio_order != order));
>> + mapping_set_folio_orders(bdev->bd_inode->i_mapping, order, 31);
>> + }
>> }
>> int set_blocksize(struct block_device *bdev, int size)
> This really has nothing to do with buffer heads.
>
> In fact, I've got a patchset to make it work _with_ buffer heads.
>
> So please, don't make it conditional on CONFIG_BUFFER_HEAD.
>
> And we should be calling into 'mapping_set_folio_order()' only if the 'order' argument is larger
> than PAGE_ORDER, otherwise we end up enabling
> large folio support for _every_ block device.
> Which I doubt we want.
>
Hmm, which aops are you using for the block device? If you are using the old aops, then we will be
using helpers from buffer.c and mpage.c, which do not support large folios. I am getting a BUG_ON
when I don't use the iomap-based aops for the block device:
[ 11.596239] kernel BUG at fs/buffer.c:2384!
[ 11.596609] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[ 11.597064] CPU: 3 PID: 10 Comm: kworker/u8:0 Not tainted
6.4.0-rc7-next-20230621-00010-g87171074c649-dirty #183
[ 11.597934] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 11.598882] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[ 11.599370] RIP: 0010:block_read_full_folio+0x70d/0x8f0
Let me know what you think!
> Cheers,
>
> Hannes
>
>
>>
>> Hmm, which aops are you using for the block device? If you are using the old aops, then we will be
>> using helpers from buffer.c and mpage.c which do not support large folios. I am getting a BUG_ON
>> when I don't use iomap based aops for the block device:
>>
> I know. I haven't said that mpage.c / buffer.c support large folios _now_. All I'm saying is that I
> have a patchset enabling it to support large folios :-)
>
Ah ok! I thought we were not going that route, based on the discussion we had at LSF.
> Cheers,
>
> Hannes
>
On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
> On 6/21/23 10:38, Pankaj Raghav wrote:
> > There has been a lot of discussion recently to support devices and fs for
> > bs > ps. One of the main plumbing to support buffered IO is to have a minimum
> > order while allocating folios in the page cache.
> >
> > Hannes sent recently a series[1] where he deduces the minimum folio
> > order based on the i_blkbits in struct inode. This takes a different
> > approach based on the discussion in that thread where the minimum and
> > maximum folio order can be set individually per inode.
> >
> > This series is based on top of Christoph's patches to have iomap aops
> > for the block cache[2]. I rebased his remaining patches to
> > next-20230621. The whole tree can be found here[3].
> >
> > Compiling the tree with CONFIG_BUFFER_HEAD=n, I am able to do a buffered
> > IO on a nvme drive with bs>ps in QEMU without any issues:
> >
> > [root@archlinux ~]# cat /sys/block/nvme0n2/queue/logical_block_size
> > 16384
> > [root@archlinux ~]# fio -bs=16k -iodepth=8 -rw=write -ioengine=io_uring -size=500M
> > -name=io_uring_1 -filename=/dev/nvme0n2 -verify=md5
> > io_uring_1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=io_uring, iodepth=8
> > fio-3.34
> > Starting 1 process
> > Jobs: 1 (f=1): [V(1)][100.0%][r=336MiB/s][r=21.5k IOPS][eta 00m:00s]
> > io_uring_1: (groupid=0, jobs=1): err= 0: pid=285: Wed Jun 21 07:58:29 2023
> > read: IOPS=27.3k, BW=426MiB/s (447MB/s)(500MiB/1174msec)
> > <snip>
> > Run status group 0 (all jobs):
> > READ: bw=426MiB/s (447MB/s), 426MiB/s-426MiB/s (447MB/s-447MB/s), io=500MiB (524MB), run=1174-1174msec
> > WRITE: bw=198MiB/s (207MB/s), 198MiB/s-198MiB/s (207MB/s-207MB/s), io=500MiB (524MB), run=2527-2527msec
> >
> > Disk stats (read/write):
> > nvme0n2: ios=35614/4297, merge=0/0, ticks=11283/1441, in_queue=12725, util=96.27%
> >
> > One of the main dependency to work on a block device with bs>ps is
> > Christoph's work on converting block device aops to use iomap.
> >
> > [1] https://lwn.net/Articles/934651/
> > [2] https://lwn.net/ml/linux-kernel/[email protected]/
> > [3] https://github.com/Panky-codes/linux/tree/next-20230523-filemap-order-generic-v1
> >
> > Luis Chamberlain (1):
> > block: set mapping order for the block cache in set_init_blocksize
> >
> > Matthew Wilcox (Oracle) (1):
> > fs: Allow fine-grained control of folio sizes
> >
> > Pankaj Raghav (2):
> > filemap: use minimum order while allocating folios
> > nvme: enable logical block size > PAGE_SIZE
> >
> > block/bdev.c | 9 ++++++++
> > drivers/nvme/host/core.c | 2 +-
> > include/linux/pagemap.h | 46 ++++++++++++++++++++++++++++++++++++----
> > mm/filemap.c | 9 +++++---
> > mm/readahead.c | 34 ++++++++++++++++++++---------
> > 5 files changed, 82 insertions(+), 18 deletions(-)
> >
>
> Hmm. Most unfortunate; I've just finished my own patchset (duplicating much
> of this work) to get 'brd' running with large folios.
> And it even works this time, 'fsx' from the xfstest suite runs happily on
> that.
So you've converted a filesystem to use bs > ps, too? Or is the
filesystem that fsx is running on just using normal 4kB block size?
If the latter, then fsx is not actually testing the large folio page
cache support, it's mostly just doing 4kB aligned IO to brd....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 6/22/23 00:07, Dave Chinner wrote:
> On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
>> On 6/21/23 10:38, Pankaj Raghav wrote:
>>> There has been a lot of discussion recently to support devices and fs for
>>> bs > ps. One of the main plumbing to support buffered IO is to have a minimum
>>> order while allocating folios in the page cache.
>>>
>>> Hannes sent recently a series[1] where he deduces the minimum folio
>>> order based on the i_blkbits in struct inode. This takes a different
>>> approach based on the discussion in that thread where the minimum and
>>> maximum folio order can be set individually per inode.
>>>
>>> This series is based on top of Christoph's patches to have iomap aops
>>> for the block cache[2]. I rebased his remaining patches to
>>> next-20230621. The whole tree can be found here[3].
>>>
>>> Compiling the tree with CONFIG_BUFFER_HEAD=n, I am able to do a buffered
>>> IO on a nvme drive with bs>ps in QEMU without any issues:
>>>
>>> [root@archlinux ~]# cat /sys/block/nvme0n2/queue/logical_block_size
>>> 16384
>>> [root@archlinux ~]# fio -bs=16k -iodepth=8 -rw=write -ioengine=io_uring -size=500M
>>> -name=io_uring_1 -filename=/dev/nvme0n2 -verify=md5
>>> io_uring_1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=io_uring, iodepth=8
>>> fio-3.34
>>> Starting 1 process
>>> Jobs: 1 (f=1): [V(1)][100.0%][r=336MiB/s][r=21.5k IOPS][eta 00m:00s]
>>> io_uring_1: (groupid=0, jobs=1): err= 0: pid=285: Wed Jun 21 07:58:29 2023
>>> read: IOPS=27.3k, BW=426MiB/s (447MB/s)(500MiB/1174msec)
>>> <snip>
>>> Run status group 0 (all jobs):
>>> READ: bw=426MiB/s (447MB/s), 426MiB/s-426MiB/s (447MB/s-447MB/s), io=500MiB (524MB), run=1174-1174msec
>>> WRITE: bw=198MiB/s (207MB/s), 198MiB/s-198MiB/s (207MB/s-207MB/s), io=500MiB (524MB), run=2527-2527msec
>>>
>>> Disk stats (read/write):
>>> nvme0n2: ios=35614/4297, merge=0/0, ticks=11283/1441, in_queue=12725, util=96.27%
>>>
>>> One of the main dependency to work on a block device with bs>ps is
>>> Christoph's work on converting block device aops to use iomap.
>>>
>>> [1] https://lwn.net/Articles/934651/
>>> [2] https://lwn.net/ml/linux-kernel/[email protected]/
>>> [3] https://github.com/Panky-codes/linux/tree/next-20230523-filemap-order-generic-v1
>>>
>>> Luis Chamberlain (1):
>>> block: set mapping order for the block cache in set_init_blocksize
>>>
>>> Matthew Wilcox (Oracle) (1):
>>> fs: Allow fine-grained control of folio sizes
>>>
>>> Pankaj Raghav (2):
>>> filemap: use minimum order while allocating folios
>>> nvme: enable logical block size > PAGE_SIZE
>>>
>>> block/bdev.c | 9 ++++++++
>>> drivers/nvme/host/core.c | 2 +-
>>> include/linux/pagemap.h | 46 ++++++++++++++++++++++++++++++++++++----
>>> mm/filemap.c | 9 +++++---
>>> mm/readahead.c | 34 ++++++++++++++++++++---------
>>> 5 files changed, 82 insertions(+), 18 deletions(-)
>>>
>>
>> Hmm. Most unfortunate; I've just finished my own patchset (duplicating much
>> of this work) to get 'brd' running with large folios.
>> And it even works this time, 'fsx' from the xfstest suite runs happily on
>> that.
>
> So you've converted a filesystem to use bs > ps, too? Or is the
> filesystem that fsx is running on just using normal 4kB block size?
> If the latter, then fsx is not actually testing the large folio page
> cache support, it's mostly just doing 4kB aligned IO to brd....
>
I have been running fsx on an xfs with bs=16k, and it worked like a charm.
I'll try to run the xfstests suite once I'm finished with merging
Pankaj's patches into my patchset.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
On 6/22/23 07:51, Hannes Reinecke wrote:
> On 6/22/23 00:07, Dave Chinner wrote:
>> On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
>>> On 6/21/23 10:38, Pankaj Raghav wrote:
>>>> There has been a lot of discussion recently to support devices and
>>>> fs for
>>>> bs > ps. One of the main plumbing to support buffered IO is to have
>>>> a minimum
>>>> order while allocating folios in the page cache.
>>>>
>>>> Hannes sent recently a series[1] where he deduces the minimum folio
>>>> order based on the i_blkbits in struct inode. This takes a different
>>>> approach based on the discussion in that thread where the minimum and
>>>> maximum folio order can be set individually per inode.
>>>>
>>>> This series is based on top of Christoph's patches to have iomap aops
>>>> for the block cache[2]. I rebased his remaining patches to
>>>> next-20230621. The whole tree can be found here[3].
>>>>
>>>> Compiling the tree with CONFIG_BUFFER_HEAD=n, I am able to do a
>>>> buffered
>>>> IO on a nvme drive with bs>ps in QEMU without any issues:
>>>>
>>>> [root@archlinux ~]# cat /sys/block/nvme0n2/queue/logical_block_size
>>>> 16384
>>>> [root@archlinux ~]# fio -bs=16k -iodepth=8 -rw=write
>>>> -ioengine=io_uring -size=500M
>>>> -name=io_uring_1 -filename=/dev/nvme0n2 -verify=md5
>>>> io_uring_1: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W)
>>>> 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=io_uring, iodepth=8
>>>> fio-3.34
>>>> Starting 1 process
>>>> Jobs: 1 (f=1): [V(1)][100.0%][r=336MiB/s][r=21.5k IOPS][eta 00m:00s]
>>>> io_uring_1: (groupid=0, jobs=1): err= 0: pid=285: Wed Jun 21
>>>> 07:58:29 2023
>>>> read: IOPS=27.3k, BW=426MiB/s (447MB/s)(500MiB/1174msec)
>>>> <snip>
>>>> Run status group 0 (all jobs):
>>>> READ: bw=426MiB/s (447MB/s), 426MiB/s-426MiB/s
>>>> (447MB/s-447MB/s), io=500MiB (524MB), run=1174-1174msec
>>>> WRITE: bw=198MiB/s (207MB/s), 198MiB/s-198MiB/s
>>>> (207MB/s-207MB/s), io=500MiB (524MB), run=2527-2527msec
>>>>
>>>> Disk stats (read/write):
>>>> nvme0n2: ios=35614/4297, merge=0/0, ticks=11283/1441,
>>>> in_queue=12725, util=96.27%
>>>>
>>>> One of the main dependency to work on a block device with bs>ps is
>>>> Christoph's work on converting block device aops to use iomap.
>>>>
>>>> [1] https://lwn.net/Articles/934651/
>>>> [2] https://lwn.net/ml/linux-kernel/[email protected]/
>>>> [3]
>>>> https://github.com/Panky-codes/linux/tree/next-20230523-filemap-order-generic-v1
>>>>
>>>> Luis Chamberlain (1):
>>>> block: set mapping order for the block cache in set_init_blocksize
>>>>
>>>> Matthew Wilcox (Oracle) (1):
>>>> fs: Allow fine-grained control of folio sizes
>>>>
>>>> Pankaj Raghav (2):
>>>> filemap: use minimum order while allocating folios
>>>> nvme: enable logical block size > PAGE_SIZE
>>>>
>>>> block/bdev.c | 9 ++++++++
>>>> drivers/nvme/host/core.c | 2 +-
>>>> include/linux/pagemap.h | 46
>>>> ++++++++++++++++++++++++++++++++++++----
>>>> mm/filemap.c | 9 +++++---
>>>> mm/readahead.c | 34 ++++++++++++++++++++---------
>>>> 5 files changed, 82 insertions(+), 18 deletions(-)
>>>>
>>>
>>> Hmm. Most unfortunate; I've just finished my own patchset
>>> (duplicating much
>>> of this work) to get 'brd' running with large folios.
>>> And it even works this time, 'fsx' from the xfstest suite runs
>>> happily on
>>> that.
>>
>> So you've converted a filesystem to use bs > ps, too? Or is the
>> filesystem that fsx is running on just using normal 4kB block size?
>> If the latter, then fsx is not actually testing the large folio page
>> cache support, it's mostly just doing 4kB aligned IO to brd....
>>
> I have been running fsx on an xfs with bs=16k, and it worked like a charm.
> I'll try to run the xfstest suite once I'm finished with merging
> Pankajs patches into my patchset.
> Well, would've been too easy.
'fsx' bails out at test 27 (collapse), with:
XFS (ram0): Corruption detected. Unmount and run xfs_repair
XFS (ram0): Internal error isnullstartblock(got.br_startblock) at line
5787 of file fs/xfs/libxfs/xfs_bmap.c. Caller
xfs_bmap_collapse_extents+0x2d9/0x320 [xfs]
Guess some more work needs to be done here.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
On Thu, Jun 22, 2023 at 08:50:06AM +0200, Hannes Reinecke wrote:
> On 6/22/23 07:51, Hannes Reinecke wrote:
> > On 6/22/23 00:07, Dave Chinner wrote:
> > > On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
> > > > On 6/21/23 10:38, Pankaj Raghav wrote:
> > > > Hmm. Most unfortunate; I've just finished my own patchset
> > > > (duplicating much
> > > > of this work) to get 'brd' running with large folios.
> > > > And it even works this time, 'fsx' from the xfstest suite runs
> > > > happily on
> > > > that.
> > >
> > > So you've converted a filesystem to use bs > ps, too? Or is the
> > > filesystem that fsx is running on just using normal 4kB block size?
> > > If the latter, then fsx is not actually testing the large folio page
> > > cache support, it's mostly just doing 4kB aligned IO to brd....
> > >
> > I have been running fsx on an xfs with bs=16k, and it worked like a charm.
> > I'll try to run the xfstest suite once I'm finished with merging
> > Pankajs patches into my patchset.
> > Well, would've been too easy.
> 'fsx' bails out at test 27 (collapse), with:
>
> XFS (ram0): Corruption detected. Unmount and run xfs_repair
> XFS (ram0): Internal error isnullstartblock(got.br_startblock) at line 5787
> of file fs/xfs/libxfs/xfs_bmap.c. Caller
> xfs_bmap_collapse_extents+0x2d9/0x320 [xfs]
>
> Guess some more work needs to be done here.
Yup, start by trying to get the fstests that run fsx through cleanly
first. That'll get you through the first 100,000 or so test ops
in a few different run configs. Those canned tests are:
tests/generic/075
tests/generic/112
tests/generic/127
tests/generic/231
tests/generic/455
tests/generic/457
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 6/22/23 12:20, Dave Chinner wrote:
> On Thu, Jun 22, 2023 at 08:50:06AM +0200, Hannes Reinecke wrote:
>> On 6/22/23 07:51, Hannes Reinecke wrote:
>>> On 6/22/23 00:07, Dave Chinner wrote:
>>>> On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
>>>>> On 6/21/23 10:38, Pankaj Raghav wrote:
>>>>> Hmm. Most unfortunate; I've just finished my own patchset
>>>>> (duplicating much
>>>>> of this work) to get 'brd' running with large folios.
>>>>> And it even works this time, 'fsx' from the xfstest suite runs
>>>>> happily on
>>>>> that.
>>>>
>>>> So you've converted a filesystem to use bs > ps, too? Or is the
>>>> filesystem that fsx is running on just using normal 4kB block size?
>>>> If the latter, then fsx is not actually testing the large folio page
>>>> cache support, it's mostly just doing 4kB aligned IO to brd....
>>>>
>>> I have been running fsx on an xfs with bs=16k, and it worked like a charm.
>>> I'll try to run the xfstest suite once I'm finished with merging
>>> Pankajs patches into my patchset.
>>> Well, would've been too easy.
>> 'fsx' bails out at test 27 (collapse), with:
>>
>> XFS (ram0): Corruption detected. Unmount and run xfs_repair
>> XFS (ram0): Internal error isnullstartblock(got.br_startblock) at line 5787
>> of file fs/xfs/libxfs/xfs_bmap.c. Caller
>> xfs_bmap_collapse_extents+0x2d9/0x320 [xfs]
>>
>> Guess some more work needs to be done here.
>
> Yup, start by trying to get the fstests that run fsx through cleanly
> first. That'll get you through the first 100,000 or so test ops
> in a few different run configs. Those canned tests are:
>
> tests/generic/075
> tests/generic/112
> tests/generic/127
> tests/generic/231
> tests/generic/455
> tests/generic/457
>
THX.
Any preferences for the filesystem size?
I'm currently running off two ramdisks with 512M each; if that's too
small I need to increase the memory of the VM ...
Cheers,
Hannes
On Thu, Jun 22, 2023 at 12:23:10PM +0200, Hannes Reinecke wrote:
> On 6/22/23 12:20, Dave Chinner wrote:
> > On Thu, Jun 22, 2023 at 08:50:06AM +0200, Hannes Reinecke wrote:
> > > On 6/22/23 07:51, Hannes Reinecke wrote:
> > > > On 6/22/23 00:07, Dave Chinner wrote:
> > > > > On Wed, Jun 21, 2023 at 11:00:24AM +0200, Hannes Reinecke wrote:
> > > > > > On 6/21/23 10:38, Pankaj Raghav wrote:
> > > > > > Hmm. Most unfortunate; I've just finished my own patchset
> > > > > > (duplicating much
> > > > > > of this work) to get 'brd' running with large folios.
> > > > > > And it even works this time, 'fsx' from the xfstest suite runs
> > > > > > happily on
> > > > > > that.
> > > > >
> > > > > So you've converted a filesystem to use bs > ps, too? Or is the
> > > > > filesystem that fsx is running on just using normal 4kB block size?
> > > > > If the latter, then fsx is not actually testing the large folio page
> > > > > cache support, it's mostly just doing 4kB aligned IO to brd....
> > > > >
> > > > I have been running fsx on an xfs with bs=16k, and it worked like a charm.
> > > > I'll try to run the xfstest suite once I'm finished with merging
> > > > Pankajs patches into my patchset.
> > > > Well, would've been too easy.
> > > 'fsx' bails out at test 27 (collapse), with:
> > >
> > > XFS (ram0): Corruption detected. Unmount and run xfs_repair
> > > XFS (ram0): Internal error isnullstartblock(got.br_startblock) at line 5787
> > > of file fs/xfs/libxfs/xfs_bmap.c. Caller
> > > xfs_bmap_collapse_extents+0x2d9/0x320 [xfs]
> > >
> > > Guess some more work needs to be done here.
> >
> > Yup, start by trying to get the fstests that run fsx through cleanly
> > first. That'll get you through the first 100,000 or so test ops
> > in a few different run configs. Those canned tests are:
> >
> > tests/generic/075
> > tests/generic/112
> > tests/generic/127
> > tests/generic/231
> > tests/generic/455
> > tests/generic/457
> >
> THX.
>
> Any preferences for the filesystem size?
> I'm currently running off two ramdisks with 512M each; if that's too small I
> need to increase the memory of the VM ...
I generally run my pmem/ramdisk VM on a pair of 8GB ramdisks for 4kB
filesystem testing.
Because you are using larger block sizes, you are going to want to
use larger filesystems rather than smaller ones: there are fewer blocks
for a given size, and metadata blocks hold many more records before they
spill to multiple nodes/levels.
e.g. going from 4kB to 16kB needs a 16x larger fs and file sizes for
the 16kB filesystem to exercise the same metadata tree depth
coverage as the 4kB filesystem (i.e. each single block extent is 4x
larger, each single block metadata block holds 4x as much metadata
before it spills).
With this in mind, I'd say you want the 16kB block size ramdisks to
be as large as you can make them when running fstests....
Cheers,
Dave.
--
Dave Chinner
[email protected]
Hello,
kernel test robot noticed "WARNING:at_block/bdev.c:#set_init_blocksize" on:
commit: 01671335e600baacb98c70cdb010d539e5f615c2 ("[RFC 3/4] block: set mapping order for the block cache in set_init_blocksize")
url: https://github.com/intel-lab-lkp/linux/commits/Pankaj-Raghav/fs-Allow-fine-grained-control-of-folio-sizes/20230621-172322
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [RFC 3/4] block: set mapping order for the block cache in set_init_blocksize
in testcase: boot
compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]
[ 20.303844][ T130] ------------[ cut here ]------------
[ 20.304791][ T130] WARNING: CPU: 1 PID: 130 at block/bdev.c:142 set_init_blocksize (block/bdev.c:142 (discriminator 1))
[ 20.306150][ T130] Modules linked in: sr_mod cdrom sg ata_generic bochs intel_rapl_msr ata_piix intel_rapl_common drm_vram_helper crct10dif_pclmul crc32_pclmul ppdev crc32c_intel ghash_clmulni_intel sha512_ssse3 drm_kms_helper rapl libata syscopyarea sysfillrect sysimgblt drm_ttm_helper ttm joydev ipmi_devintf parport_pc ipmi_msghandler serio_raw i2c_piix4 parport fuse drm ip_tables
[ 20.310835][ T130] CPU: 1 PID: 130 Comm: systemd-udevd Not tainted 6.4.0-rc4-00504-g01671335e600 #1
[ 20.312212][ T130] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 20.313760][ T130] RIP: 0010:set_init_blocksize (block/bdev.c:142 (discriminator 1))
[ 20.314621][ T130] Code: c1 e2 08 48 09 d0 48 0d 00 e0 03 00 48 89 86 98 00 00 00 c3 cc cc cc cc 3d ff 0f 00 00 0f 86 7a ff ff ff c1 e8 09 89 c6 eb 91 <0f> 0b eb cc ba 09 00 00 00 eb 96 66 66 2e 0f 1f 84 00 00 00 00 00
All code
========
0: c1 e2 08 shl $0x8,%edx
3: 48 09 d0 or %rdx,%rax
6: 48 0d 00 e0 03 00 or $0x3e000,%rax
c: 48 89 86 98 00 00 00 mov %rax,0x98(%rsi)
13: c3 retq
14: cc int3
15: cc int3
16: cc int3
17: cc int3
18: 3d ff 0f 00 00 cmp $0xfff,%eax
1d: 0f 86 7a ff ff ff jbe 0xffffffffffffff9d
23: c1 e8 09 shr $0x9,%eax
26: 89 c6 mov %eax,%esi
28: eb 91 jmp 0xffffffffffffffbb
2a:* 0f 0b ud2 <-- trapping instruction
2c: eb cc jmp 0xfffffffffffffffa
2e: ba 09 00 00 00 mov $0x9,%edx
33: eb 96 jmp 0xffffffffffffffcb
35: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
3c: 00 00 00 00
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: eb cc jmp 0xffffffffffffffd0
4: ba 09 00 00 00 mov $0x9,%edx
9: eb 96 jmp 0xffffffffffffffa1
b: 66 66 2e 0f 1f 84 00 data16 nopw %cs:0x0(%rax,%rax,1)
12: 00 00 00 00
[ 20.317420][ T130] RSP: 0000:ffffadd780567c48 EFLAGS: 00010282
[ 20.318345][ T130] RAX: 00000000fffffd00 RBX: ffff8d79b1670c00 RCX: 000000000000001d
[ 20.319432][ T130] RDX: 00000000fffffffd RSI: ffff8d79b16710f8 RDI: ffff8d79b1670c00
[ 20.320641][ T130] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000584843
[ 20.321827][ T130] R10: 0000000000000000 R11: 0000000000586c43 R12: ffff8d7998831400
[ 20.322912][ T130] R13: ffff8d7998831400 R14: ffff8d7a7aca2800 R15: ffff8d79988315f8
[ 20.324947][ T130] FS: 0000000000000000(0000) GS:ffff8d7cafd00000(0063) knlGS:00000000f7929b00
[ 20.326981][ T130] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 20.328939][ T130] CR2: 0000000059b91a8c CR3: 0000000117ac4000 CR4: 00000000000406e0
[ 20.330843][ T130] Call Trace:
[ 20.332322][ T130] <TASK>
[ 20.333553][ T130] ? set_init_blocksize (block/bdev.c:142 (discriminator 1))
[ 20.335087][ T130] ? __warn (kernel/panic.c:673)
[ 20.336555][ T130] ? set_init_blocksize (block/bdev.c:142 (discriminator 1))
[ 20.337984][ T130] ? report_bug (lib/bug.c:180 lib/bug.c:219)
[ 20.339449][ T130] ? handle_bug (arch/x86/kernel/traps.c:324)
[ 20.340945][ T130] ? exc_invalid_op (arch/x86/kernel/traps.c:345 (discriminator 1))
[ 20.342386][ T130] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:568)
[ 20.344022][ T130] ? set_init_blocksize (block/bdev.c:142 (discriminator 1))
[ 20.345530][ T130] blkdev_get_whole (block/bdev.c:626)
[ 20.346950][ T130] blkdev_get_by_dev (block/bdev.c:837)
[ 20.348464][ T130] ? __pfx_blkdev_open (block/fops.c:474)
[ 20.349941][ T130] blkdev_open (block/fops.c:494)
[ 20.351309][ T130] do_dentry_open (fs/open.c:920)
[ 20.352846][ T130] do_open (fs/namei.c:3639)
[ 20.354190][ T130] ? open_last_lookups (fs/namei.c:3583)
[ 20.355564][ T130] path_openat (fs/namei.c:3792)
[ 20.356950][ T130] do_filp_open (fs/namei.c:3818)
[ 20.358276][ T130] do_sys_openat2 (fs/open.c:1356)
[ 20.359601][ T130] __ia32_compat_sys_openat (fs/open.c:1430)
[ 20.361103][ T130] __do_fast_syscall_32 (arch/x86/entry/common.c:112 arch/x86/entry/common.c:178)
[ 20.362534][ T130] do_fast_syscall_32 (arch/x86/entry/common.c:203)
[ 20.363986][ T130] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:122)
[ 20.365697][ T130] RIP: 0023:0xf7f90589
[ 20.366940][ T130] Code: 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
All code
========
0: 03 74 d8 01 add 0x1(%rax,%rbx,8),%esi
...
20: 00 51 52 add %dl,0x52(%rcx)
23: 55 push %rbp
24:* 89 e5 mov %esp,%ebp <-- trapping instruction
26: 0f 34 sysenter
28: cd 80 int $0x80
2a: 5d pop %rbp
2b: 5a pop %rdx
2c: 59 pop %rcx
2d: c3 retq
2e: 90 nop
2f: 90 nop
30: 90 nop
31: 90 nop
32: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
39: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
Code starting with the faulting instruction
===========================================
0: 5d pop %rbp
1: 5a pop %rdx
2: 59 pop %rcx
3: c3 retq
4: 90 nop
5: 90 nop
6: 90 nop
7: 90 nop
8: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
f: 8d b4 26 00 00 00 00 lea 0x0(%rsi,%riz,1),%esi
[ 20.370969][ T130] RSP: 002b:00000000ffd9b010 EFLAGS: 00200206 ORIG_RAX: 0000000000000127
[ 20.372809][ T130] RAX: ffffffffffffffda RBX: 00000000ffffff9c RCX: 00000000578c6540
[ 20.374652][ T130] RDX: 00000000000a8800 RSI: 0000000000000000 RDI: 000000005663b8ad
[ 20.376552][ T130] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 20.378381][ T130] R10: 0000000000000000 R11: 0000000000200206 R12: 0000000000000000
[ 20.380235][ T130] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 20.381888][ T130] </TASK>
[ 20.382968][ T130] ---[ end trace 0000000000000000 ]---
Kboot worker: lkp-worker23
Elapsed time: 60
kvm=(
qemu-system-x86_64
-enable-kvm
-cpu SandyBridge
-kernel $kernel
-initrd initrd-vm-meta-307.cgz
-m 16384
-smp 2
To reproduce:
# build kernel
cd linux
cp config-6.4.0-rc4-00504-g01671335e600 .config
make HOSTCC=gcc-12 CC=gcc-12 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-12 CC=gcc-12 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki