2023-04-14 13:50:55

by Hannes Reinecke

Subject: [PATCH] mm/filemap: allocate folios according to the blocksize

If the blocksize is larger than the pagesize, allocate folios
with the correct order.

Signed-off-by: Hannes Reinecke <[email protected]>
---
mm/filemap.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 05fd86752489..468f25714ced 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1927,7 +1927,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
folio_wait_stable(folio);
no_page:
if (!folio && (fgp_flags & FGP_CREAT)) {
- int err;
+ int err, order = 0;
if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
gfp |= __GFP_WRITE;
if (fgp_flags & FGP_NOFS)
@@ -1937,7 +1937,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
gfp |= GFP_NOWAIT | __GFP_NOWARN;
}

- folio = filemap_alloc_folio(gfp, 0);
+ if (mapping->host->i_blkbits > PAGE_SHIFT)
+ order = mapping->host->i_blkbits - PAGE_SHIFT;
+ folio = filemap_alloc_folio(gfp, order);
if (!folio)
return NULL;

@@ -2492,9 +2494,11 @@ static int filemap_create_folio(struct file *file,
struct folio_batch *fbatch)
{
struct folio *folio;
- int error;
+ int error, order = 0;

- folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+ if (mapping->host->i_blkbits > PAGE_SHIFT)
+ order = mapping->host->i_blkbits - PAGE_SHIFT;
+ folio = filemap_alloc_folio(mapping_gfp_mask(mapping), order);
if (!folio)
return -ENOMEM;

@@ -3607,14 +3611,16 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
pgoff_t index, filler_t filler, struct file *file, gfp_t gfp)
{
struct folio *folio;
- int err;
+ int err, order = 0;

+ if (mapping->host->i_blkbits > PAGE_SHIFT)
+ order = mapping->host->i_blkbits - PAGE_SHIFT;
if (!filler)
filler = mapping->a_ops->read_folio;
repeat:
folio = filemap_get_folio(mapping, index);
if (!folio) {
- folio = filemap_alloc_folio(gfp, 0);
+ folio = filemap_alloc_folio(gfp, order);
if (!folio)
return ERR_PTR(-ENOMEM);
err = filemap_add_folio(mapping, folio, index, gfp);
--
2.35.3
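
For concreteness, here is what the computation shared by the three hunks
yields (a sketch; the numbers assume 4 KiB pages, i.e. PAGE_SHIFT == 12):

/*
 * blocksize   i_blkbits   order   folio size
 *    4 KiB        12        0       4 KiB  (1 page)
 *   16 KiB        14        2      16 KiB  (4 pages)
 *   64 KiB        16        4      64 KiB (16 pages)
 */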


2023-04-14 14:00:30

by Matthew Wilcox

Subject: Re: [PATCH] mm/filemap: allocate folios according to the blocksize

On Fri, Apr 14, 2023 at 03:49:08PM +0200, Hannes Reinecke wrote:
> If the blocksize is larger than the pagesize, allocate folios
> with the correct order.

I think you also need to fix the one in page_cache_ra_unbounded(),
and page_cache_ra_order() may need some love as well.

It might help to put an assertion in __filemap_add_folio() to be sure
we're not doing something unacceptable.
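
Such an assertion could look roughly like this (a sketch; the exact
predicate and where it sits inside __filemap_add_folio() are assumptions,
not code from this thread):

/* Refuse to insert a multi-page folio into a mapping that has not
 * opted in to large folios.
 */
VM_BUG_ON_FOLIO(folio_test_large(folio) &&
		!mapping_large_folio_support(mapping), folio);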

2023-04-16 04:58:53

by Matthew Wilcox

Subject: Re: [PATCH] mm/filemap: allocate folios according to the blocksize

On Fri, Apr 14, 2023 at 03:49:08PM +0200, Hannes Reinecke wrote:
> @@ -3607,14 +3611,16 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
> pgoff_t index, filler_t filler, struct file *file, gfp_t gfp)
> {
> struct folio *folio;
> - int err;
> + int err, order = 0;
>
> + if (mapping->host->i_blkbits > PAGE_SHIFT)
> + order = mapping->host->i_blkbits - PAGE_SHIFT;

This pattern comes up a few times. What I did in a patch I wrote back
in December 2020 and never submitted (!) was this:


@@ -198,9 +198,15 @@ enum mapping_flags {
AS_EXITING = 4, /* final truncate in progress */
/* writeback related tags are not used */
AS_NO_WRITEBACK_TAGS = 5,
- AS_LARGE_FOLIO_SUPPORT = 6,
+ AS_FOLIO_ORDER_MIN = 8,
+ AS_FOLIO_ORDER_MAX = 13,
+ /* 8-17 are used for FOLIO_ORDER */
};

+#define AS_FOLIO_ORDER_MIN_MASK 0x00001f00
+#define AS_FOLIO_ORDER_MAX_MASK 0x0003e000
+#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)

...

+static inline unsigned mapping_min_folio_order(struct address_space *mapping)
+{
+ return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
+}

(do we really need 5 bits for each, or can we get by with eg 3 or 4 bits?)

Anyway, the point is that we could set this quite early in the creation
of the mapping, and eliminate the conditional setting of order.
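
The setter side of that scheme might look something like this (a sketch
against the bit layout above; the function name is an assumption):

static inline void mapping_set_folio_order_range(struct address_space *mapping,
						 unsigned int min_order,
						 unsigned int max_order)
{
	/* Pack both orders into the flag bits reserved above. */
	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
			 (min_order << AS_FOLIO_ORDER_MIN) |
			 (max_order << AS_FOLIO_ORDER_MAX);
}

A filesystem (or the blockdev setup code) would call it once while
initialising the inode; every allocation site could then ask the mapping
for its minimum order instead of re-deriving it from i_blkbits.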

2023-04-17 02:22:29

by kernel test robot

Subject: Re: [PATCH] mm/filemap: allocate folios according to the blocksize


Hello,

kernel test robot noticed "WARNING:at_include/linux/pagemap.h:#split_huge_page_to_list" on:

commit: 3f1c33c25c31221a7a27d302ce6aac8e9b71edbb ("[PATCH] mm/filemap: allocate folios according to the blocksize")
url: https://github.com/intel-lab-lkp/linux/commits/Hannes-Reinecke/mm-filemap-allocate-folios-according-to-the-blocksize/20230414-215648
base: v6.3-rc6
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH] mm/filemap: allocate folios according to the blocksize

in testcase: xfstests
version: xfstests-x86_64-a7df89e-1_20230410
with following parameters:

disk: 4HDD
fs: ext4
fs2: smbv3
test: generic-group-28



compiler: gcc-11
test machine: 8 threads Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (Skylake) with 16G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]


[ 96.792002][ T3511] ------------[ cut here ]------------
[ 96.797305][ T3511] WARNING: CPU: 3 PID: 3511 at include/linux/pagemap.h:344 split_huge_page_to_list (include/linux/pagemap.h:344 mm/huge_memory.c:2767)
[ 96.807438][ T3511] Modules linked in: loop cmac nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver dm_mod btrfs blake2b_generic xor intel_rapl_msr raid6_pq intel_rapl_common zstd_compress libcrc32c x86_pkg_temp_thermal intel_powerclamp coretemp sd_mod t10_pi kvm_intel crc64_rocksoft_generic i915 kvm crc64_rocksoft crc64 irqbypass crct10dif_pclmul sg crc32_pclmul ipmi_devintf crc32c_intel ipmi_msghandler drm_buddy ghash_clmulni_intel intel_gtt sha512_ssse3 drm_display_helper drm_kms_helper mei_wdt syscopyarea ahci rapl sysfillrect mei_me intel_cstate libahci sysimgblt wmi_bmof video intel_uncore serio_raw libata mei intel_pch_thermal ttm intel_pmc_core wmi acpi_pad tpm_infineon drm fuse ip_tables
[ 96.868429][ T3511] CPU: 3 PID: 3511 Comm: xfs_io Not tainted 6.3.0-rc6-00001-g3f1c33c25c31 #1
[ 96.877009][ T3511] Hardware name: HP HP Z240 SFF Workstation/802E, BIOS N51 Ver. 01.63 10/05/2017
[ 96.885923][ T3511] RIP: 0010:split_huge_page_to_list (include/linux/pagemap.h:344 mm/huge_memory.c:2767)
[ 96.891989][ T3511] Code: 89 f2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 31 02 00 00 49 8b 84 24 98 00 00 00 a8 40 0f 85 f6 fe ff ff <0f> 0b e9 ef fe ff ff 81 e6 60 ec 0b 00 a9 00 00 04 00 44 0f 45 ee
All code
========
0: 89 f2 mov %esi,%edx
2: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
9: fc ff df
c: 48 c1 ea 03 shr $0x3,%rdx
10: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)
14: 0f 85 31 02 00 00 jne 0x24b
1a: 49 8b 84 24 98 00 00 mov 0x98(%r12),%rax
21: 00
22: a8 40 test $0x40,%al
24: 0f 85 f6 fe ff ff jne 0xffffffffffffff20
2a:* 0f 0b ud2 <-- trapping instruction
2c: e9 ef fe ff ff jmpq 0xffffffffffffff20
31: 81 e6 60 ec 0b 00 and $0xbec60,%esi
37: a9 00 00 04 00 test $0x40000,%eax
3c: 44 0f 45 ee cmovne %esi,%r13d

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: e9 ef fe ff ff jmpq 0xfffffffffffffef6
7: 81 e6 60 ec 0b 00 and $0xbec60,%esi
d: a9 00 00 04 00 test $0x40000,%eax
12: 44 0f 45 ee cmovne %esi,%r13d
[ 96.911358][ T3511] RSP: 0018:ffffc9000420f9d8 EFLAGS: 00010046
[ 96.917253][ T3511] RAX: 0000000000000000 RBX: 1ffff92000841f43 RCX: ffffffff8192f43e
[ 96.925040][ T3511] RDX: 1ffff11084ce792a RSI: 0000000000000008 RDI: ffff88842673c950
[ 96.932836][ T3511] RBP: ffffea0005796700 R08: 0000000000000000 R09: ffff88842673c957
[ 96.940625][ T3511] R10: ffffed1084ce792a R11: 0000000000000001 R12: ffff88842673c8b8
[ 96.948419][ T3511] R13: 0000000000000000 R14: ffff88842673c950 R15: ffffea0005796700
[ 96.956211][ T3511] FS: 00007f83d8bdc800(0000) GS:ffff8883cf580000(0000) knlGS:0000000000000000
[ 96.964956][ T3511] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 96.971364][ T3511] CR2: 00007f83d8bdaef8 CR3: 000000013f57e006 CR4: 00000000003706e0
[ 96.979162][ T3511] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 96.986958][ T3511] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 96.994758][ T3511] Call Trace:
[ 96.997888][ T3511] <TASK>
[ 97.000679][ T3511] ? can_split_folio (mm/huge_memory.c:2646)
[ 97.005455][ T3511] ? __filemap_get_folio (arch/x86/include/asm/bitops.h:138 arch/x86/include/asm/bitops.h:144 include/asm-generic/bitops/instrumented-lock.h:58 include/linux/pagemap.h:915 include/linux/pagemap.h:951 mm/filemap.c:1936)
[ 97.010578][ T3511] ? zero_user_segments (include/linux/highmem.h:282)
[ 97.016651][ T3511] truncate_inode_partial_folio (mm/truncate.c:243)
[ 97.022376][ T3511] truncate_inode_pages_range (mm/truncate.c:380)
[ 97.027928][ T3511] ? folio_add_lru (arch/x86/include/asm/preempt.h:85 mm/swap.c:518)
[ 97.032358][ T3511] ? truncate_inode_partial_folio (mm/truncate.c:332)
[ 97.038261][ T3511] ? policy_node (include/linux/nodemask.h:266 mm/mempolicy.c:1869)
[ 97.042604][ T3511] ? __cond_resched (kernel/sched/core.c:8489)
[ 97.047115][ T3511] ? down_read (arch/x86/include/asm/atomic64_64.h:34 include/linux/atomic/atomic-long.h:41 include/linux/atomic/atomic-instrumented.h:1280 kernel/locking/rwsem.c:176 kernel/locking/rwsem.c:181 kernel/locking/rwsem.c:249 kernel/locking/rwsem.c:1249 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
[ 97.051372][ T3511] ? rwsem_down_read_slowpath (kernel/locking/rwsem.c:1518)
[ 97.056924][ T3511] ? SMB2_set_eof (fs/cifs/smb2pdu.c:5197) cifs
[ 97.062103][ T3511] ? up_read (arch/x86/include/asm/atomic64_64.h:160 include/linux/atomic/atomic-long.h:71 include/linux/atomic/atomic-instrumented.h:1318 kernel/locking/rwsem.c:1347 kernel/locking/rwsem.c:1616)
[ 97.066015][ T3511] ? unmap_mapping_range (mm/memory.c:3541)
[ 97.071051][ T3511] ? __do_fault (mm/memory.c:3541)
[ 97.075387][ T3511] ? _raw_spin_lock (arch/x86/include/asm/atomic.h:202 include/linux/atomic/atomic-instrumented.h:543 include/asm-generic/qspinlock.h:111 include/linux/spinlock.h:186 include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154)
[ 97.079896][ T3511] ? __do_fault (mm/memory.c:3541)
[ 97.084234][ T3511] ? zero_user_segments+0x19e/0x240 cifs
[ 97.090958][ T3511] truncate_pagecache (mm/truncate.c:744)
[ 97.095638][ T3511] smb3_simple_falloc+0xcbf/0x1840 cifs
[ 97.101834][ T3511] ? smb3_fiemap (fs/cifs/smb2ops.c:3587) cifs
[ 97.106919][ T3511] ? __do_sys_clone (kernel/fork.c:2812)
[ 97.111433][ T3511] ? __do_sys_vfork (kernel/fork.c:2812)
[ 97.115948][ T3511] vfs_fallocate (fs/open.c:324)
[ 97.120381][ T3511] __x64_sys_fallocate (include/linux/file.h:44 fs/open.c:348 fs/open.c:355 fs/open.c:353 fs/open.c:353)
[ 97.125238][ T3511] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 97.129490][ T3511] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[ 97.135206][ T3511] RIP: 0033:0x7f83d8f5d647
[ 97.139454][ T3511] Code: 89 7c 24 08 48 89 4c 24 18 e8 55 07 f9 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 8b 74 24 0c 8b 7c 24 08 b8 1d 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 89 44 24 08 e8 85 07 f9 ff 8b 44
All code
========
0: 89 7c 24 08 mov %edi,0x8(%rsp)
4: 48 89 4c 24 18 mov %rcx,0x18(%rsp)
9: e8 55 07 f9 ff callq 0xfffffffffff90763
e: 4c 8b 54 24 18 mov 0x18(%rsp),%r10
13: 48 8b 54 24 10 mov 0x10(%rsp),%rdx
18: 41 89 c0 mov %eax,%r8d
1b: 8b 74 24 0c mov 0xc(%rsp),%esi
1f: 8b 7c 24 08 mov 0x8(%rsp),%edi
23: b8 1d 01 00 00 mov $0x11d,%eax
28: 0f 05 syscall
2a:* 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax <-- trapping instruction
30: 77 31 ja 0x63
32: 44 89 c7 mov %r8d,%edi
35: 89 44 24 08 mov %eax,0x8(%rsp)
39: e8 85 07 f9 ff callq 0xfffffffffff907c3
3e: 8b .byte 0x8b
3f: 44 rex.R

Code starting with the faulting instruction
===========================================
0: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax
6: 77 31 ja 0x39
8: 44 89 c7 mov %r8d,%edi
b: 89 44 24 08 mov %eax,0x8(%rsp)
f: e8 85 07 f9 ff callq 0xfffffffffff90799
14: 8b .byte 0x8b
15: 44 rex.R
[ 97.158825][ T3511] RSP: 002b:00007fffd0739670 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[ 97.167058][ T3511] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f83d8f5d647
[ 97.174858][ T3511] RDX: 00000000000003ff RSI: 0000000000000000 RDI: 0000000000000004
[ 97.182644][ T3511] RBP: 000055cbb600cf50 R08: 0000000000000000 R09: 0000000000000000
[ 97.190444][ T3511] R10: 0000000000000002 R11: 0000000000000293 R12: 000055cbb4968340
[ 97.198236][ T3511] R13: 0000000000000000 R14: 000055cbb600cf30 R15: 000055cbb600cf50
[ 97.206035][ T3511] </TASK>
[ 97.208906][ T3511] ---[ end trace 0000000000000000 ]---
[ 97.331260][ T292] generic/568 _check_dmesg: something found in dmesg (see /lkp/benchmarks/xfstests/results//generic/568.dmesg)
[ 97.331271][ T292]


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.



--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests



Attachments:
(No filename) (9.96 kB)
config-6.3.0-rc6-00001-g3f1c33c25c31 (164.08 kB)
job-script (5.93 kB)
dmesg.xz (42.31 kB)
xfstests (2.83 kB)
job.yaml (4.97 kB)
reproduce (1.54 kB)

2023-04-17 04:28:27

by Matthew Wilcox

Subject: Re: [PATCH] mm/filemap: allocate folios according to the blocksize

On Mon, Apr 17, 2023 at 10:18:37AM +0800, kernel test robot wrote:
> commit: 3f1c33c25c31221a7a27d302ce6aac8e9b71edbb ("[PATCH] mm/filemap: allocate folios according to the blocksize")

> [ 96.797305][ T3511] WARNING: CPU: 3 PID: 3511 at include/linux/pagemap.h:344 split_huge_page_to_list (include/linux/pagemap.h:344 mm/huge_memory.c:2767)

Oh, funny. That's:

WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);

so it's found a large folio in a mapping which doesn't claim to support
large folios. This is a good warning; thank you, bot!
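
For reference, the helper behind that warning just tests a single opt-in
bit on the mapping, roughly:

static inline bool mapping_large_folio_support(struct address_space *mapping)
{
	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
}

The patch allocates order > 0 folios purely from i_blkbits without setting
that bit, so the first truncate that has to split one trips the warning.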

> [ 96.994758][ T3511] Call Trace:
> [ 96.997888][ T3511] <TASK>
> [ 97.016651][ T3511] truncate_inode_partial_folio (mm/truncate.c:243)
> [ 97.022376][ T3511] truncate_inode_pages_range (mm/truncate.c:380)
> [ 97.090958][ T3511] truncate_pagecache (mm/truncate.c:744)
> [ 97.095638][ T3511] smb3_simple_falloc+0xcbf/0x1840 cifs

Ah, it's coming from smb/cifs. I bet it's set i_blkbits to something
large. Looks like we need to do something like I suggested to set the
minimum folio size in the mapping, rather than basing it on
mapping->host->i_blkbits.
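
As a minimal stopgap until such a mechanism exists, whoever installs a
large i_blkbits could also flip the existing opt-in bit so the two stay in
agreement (my sketch, not something posted in this thread):

/* Declare large-folio support wherever a block size above the page size
 * is established for the inode, so the allocation order derived from
 * i_blkbits never exceeds what the mapping claims to support.
 */
if (inode->i_blkbits > PAGE_SHIFT)
	mapping_set_large_folios(inode->i_mapping);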

2023-04-17 04:41:20

by Christoph Hellwig

Subject: Re: [PATCH] mm/filemap: allocate folios according to the blocksize

On Fri, Apr 14, 2023 at 03:49:08PM +0200, Hannes Reinecke wrote:
> If the blocksize is larger than the pagesize, allocate folios
> with the correct order.

And how is that supposed to actually happen?

2023-04-17 06:14:26

by Hannes Reinecke

Subject: Re: [PATCH] mm/filemap: allocate folios according to the blocksize

On 4/17/23 06:36, Christoph Hellwig wrote:
> On Fri, Apr 14, 2023 at 03:49:08PM +0200, Hannes Reinecke wrote:
>> If the blocksize is larger than the pagesize, allocate folios
>> with the correct order.
>
> And how is that supposed to actually happen?

By using my brd patchset and setting the logical blocksize to e.g. 16k.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman

2023-04-17 06:40:15

by Hannes Reinecke

Subject: Re: [PATCH] mm/filemap: allocate folios according to the blocksize

On 4/17/23 08:27, Christoph Hellwig wrote:
> On Mon, Apr 17, 2023 at 08:08:04AM +0200, Hannes Reinecke wrote:
>> On 4/17/23 06:36, Christoph Hellwig wrote:
>>> On Fri, Apr 14, 2023 at 03:49:08PM +0200, Hannes Reinecke wrote:
>>>> If the blocksize is larger than the pagesize, allocate folios
>>>> with the correct order.
>>>
>>> And how is that supposed to actually happen?
>>
>> By using my brd patchset and setting the logical blocksize to e.g. 16k.
>
> Then add it to your patchset instead of trying to add dead code to the
> kernel while also losing context.

Yeah, sorry. Was a bit overexcited here.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman

2023-04-17 06:40:41

by Christoph Hellwig

Subject: Re: [PATCH] mm/filemap: allocate folios according to the blocksize

On Mon, Apr 17, 2023 at 08:08:04AM +0200, Hannes Reinecke wrote:
> On 4/17/23 06:36, Christoph Hellwig wrote:
> > On Fri, Apr 14, 2023 at 03:49:08PM +0200, Hannes Reinecke wrote:
> > > If the blocksize is larger than the pagesize, allocate folios
> > > with the correct order.
> >
> > And how is that supposed to actually happen?
>
> By using my brd patchset and setting the logical blocksize to e.g. 16k.

Then add it to your patchset instead of trying to add dead code to the
kernel while also losing context.

2023-04-20 12:10:14

by Pankaj Raghav

Subject: Re: [PATCH] mm/filemap: allocate folios according to the blocksize

To keep this thread alive and get some direction on the next steps, I made some changes
with which I am able to do **buffered reads** with fio on brd with logical block size > 4k.

Along with your patches (this patch and the brd patches), I added the following diff:

diff --git a/fs/mpage.c b/fs/mpage.c
index 242e213ee064..2e0c066d72d3 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -161,7 +161,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
struct folio *folio = args->folio;
struct inode *inode = folio->mapping->host;
const unsigned blkbits = inode->i_blkbits;
- const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+ const unsigned blocks_per_page = folio_size(folio) >> blkbits;
const unsigned blocksize = 1 << blkbits;
struct buffer_head *map_bh = &args->map_bh;
sector_t block_in_file;
diff --git a/mm/readahead.c b/mm/readahead.c
index 47afbca1d122..2e42b5127f4c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -210,7 +210,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
unsigned long index = readahead_index(ractl);
gfp_t gfp_mask = readahead_gfp_mask(mapping);
unsigned long i;
-
+ int order = 0;
/*
* Partway through the readahead operation, we will have added
* locked pages to the page cache, but will not yet have submitted
@@ -223,6 +223,9 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
*/
unsigned int nofs = memalloc_nofs_save();

+ if (mapping->host->i_blkbits > PAGE_SHIFT)
+ order = mapping->host->i_blkbits - PAGE_SHIFT;
+
filemap_invalidate_lock_shared(mapping);
/*
* Preallocate as many pages as we will need.
@@ -245,7 +248,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
continue;
}

- folio = filemap_alloc_folio(gfp_mask, 0);
+ folio = filemap_alloc_folio(gfp_mask, order);
if (!folio)
break;
if (filemap_add_folio(mapping, folio, index + i,
@@ -259,7 +262,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
if (i == nr_to_read - lookahead_size)
folio_set_readahead(folio);
ractl->_workingset |= folio_test_workingset(folio);
- ractl->_nr_pages++;
+ ractl->_nr_pages += folio_nr_pages(folio);
}


And with that (drum roll):

root@debian:~# cat /sys/block/ram0/queue/logical_block_size
8192
root@debian:~# fio -bs=8k -iodepth=8 -rw=read -ioengine=io_uring -size=200M -name=io_uring_1
-filename=/dev/ram0
io_uring_1: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=io_uring,
iodepth=8
fio-3.33
Starting 1 process

io_uring_1: (groupid=0, jobs=1): err= 0: pid=450: Thu Apr 20 11:34:10 2023
read: IOPS=94.8k, BW=741MiB/s (777MB/s)(40.0MiB/54msec)

<snip>

Run status group 0 (all jobs):
READ: bw=741MiB/s (777MB/s), 741MiB/s-741MiB/s (777MB/s-777MB/s), io=40.0MiB (41.9MB), run=54-54msec

Disk stats (read/write):
ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


**Questions on the future work**:

As willy pointed out, we have to do this `order = mapping->host->i_blkbits - PAGE_SHIFT` in
many places. Should we pursue something that willy suggested: encapsulating order in the
mapping->flags as a next step?[1]


[1] https://lore.kernel.org/lkml/[email protected]/
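
Independent of which way that goes, the repeated computation could be
factored into a single helper in the meantime (name and placement are an
assumption):

/* Hypothetical helper: the smallest folio order that still covers one
 * filesystem block; zero when a block fits in a single page.
 */
static inline unsigned int mapping_min_order(struct address_space *mapping)
{
	unsigned int blkbits = mapping->host->i_blkbits;

	return blkbits > PAGE_SHIFT ? blkbits - PAGE_SHIFT : 0;
}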

2023-04-20 12:31:32

by Hannes Reinecke

Subject: Re: [PATCH] mm/filemap: allocate folios according to the blocksize

On 4/20/23 14:05, Pankaj Raghav wrote:
> To keep this thread alive and get some direction on the next steps, I made some changes
> with which I am able to do **buffered reads** with fio on brd with logical block size > 4k.
>
> Along with your patches (this patch and the brd patches), I added the following diff:
>
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 242e213ee064..2e0c066d72d3 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -161,7 +161,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
> struct folio *folio = args->folio;
> struct inode *inode = folio->mapping->host;
> const unsigned blkbits = inode->i_blkbits;
> - const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
> + const unsigned blocks_per_page = folio_size(folio) >> blkbits;
> const unsigned blocksize = 1 << blkbits;
> struct buffer_head *map_bh = &args->map_bh;
> sector_t block_in_file;
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 47afbca1d122..2e42b5127f4c 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -210,7 +210,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> unsigned long index = readahead_index(ractl);
> gfp_t gfp_mask = readahead_gfp_mask(mapping);
> unsigned long i;
> -
> + int order = 0;
> /*
> * Partway through the readahead operation, we will have added
> * locked pages to the page cache, but will not yet have submitted
> @@ -223,6 +223,9 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> */
> unsigned int nofs = memalloc_nofs_save();
>
> + if (mapping->host->i_blkbits > PAGE_SHIFT)
> + order = mapping->host->i_blkbits - PAGE_SHIFT;
> +
> filemap_invalidate_lock_shared(mapping);
> /*
> * Preallocate as many pages as we will need.
> @@ -245,7 +248,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> continue;
> }
>
> - folio = filemap_alloc_folio(gfp_mask, 0);
> + folio = filemap_alloc_folio(gfp_mask, order);
> if (!folio)
> break;
> if (filemap_add_folio(mapping, folio, index + i,
> @@ -259,7 +262,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> if (i == nr_to_read - lookahead_size)
> folio_set_readahead(folio);
> ractl->_workingset |= folio_test_workingset(folio);
> - ractl->_nr_pages++;
> + ractl->_nr_pages += folio_nr_pages(folio);
> }
>
>
> And with that (drum roll):
>
> root@debian:~# cat /sys/block/ram0/queue/logical_block_size
> 8192
> root@debian:~# fio -bs=8k -iodepth=8 -rw=read -ioengine=io_uring -size=200M -name=io_uring_1
> -filename=/dev/ram0
> io_uring_1: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=io_uring,
> iodepth=8
> fio-3.33
> Starting 1 process
>
> io_uring_1: (groupid=0, jobs=1): err= 0: pid=450: Thu Apr 20 11:34:10 2023
> read: IOPS=94.8k, BW=741MiB/s (777MB/s)(40.0MiB/54msec)
>
> <snip>
>
> Run status group 0 (all jobs):
> READ: bw=741MiB/s (777MB/s), 741MiB/s-741MiB/s (777MB/s-777MB/s), io=40.0MiB (41.9MB), run=54-54msec
>
> Disk stats (read/write):
> ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
>
> **Questions on the future work**:
>
> As willy pointed out, we have to do this `order = mapping->host->i_blkbits - PAGE_SHIFT` in
> many places. Should we pursue something that willy suggested: encapsulating order in the
> mapping->flags as a next step?[1]
>
>
> [1] https://lore.kernel.org/lkml/[email protected]/

Well ... really, not sure.
Yes, continuing to update buffer_heads would be the logical thing, as it
could be done incrementally.

But really, the end goal should be to move away from buffer_heads for fs
and mm usage. So I wonder if we shouldn't rather look in that direction...

Cheers,

Hannes

2023-04-20 12:33:16

by Pankaj Raghav

Subject: Re: [PATCH] mm/filemap: allocate folios according to the blocksize

On 2023-04-20 14:19, Hannes Reinecke wrote:
>>
>> **Questions on the future work**:
>>
>> As willy pointed out, we have to do this `order = mapping->host->i_blkbits - PAGE_SHIFT` in
>> many places. Should we pursue something that willy suggested: encapsulating order in the
>> mapping->flags as a next step?[1]
>>
>>
>> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Well ... really, not sure.
> Yes, continuing to update buffer_heads would be the logical thing, as it could be done incrementally.
>
> But really, the end goal should be to move away from buffer_heads for fs and mm usage. So I wonder
> if we shouldn't rather look in that direction...
>
Yeah, I understand that part. Hopefully, this will be discussed as a part of LSFMM.

But the changes that are done in filemap and readahead need to be done anyway, irrespective of the
underlying aops, right? Or am I missing something?

> Cheers,
>
> Hannes
>

2023-04-20 15:22:15

by Darrick J. Wong

Subject: Re: [PATCH] mm/filemap: allocate folios according to the blocksize

On Thu, Apr 20, 2023 at 02:28:36PM +0200, Pankaj Raghav wrote:
> On 2023-04-20 14:19, Hannes Reinecke wrote:
> >>
> >> **Questions on the future work**:
> >>
> >> As willy pointed out, we have to do this `order = mapping->host->i_blkbits - PAGE_SHIFT` in
> >> many places. Should we pursue something that willy suggested: encapsulating order in the
> >> mapping->flags as a next step?[1]
> >>
> >>
> >> [1] https://lore.kernel.org/lkml/[email protected]/

I wouldn't mind XFS gaining a means to control folio sizes, personally.
At least that would make it easier to explore things like copy on write
with large weird file allocation unit sizes (2MB, 28k, etc).

> >
> > Well ... really, not sure.
> > Yes, continuing to update buffer_heads would be the logical thing, as it could be done incrementally.
> >
> > But really, the end goal should be to move away from buffer_heads for fs and mm usage. So I wonder
> > if we shouldn't rather look in that direction...
> >
> Yeah, I understand that part. Hopefully, this will be discussed as a part of LSFMM.

Agree.

--D

>
> But the changes that are done in filemap and readahead need to be done anyway, irrespective of the
> underlying aops, right? Or am I missing something?
>
> > Cheers,
> >
> > Hannes
> >

2023-04-22 00:26:20

by Dave Chinner

Subject: Re: [PATCH] mm/filemap: allocate folios according to the blocksize

On Thu, Apr 20, 2023 at 08:03:55AM -0700, Darrick J. Wong wrote:
> On Thu, Apr 20, 2023 at 02:28:36PM +0200, Pankaj Raghav wrote:
> > On 2023-04-20 14:19, Hannes Reinecke wrote:
> > >>
> > >> **Questions on the future work**:
> > >>
> > >> As willy pointed out, we have to do this `order = mapping->host->i_blkbits - PAGE_SHIFT` in
> > >> many places. Should we pursue something that willy suggested: encapsulating order in the
> > >> mapping->flags as a next step?[1]
> > >>
> > >>
> > >> [1] https://lore.kernel.org/lkml/[email protected]/
>
> I wouldn't mind XFS gaining a means to control folio sizes, personally.
> At least that would make it easier to explore things like copy on write
> with large weird file allocation unit sizes (2MB, 28k, etc).

[random historical musings follow]

Ah, how things go full circle. This is how XFS originally worked on
Irix - it had read and write IO sizes that it used to control the
size of buffers allocated for caching data in the per-inode chunk
cache. That stuff was in the original port to Linux; we even had a
mount option to allow users to set preferred IO sizes (biosize),
but that was removed in 2019 because it hadn't done anything for
more than a decade.

That was done around the same time the mp->m_readio_blocks and
mp->m_writeio_blocks variables that it originally influenced were
removed. Extent allocation sizes are still influenced this way by
the allocsize mount option, and m_writeio_blocks was renamed
m_allocsize_blocks to keep this functionality for extent
allocation...

Cheers,

Dave.
--
Dave Chinner
[email protected]