One of the first kernel panic we hit when we try to increase the
block size > 4k is inside create_page_buffers()[1]. Even though buffer.c
function do not support large folios (folios > PAGE_SIZE) at the moment,
these changes are required when we want to remove that constraint.
Willy had already mentioned that he wanted to convert create_page_buffers to
create_folio_buffers but didn't get to it yet, so I decided to take a
shot.
No functional changes introduced.
OI:
- I don't like the fact that I had to introduce
folio_create_empty_buffers() as create_empty_buffers() is used in
many parts of the kernel. Should I do a big bang change as a part of
this series where we convert create_empty_buffers to take a folio and
change the callers to pass a folio instead of a page?
- I split the series into 4 commits for clarity. I could squash them
into one patch if needed.
[1] https://lore.kernel.org/all/[email protected]/
Pankaj Raghav (4):
fs/buffer: add set_bh_folio helper
buffer: add alloc_folio_buffers() helper
fs/buffer: add folio_create_empty_buffers helper
fs/buffer: convert create_page_buffers to create_folio_buffers
fs/buffer.c | 131 +++++++++++++++++++++++++++++++++---
include/linux/buffer_head.h | 4 ++
2 files changed, 125 insertions(+), 10 deletions(-)
--
2.34.1
The folio version of set_bh_page(). This is required to convert
create_page_buffers() to create_folio_buffers() later in the series.
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/buffer.c | 15 +++++++++++++++
include/linux/buffer_head.h | 2 ++
2 files changed, 17 insertions(+)
diff --git a/fs/buffer.c b/fs/buffer.c
index b3eb905f87d6..44380ff3a31f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1484,6 +1484,21 @@ void set_bh_page(struct buffer_head *bh,
}
EXPORT_SYMBOL(set_bh_page);
+void set_bh_folio(struct buffer_head *bh, struct folio *folio,
+ unsigned long offset)
+{
+ bh->b_folio = folio;
+ BUG_ON(offset >= folio_size(folio));
+ if (folio_test_highmem(folio))
+ /*
+ * This catches illegal uses and preserves the offset:
+ */
+ bh->b_data = (char *)(0 + offset);
+ else
+ bh->b_data = folio_address(folio) + offset;
+}
+EXPORT_SYMBOL(set_bh_folio);
+
/*
* Called when truncating a buffer on a page completely.
*/
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 8f14dca5fed7..d5a2ef9b4cdf 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -196,6 +196,8 @@ void mark_buffer_write_io_error(struct buffer_head *bh);
void touch_buffer(struct buffer_head *bh);
void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset);
+void set_bh_folio(struct buffer_head *bh, struct folio *folio,
+ unsigned long offset);
bool try_to_free_buffers(struct folio *);
struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
bool retry);
--
2.34.1
Folio version of alloc_page_buffers() helper. This is required to convert
create_page_buffers() to create_folio_buffers() later in the series.
It removes one call to compound_head() compared to alloc_page_buffers().
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/buffer.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git a/fs/buffer.c b/fs/buffer.c
index 44380ff3a31f..0f9c2127543d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -900,6 +900,65 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
}
EXPORT_SYMBOL_GPL(alloc_page_buffers);
+/*
+ * Create the appropriate buffers when given a folio for data area and
+ * the size of each buffer.. Use the bh->b_this_page linked list to
+ * follow the buffers created. Return NULL if unable to create more
+ * buffers.
+ *
+ * The retry flag is used to differentiate async IO (paging, swapping)
+ * which may not fail from ordinary buffer allocations.
+ */
+struct buffer_head *alloc_folio_buffers(struct folio *folio, unsigned long size,
+ bool retry)
+{
+ struct buffer_head *bh, *head;
+ gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
+ long offset;
+ struct mem_cgroup *memcg, *old_memcg;
+
+ if (retry)
+ gfp |= __GFP_NOFAIL;
+
+ /* The folio lock pins the memcg */
+ memcg = folio_memcg(folio);
+ old_memcg = set_active_memcg(memcg);
+
+ head = NULL;
+ offset = folio_size(folio);
+ while ((offset -= size) >= 0) {
+ bh = alloc_buffer_head(gfp);
+ if (!bh)
+ goto no_grow;
+
+ bh->b_this_page = head;
+ bh->b_blocknr = -1;
+ head = bh;
+
+ bh->b_size = size;
+
+ /* Link the buffer to its folio */
+ set_bh_folio(bh, folio, offset);
+ }
+out:
+ set_active_memcg(old_memcg);
+ return head;
+/*
+ * In case anything failed, we just free everything we got.
+ */
+no_grow:
+ if (head) {
+ do {
+ bh = head;
+ head = head->b_this_page;
+ free_buffer_head(bh);
+ } while (head);
+ }
+
+ goto out;
+}
+EXPORT_SYMBOL_GPL(alloc_folio_buffers);
+
static inline void
link_dev_buffers(struct page *page, struct buffer_head *head)
{
--
2.34.1
Folio version of create_empty_buffers(). This is required to convert
create_page_buffers() to create_folio_buffers() later in the series.
It removes several calls to compound_head() as it works directly on folio
compared to create_empty_buffers().
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/buffer.c | 34 ++++++++++++++++++++++++++++++++++
include/linux/buffer_head.h | 2 ++
2 files changed, 36 insertions(+)
diff --git a/fs/buffer.c b/fs/buffer.c
index 0f9c2127543d..9e6a1a738fb5 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1645,6 +1645,40 @@ void block_invalidate_folio(struct folio *folio, size_t offset, size_t length)
}
EXPORT_SYMBOL(block_invalidate_folio);
+/*
+ * We attach and possibly dirty the buffers atomically wrt
+ * block_dirty_folio() via private_lock. try_to_free_buffers
+ * is already excluded via the folio lock.
+ */
+void folio_create_empty_buffers(struct folio *folio, unsigned long blocksize,
+ unsigned long b_state)
+{
+ struct buffer_head *bh, *head, *tail;
+
+ head = alloc_folio_buffers(folio, blocksize, true);
+ bh = head;
+ do {
+ bh->b_state |= b_state;
+ tail = bh;
+ bh = bh->b_this_page;
+ } while (bh);
+ tail->b_this_page = head;
+
+ spin_lock(&folio->mapping->private_lock);
+ if (folio_test_uptodate(folio) || folio_test_dirty(folio)) {
+ bh = head;
+ do {
+ if (folio_test_dirty(folio))
+ set_buffer_dirty(bh);
+ if (folio_test_uptodate(folio))
+ set_buffer_uptodate(bh);
+ bh = bh->b_this_page;
+ } while (bh != head);
+ }
+ folio_attach_private(folio, head);
+ spin_unlock(&folio->mapping->private_lock);
+}
+EXPORT_SYMBOL(folio_create_empty_buffers);
/*
* We attach and possibly dirty the buffers atomically wrt
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index d5a2ef9b4cdf..8afa91cbb8e2 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -203,6 +203,8 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
bool retry);
void create_empty_buffers(struct page *, unsigned long,
unsigned long b_state);
+void folio_create_empty_buffers(struct folio *, unsigned long,
+ unsigned long b_state);
void end_buffer_read_sync(struct buffer_head *bh, int uptodate);
void end_buffer_write_sync(struct buffer_head *bh, int uptodate);
void end_buffer_async_write(struct buffer_head *bh, int uptodate);
--
2.34.1
fs/buffer do not support large folios as there are many assumptions on
the folio size to be the host page size. This conversion is one step
towards removing that assumption. Also this conversion will reduce calls
to compound_head() if create_folio_buffers() calls
folio_create_empty_buffers().
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/buffer.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 9e6a1a738fb5..a83d9bf78ca5 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1802,14 +1802,17 @@ static inline int block_size_bits(unsigned int blocksize)
return ilog2(blocksize);
}
-static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state)
+static struct buffer_head *create_folio_buffers(struct folio *folio,
+ struct inode *inode,
+ unsigned int b_state)
{
- BUG_ON(!PageLocked(page));
+ BUG_ON(!folio_test_locked(folio));
- if (!page_has_buffers(page))
- create_empty_buffers(page, 1 << READ_ONCE(inode->i_blkbits),
- b_state);
- return page_buffers(page);
+ if (!folio_buffers(folio))
+ folio_create_empty_buffers(folio,
+ 1 << READ_ONCE(inode->i_blkbits),
+ b_state);
+ return folio_buffers(folio);
}
/*
@@ -1853,8 +1856,8 @@ int __block_write_full_page(struct inode *inode, struct page *page,
int nr_underway = 0;
blk_opf_t write_flags = wbc_to_write_flags(wbc);
- head = create_page_buffers(page, inode,
- (1 << BH_Dirty)|(1 << BH_Uptodate));
+ head = create_folio_buffers(page_folio(page), inode,
+ (1 << BH_Dirty) | (1 << BH_Uptodate));
/*
* Be very careful. We have no exclusion from block_dirty_folio
@@ -2117,7 +2120,7 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
BUG_ON(to > PAGE_SIZE);
BUG_ON(from > to);
- head = create_page_buffers(&folio->page, inode, 0);
+ head = create_folio_buffers(folio, inode, 0);
blocksize = head->b_size;
bbits = block_size_bits(blocksize);
@@ -2403,7 +2406,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
- head = create_page_buffers(&folio->page, inode, 0);
+ head = create_folio_buffers(folio, inode, 0);
blocksize = head->b_size;
bbits = block_size_bits(blocksize);
--
2.34.1
On Fri, Apr 14, 2023 at 01:08:19PM +0200, Pankaj Raghav wrote:
> Folio version of alloc_page_buffers() helper. This is required to convert
> create_page_buffers() to create_folio_buffers() later in the series.
>
> It removes one call to compound_head() compared to alloc_page_buffers().
I would convert alloc_page_buffers() to folio_alloc_buffers() and
add
static struct buffer_head *alloc_page_buffers(struct page *page,
unsigned long size, bool retry)
{
return folio_alloc_buffers(page_folio(page), size, retry);
}
in buffer_head.h
(there are only five callers, so this feels like a better tradeoff
than creating a new function)
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> fs/buffer.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 59 insertions(+)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 44380ff3a31f..0f9c2127543d 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -900,6 +900,65 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
> }
> EXPORT_SYMBOL_GPL(alloc_page_buffers);
>
> +/*
> + * Create the appropriate buffers when given a folio for data area and
> + * the size of each buffer.. Use the bh->b_this_page linked list to
> + * follow the buffers created. Return NULL if unable to create more
> + * buffers.
> + *
> + * The retry flag is used to differentiate async IO (paging, swapping)
> + * which may not fail from ordinary buffer allocations.
> + */
> +struct buffer_head *alloc_folio_buffers(struct folio *folio, unsigned long size,
> + bool retry)
> +{
> + struct buffer_head *bh, *head;
> + gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
> + long offset;
> + struct mem_cgroup *memcg, *old_memcg;
> +
> + if (retry)
> + gfp |= __GFP_NOFAIL;
> +
> + /* The folio lock pins the memcg */
> + memcg = folio_memcg(folio);
> + old_memcg = set_active_memcg(memcg);
> +
> + head = NULL;
> + offset = folio_size(folio);
> + while ((offset -= size) >= 0) {
> + bh = alloc_buffer_head(gfp);
> + if (!bh)
> + goto no_grow;
> +
> + bh->b_this_page = head;
> + bh->b_blocknr = -1;
> + head = bh;
> +
> + bh->b_size = size;
> +
> + /* Link the buffer to its folio */
> + set_bh_folio(bh, folio, offset);
> + }
> +out:
> + set_active_memcg(old_memcg);
> + return head;
> +/*
> + * In case anything failed, we just free everything we got.
> + */
> +no_grow:
> + if (head) {
> + do {
> + bh = head;
> + head = head->b_this_page;
> + free_buffer_head(bh);
> + } while (head);
> + }
> +
> + goto out;
> +}
> +EXPORT_SYMBOL_GPL(alloc_folio_buffers);
> +
> static inline void
> link_dev_buffers(struct page *page, struct buffer_head *head)
> {
> --
> 2.34.1
>
On Fri, Apr 14, 2023 at 01:08:20PM +0200, Pankaj Raghav wrote:
> Folio version of create_empty_buffers(). This is required to convert
> create_page_buffers() to create_folio_buffers() later in the series.
>
> It removes several calls to compound_head() as it works directly on folio
> compared to create_empty_buffers().
Again, I would create a create_empty_buffers() wrapper, but this time I
would put it in buffer.c.
On Fri, Apr 14, 2023 at 01:08:21PM +0200, Pankaj Raghav wrote:
> fs/buffer do not support large folios as there are many assumptions on
> the folio size to be the host page size. This conversion is one step
> towards removing that assumption. Also this conversion will reduce calls
> to compound_head() if create_folio_buffers() calls
> folio_create_empty_buffers().
I'd call this folio_create_buffers(), but other than that, looks good.
On 4/14/23 13:08, Pankaj Raghav wrote:
> One of the first kernel panic we hit when we try to increase the
> block size > 4k is inside create_page_buffers()[1]. Even though buffer.c
> function do not support large folios (folios > PAGE_SIZE) at the moment,
> these changes are required when we want to remove that constraint.
>
> Willy had already mentioned that he wanted to convert create_page_buffers to
> create_folio_buffers but didn't get to it yet, so I decided to take a
> shot.
>
> No functional changes introduced.
>
> OI:
> - I don't like the fact that I had to introduce
> folio_create_empty_buffers() as create_empty_buffers() is used in
> many parts of the kernel. Should I do a big bang change as a part of
> this series where we convert create_empty_buffers to take a folio and
> change the callers to pass a folio instead of a page?
>
> - I split the series into 4 commits for clarity. I could squash them
> into one patch if needed.
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> Pankaj Raghav (4):
> fs/buffer: add set_bh_folio helper
> buffer: add alloc_folio_buffers() helper
> fs/buffer: add folio_create_empty_buffers helper
> fs/buffer: convert create_page_buffers to create_folio_buffers
>
> fs/buffer.c | 131 +++++++++++++++++++++++++++++++++---
> include/linux/buffer_head.h | 4 ++
> 2 files changed, 125 insertions(+), 10 deletions(-)
>
Funnily enough, I've been tinkering along the same lines, and ended up
with pretty similar patches.
I've had to use two additional patches to get my modified 'brd' driver
off the ground with logical blocksize of 16k:
- mm/filemap: allocate folios according to the blocksize
(will be sending the patch separately)
- Modify read_folio() to use the correct order:
@@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio,
get_block_t *get_block)
if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
limit = inode->i_sb->s_maxbytes;
- VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
head = create_folio_buffers(folio, inode, 0);
blocksize = head->b_size;
bbits = block_size_bits(blocksize);
- iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
+ if (WARN_ON(PAGE_SHIFT < bbits)) {
+ iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT);
+ } else {
+ iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
+ }
lblock = (limit+blocksize-1) >> bbits;
bh = head;
nr = 0;
With that (and my modified brd driver) I've been able to set the logical
blocksize to 16k for brd and have it happily loaded.
Haven't tested the write path yet, though; there's surely quite some
work to be done.
BTW; I've got another patch replacing 'writepage' with 'write_folio'
(and the corresponding argument update). Is that a direction you want to go?
Cheers,
Hannes
On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote:
> BTW; I've got another patch replacing 'writepage' with 'write_folio'
> (and the corresponding argument update). Is that a direction you want to go?
No; ->writepage is being deleted. It's already gone from ext4 and xfs.
On 4/14/23 15:51, Matthew Wilcox wrote:
> On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote:
>> BTW; I've got another patch replacing 'writepage' with 'write_folio'
>> (and the corresponding argument update). Is that a direction you want to go?
>
> No; ->writepage is being deleted. It's already gone from ext4 and xfs.
Aw.
And here's me having converted block/fops over to using iomap w/
iomap_writepage(). Tough.
Oh well.
Wasn't a great fit anyway as for a sb_bread() replacement we would need
a sub-page access for iomap.
Question is whether we really need that or shouldn't read PAGE_SIZE
sectors always. Surely would make life easier. And would save us all the
logic with bh_lru etc as we can rely on the page cache.
But probably an item for the iomap discussion at LSF.
Unless you got plans already ...
Cheers,
Hannes
>> Pankaj Raghav (4):
>> fs/buffer: add set_bh_folio helper
>> buffer: add alloc_folio_buffers() helper
>> fs/buffer: add folio_create_empty_buffers helper
>> fs/buffer: convert create_page_buffers to create_folio_buffers
>>
>> fs/buffer.c | 131 +++++++++++++++++++++++++++++++++---
>> include/linux/buffer_head.h | 4 ++
>> 2 files changed, 125 insertions(+), 10 deletions(-)
>>
> Funnily enough, I've been tinkering along the same lines, and ended up with pretty similar patches.
> I've had to use two additional patches to get my modified 'brd' driver off the ground with logical
> blocksize of 16k:
Good to know that we are working on a similar direction here.
> - mm/filemap: allocate folios according to the blocksize
> (will be sending the patch separately)
> - Modify read_folio() to use the correct order:
>
> @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
> limit = inode->i_sb->s_maxbytes;
>
> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
> head = create_folio_buffers(folio, inode, 0);
> blocksize = head->b_size;
> bbits = block_size_bits(blocksize);
>
> - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> + if (WARN_ON(PAGE_SHIFT < bbits)) {
> + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT);
> + } else {
> + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> + }
> lblock = (limit+blocksize-1) >> bbits;
> bh = head;
> nr = 0;
>
>
> With that (and my modified brd driver) I've been able to set the logical blocksize to 16k for brd
> and have it happily loaded.
I will give your patches a try as well.
On 2023-04-14 15:06, Matthew Wilcox wrote:
> On Fri, Apr 14, 2023 at 01:08:19PM +0200, Pankaj Raghav wrote:
>> Folio version of alloc_page_buffers() helper. This is required to convert
>> create_page_buffers() to create_folio_buffers() later in the series.
>>
>> It removes one call to compound_head() compared to alloc_page_buffers().
>
> I would convert alloc_page_buffers() to folio_alloc_buffers() and
> add
>
> static struct buffer_head *alloc_page_buffers(struct page *page,
> unsigned long size, bool retry)
> {
> return folio_alloc_buffers(page_folio(page), size, retry);
> }
>
> in buffer_head.h
>
> (there are only five callers, so this feels like a better tradeoff
> than creating a new function)
>
That is a good idea and follows the usual pattern for folio conversion. I will
send a new version soon with your other comments as well.
--
Pankaj
On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote:
> On 4/14/23 13:08, Pankaj Raghav wrote:
> > One of the first kernel panic we hit when we try to increase the
> > block size > 4k is inside create_page_buffers()[1]. Even though buffer.c
> > function do not support large folios (folios > PAGE_SIZE) at the moment,
> > these changes are required when we want to remove that constraint.
> Funnily enough, I've been tinkering along the same lines, and ended up with
> pretty similar patches.
> I've had to use two additional patches to get my modified 'brd' driver off
> the ground with logical blocksize of 16k:
> - mm/filemap: allocate folios according to the blocksize
> (will be sending the patch separately)
> - Modify read_folio() to use the correct order:
>
> @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio,
> get_block_t *get_block)
> if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
> limit = inode->i_sb->s_maxbytes;
>
> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
> head = create_folio_buffers(folio, inode, 0);
> blocksize = head->b_size;
> bbits = block_size_bits(blocksize);
>
> - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> + if (WARN_ON(PAGE_SHIFT < bbits)) {
> + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT);
> + } else {
> + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> + }
> lblock = (limit+blocksize-1) >> bbits;
> bh = head;
> nr = 0;
At a quick glance I think both approaches (unless Hannes already did it)
seem to just miss that pesky static *arr[MAX_BUF_PER_PAGE], and so I
think we need to:
a) dynamically allocate those now
b) do a cursory review of the users of that and prepare them
to grok buffer heads which are blocksize based rather than
PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE.
Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for
bs > PAGE_SIZE right now.
Luis
On Fri, Apr 14, 2023 at 06:01:16PM -0700, Luis Chamberlain wrote:
> a) dynamically allocate those now
> b) do a cursory review of the users of that and prepare them
> to grok buffer heads which are blocksize based rather than
> PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE.
>
> Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for
> bs > PAGE_SIZE right now.
Worse, we'll overflow the array and corrupt the stack.
This one is a simple fix ...
+++ b/fs/buffer.c
@@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
{
struct inode *inode = folio->mapping->host;
sector_t iblock, lblock;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh, *head;
unsigned int blocksize, bbits;
int nr, i;
int fully_mapped = 1;
@@ -2335,7 +2335,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
if (buffer_uptodate(bh))
continue;
}
- arr[nr++] = bh;
} while (i++, iblock++, (bh = bh->b_this_page) != head);
if (fully_mapped)
@@ -2353,24 +2352,27 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
}
/* Stage two: lock the buffers */
- for (i = 0; i < nr; i++) {
- bh = arr[i];
+ bh = head;
+ do {
lock_buffer(bh);
mark_buffer_async_read(bh);
- }
+ bh = bh->b_this_page;
+ } while (bh != head);
/*
* Stage 3: start the IO. Check for uptodateness
* inside the buffer lock in case another process reading
* the underlying blockdev brought it uptodate (the sct fix).
*/
- for (i = 0; i < nr; i++) {
- bh = arr[i];
+ bh = head;
+ do {
if (buffer_uptodate(bh))
end_buffer_async_read(bh, 1);
else
submit_bh(REQ_OP_READ, bh);
- }
+ bh = bh->b_this_page;
+ } while (bh != head);
+
return 0;
}
EXPORT_SYMBOL(block_read_full_folio);
On Sat, Apr 15, 2023 at 03:31:54AM +0100, Matthew Wilcox wrote:
> On Fri, Apr 14, 2023 at 06:01:16PM -0700, Luis Chamberlain wrote:
> > a) dynamically allocate those now
> > b) do a cursory review of the users of that and prepare them
> > to grok buffer heads which are blocksize based rather than
> > PAGE_SIZE based. So we just try to kill MAX_BUF_PER_PAGE.
> >
> > Without a) I think buffers after PAGE_SIZE won't get submit_bh() or lock for
> > bs > PAGE_SIZE right now.
>
> Worse, we'll overflow the array and corrupt the stack.
>
> This one is a simple fix ...
>
> +++ b/fs/buffer.c
> @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> {
> struct inode *inode = folio->mapping->host;
> sector_t iblock, lblock;
> - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> + struct buffer_head *bh, *head;
> unsigned int blocksize, bbits;
> int nr, i;
> int fully_mapped = 1;
> @@ -2335,7 +2335,6 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> if (buffer_uptodate(bh))
> continue;
> }
> - arr[nr++] = bh;
> } while (i++, iblock++, (bh = bh->b_this_page) != head);
>
> if (fully_mapped)
> @@ -2353,24 +2352,27 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> }
>
> /* Stage two: lock the buffers */
> - for (i = 0; i < nr; i++) {
> - bh = arr[i];
> + bh = head;
> + do {
> lock_buffer(bh);
> mark_buffer_async_read(bh);
> - }
> + bh = bh->b_this_page;
> + } while (bh != head);
>
> /*
> * Stage 3: start the IO. Check for uptodateness
> * inside the buffer lock in case another process reading
> * the underlying blockdev brought it uptodate (the sct fix).
> */
> - for (i = 0; i < nr; i++) {
> - bh = arr[i];
> + bh = head;
> + do {
> if (buffer_uptodate(bh))
> end_buffer_async_read(bh, 1);
> else
> submit_bh(REQ_OP_READ, bh);
> - }
> + bh = bh->b_this_page;
> + } while (bh != head);
> +
> return 0;
I thought of that but I saw that the loop that assigns the arr only
pegs a bh if we don't "continue" for certain conditions, which made me
believe that we only wanted to keep on the array as non-null items which
meet the initial loop's criteria. If that is not accurate then yes,
the simplication is nice!
Luis
On Fri, Apr 14, 2023 at 08:24:56PM -0700, Luis Chamberlain wrote:
> I thought of that but I saw that the loop that assigns the arr only
> pegs a bh if we don't "continue" for certain conditions, which made me
> believe that we only wanted to keep on the array as non-null items which
> meet the initial loop's criteria. If that is not accurate then yes,
> the simplication is nice!
Uh, right. A little bit more carefully this time ... how does this
look?
diff --git a/fs/buffer.c b/fs/buffer.c
index 5e67e21b350a..dff671079b02 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
{
struct inode *inode = folio->mapping->host;
sector_t iblock, lblock;
- struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ struct buffer_head *bh, *head;
unsigned int blocksize, bbits;
int nr, i;
int fully_mapped = 1;
@@ -2335,7 +2335,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
if (buffer_uptodate(bh))
continue;
}
- arr[nr++] = bh;
+ nr++;
} while (i++, iblock++, (bh = bh->b_this_page) != head);
if (fully_mapped)
@@ -2352,25 +2352,29 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
return 0;
}
- /* Stage two: lock the buffers */
- for (i = 0; i < nr; i++) {
- bh = arr[i];
+ /*
+ * Stage two: lock the buffers. Recheck the uptodate flag under
+ * the lock in case somebody else brought it uptodate first.
+ */
+ bh = head;
+ do {
+ if (buffer_uptodate(bh))
+ continue;
lock_buffer(bh);
+ if (buffer_uptodate(bh)) {
+ unlock_buffer(bh);
+ continue;
+ }
mark_buffer_async_read(bh);
- }
+ } while ((bh = bh->b_this_page) != head);
- /*
- * Stage 3: start the IO. Check for uptodateness
- * inside the buffer lock in case another process reading
- * the underlying blockdev brought it uptodate (the sct fix).
- */
- for (i = 0; i < nr; i++) {
- bh = arr[i];
- if (buffer_uptodate(bh))
- end_buffer_async_read(bh, 1);
- else
+ /* Stage 3: start the IO */
+ bh = head;
+ do {
+ if (buffer_async_read(bh))
submit_bh(REQ_OP_READ, bh);
- }
+ } while ((bh = bh->b_this_page) != head);
+
return 0;
}
EXPORT_SYMBOL(block_read_full_folio);
I do wonder how much it's worth doing this vs switching to non-BH methods.
I appreciate that's a lot of work still.
On 4/15/23 05:44, Matthew Wilcox wrote:
> On Fri, Apr 14, 2023 at 08:24:56PM -0700, Luis Chamberlain wrote:
>> I thought of that but I saw that the loop that assigns the arr only
>> pegs a bh if we don't "continue" for certain conditions, which made me
>> believe that we only wanted to keep on the array as non-null items which
>> meet the initial loop's criteria. If that is not accurate then yes,
>> the simplication is nice!
>
> Uh, right. A little bit more carefully this time ... how does this
> look?
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 5e67e21b350a..dff671079b02 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2282,7 +2282,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> {
> struct inode *inode = folio->mapping->host;
> sector_t iblock, lblock;
> - struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
> + struct buffer_head *bh, *head;
> unsigned int blocksize, bbits;
> int nr, i;
> int fully_mapped = 1;
> @@ -2335,7 +2335,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> if (buffer_uptodate(bh))
> continue;
> }
> - arr[nr++] = bh;
> + nr++;
> } while (i++, iblock++, (bh = bh->b_this_page) != head);
>
> if (fully_mapped)
> @@ -2352,25 +2352,29 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
> return 0;
> }
>
> - /* Stage two: lock the buffers */
> - for (i = 0; i < nr; i++) {
> - bh = arr[i];
> + /*
> + * Stage two: lock the buffers. Recheck the uptodate flag under
> + * the lock in case somebody else brought it uptodate first.
> + */
> + bh = head;
> + do {
> + if (buffer_uptodate(bh))
> + continue;
> lock_buffer(bh);
> + if (buffer_uptodate(bh)) {
> + unlock_buffer(bh);
> + continue;
> + }
> mark_buffer_async_read(bh);
> - }
> + } while ((bh = bh->b_this_page) != head);
>
> - /*
> - * Stage 3: start the IO. Check for uptodateness
> - * inside the buffer lock in case another process reading
> - * the underlying blockdev brought it uptodate (the sct fix).
> - */
> - for (i = 0; i < nr; i++) {
> - bh = arr[i];
> - if (buffer_uptodate(bh))
> - end_buffer_async_read(bh, 1);
> - else
> + /* Stage 3: start the IO */
> + bh = head;
> + do {
> + if (buffer_async_read(bh))
> submit_bh(REQ_OP_READ, bh);
> - }
> + } while ((bh = bh->b_this_page) != head);
> +
> return 0;
> }
> EXPORT_SYMBOL(block_read_full_folio);
>
>
> I do wonder how much it's worth doing this vs switching to non-BH methods.
> I appreciate that's a lot of work still.
That's what I've been wondering, too.
I would _vastly_ prefer to switch over to iomap; however, the blasted
sb_bread() is getting in the way. Currently iomap only runs on entire
pages / folios, but a lot of (older) filesystems insist on doing 512
byte I/O. While this seem logical (seeing that 512 bytes is the
default, and, in most cases, the only supported sector size) question
is whether _we_ from the linux side need to do that.
We _could_ upgrade to always do full page I/O; there's a good
chance we'll be using the entire page anyway eventually.
And with storage bandwidth getting larger and larger we might even
get a performance boost there.
And it would save us having to implement sub-page I/O for iomap.
Hmm?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote:
> On 4/15/23 05:44, Matthew Wilcox wrote:
> > I do wonder how much it's worth doing this vs switching to non-BH methods.
> > I appreciate that's a lot of work still.
>
> That's what I've been wondering, too.
>
> I would _vastly_ prefer to switch over to iomap; however, the blasted
> sb_bread() is getting in the way. Currently iomap only runs on entire
> pages / folios, but a lot of (older) filesystems insist on doing 512
Hang on, no, iomap can issue sub-page reads. eg iomap_read_folio_sync()
will read the parts of the folio which have not yet been read when
called from __iomap_write_begin().
> byte I/O. While this seem logical (seeing that 512 bytes is the
> default, and, in most cases, the only supported sector size) question
> is whether _we_ from the linux side need to do that.
> We _could_ upgrade to always do full page I/O; there's a good
> chance we'll be using the entire page anyway eventually.
> And with storage bandwidth getting larger and larger we might even
> get a performance boost there.
I think we need to look at this from the filesystem side. What do
filesystems actually want to do? The first thing is they want to read
the superblock. That's either going to be immediately freed ("Oh,
this isn't a JFS filesystem after all") or it's going to hang around
indefinitely. There's no particular need to keep it in any kind of
cache (buffer or page). Except ... we want to probe a dozen different
filesystems, and half of them keep their superblock at the same offset
from the start of the block device. So we do want to keep it cached.
That's arguing for using the page cache, at least to read it.
Now, do we want userspace to be able to dd a new superblock into place
and have the mounted filesystem see it? I suspect that confuses just
about every filesystem out there. So I think the right answer is to read
the page into the bdev's page cache and then copy it into a kmalloc'ed
buffer which the filesystem is then responsible for freeing. It's also
responsible for writing it back (so that's another API we need), and for
a journalled filesystem, it needs to fit into the journalling scheme.
Also, we may need to write back multiple copies of the superblock,
possibly with slight modifications.
There are a lot of considerations here, and I don't feel like I have
enough of an appreciation of filesystem needs to come up with a decent
API. I'd hope we can get a good discussion going at LSFMM.
On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote:
> > On 4/15/23 05:44, Matthew Wilcox wrote:
> > We _could_ upgrade to always do full page I/O; there's a good
> > chance we'll be using the entire page anyway eventually.
*Iff* doing away with buffer head 512 granularity could help block sizes
greater than page size where physical and logical block size > PAGE_SIZE
we whould also be able to see it on 4kn drives (logical and physical
block size == 4k). A projection could be made after.
In so far as experimenting with this, if you already have some
effort on IOMAP for bdev aops one possibility for pure experimentation
for now would be to peg a new set of aops could be set in the path
of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early
for us to know if the device is has (lbs = pbs) > 512. For NVMe for
instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We
put together and set the logical and phyiscal block size on NVMe on
nvme_update_ns_info() --> nvme_update_disk_info(), right before we
call device_add_disk(). The only way to override the aops then would
be right before device_add_disk(), or as part of a new device_add_disk_aops()
or whatever.
> > And with storage bandwidth getting larger and larger we might even
> > get a performance boost there.
>
> I think we need to look at this from the filesystem side.
Before that let's recap the bdev cache current issues.
Today by just adding the disk we move on to partition scanning
immediately unless your block driver has a flag that says otherwise. The
current crash we're evaluating with brd and that we also hit with NVMe
is due to this part.
device_add_disk() -->
disk_scan_partitions() -->
blkdev_get_whole() -->
bdev_disk_changed() -->
filemap_read_folio() --> filler()
The filler is from aops.
We don't even have a filesystem yet on these devices at this point. The entire
partition core does this partition scanning. Refer to:
disk_scan_partitions() --> block/partitions/core.c : bdev_disk_changed()
And all of that stuff is also under a 512-byte atomic operation assumption,
we could do better if wanted to.
> What do filesystems actually want to do?
So you are suggesting that the early reads of the block device by the
block cache and its use of the page cache cache should be aligned /
perhaps redesigned to assist more clearly with what modern filesystems
might actually would want today?
> The first thing is they want to read
> the superblock. That's either going to be immediately freed ("Oh,
> this isn't a JFS filesystem after all") or it's going to hang around
> indefinitely. There's no particular need to keep it in any kind of
> cache (buffer or page).
And the bdev cache would not be able to know before hand that's the case.
> Except ... we want to probe a dozen different
> filesystems, and half of them keep their superblock at the same offset
> from the start of the block device. So we do want to keep it cached.
> That's arguing for using the page cache, at least to read it.
Do we currently share anything from the bdev cache with the fs for this?
Let's say that first block device blocksize in memory.
> Now, do we want userspace to be able to dd a new superblock into place
> and have the mounted filesystem see it?
Not sure I follow this. dd a new super block?
> I suspect that confuses just
> about every filesystem out there. So I think the right answer is to read
> the page into the bdev's page cache and then copy it into a kmalloc'ed
> buffer which the filesystem is then responsible for freeing. It's also
> responsible for writing it back (so that's another API we need), and for
> a journalled filesystem, it needs to fit into the journalling scheme.
> Also, we may need to write back multiple copies of the superblock,
> possibly with slight modifications.
Are you considering these as extentions to the bdev cache?
Luis
On Sat, Apr 15, 2023 at 06:28:11PM -0700, Luis Chamberlain wrote:
> On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote:
> > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote:
> > > On 4/15/23 05:44, Matthew Wilcox wrote:
> > > We _could_ upgrade to always do full page I/O; there's a good
> > > chance we'll be using the entire page anyway eventually.
>
> *Iff* doing away with buffer head 512 granularity could help block sizes
> greater than page size where physical and logical block size > PAGE_SIZE
> we whould also be able to see it on 4kn drives (logical and physical
> block size == 4k). A projection could be made after.
>
> In so far as experimenting with this, if you already have some
> effort on IOMAP for bdev aops one possibility for pure experimentation
> for now would be to peg a new set of aops could be set in the path
> of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early
> for us to know if the device is has (lbs = pbs) > 512. For NVMe for
> instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We
> put together and set the logical and phyiscal block size on NVMe on
> nvme_update_ns_info() --> nvme_update_disk_info(), right before we
> call device_add_disk(). The only way to override the aops then would
> be right before device_add_disk(), or as part of a new device_add_disk_aops()
> or whatever.
I think you're making this harder than you need to.
For LBA size > PAGE_SIZE, there is always only going to be
one BH per folio-that-is-LBA-size, so all the problems with
more-than-8-BHs-per-4k-page don't actually exist. I don't think we
should be overriding the aops, and if we narrow the scope of large folio
support in blockdev t only supporting folio_size == LBA size, it becomes
much more feasible.
> > > And with storage bandwidth getting larger and larger we might even
> > > get a performance boost there.
> >
> > I think we need to look at this from the filesystem side.
>
> Before that let's recap the bdev cache current issues.
Ooh, yes, this is good! I was totally neglecting the partition
scanning code.
> Today by just adding the disk we move on to partition scanning
> immediately unless your block driver has a flag that says otherwise. The
> current crash we're evaluating with brd and that we also hit with NVMe
> is due to this part.
>
> device_add_disk() -->
> disk_scan_partitions() -->
> blkdev_get_whole() -->
> bdev_disk_changed() -->
> filemap_read_folio() --> filler()
>
> The filler is from aops.
Right, so you missed a step in that callchain, which is
read_mapping_folio(). That ends up in do_read_cache_folio(), which
contains the deadly:
folio = filemap_alloc_folio(gfp, 0);
so that needs to be changed to get the minimum folio order from the
mapping, and then it should work.
> > What do filesystems actually want to do?
>
> So you are suggesting that the early reads of the block device by the
> block cache and its use of the page cache cache should be aligned /
> perhaps redesigned to assist more clearly with what modern filesystems
> might actually would want today?
Perhaps? I'm just saying the needs of the block device are not the
only consideration here. I'd like an API that makes sense for the fs
author.
> > The first thing is they want to read
> > the superblock. That's either going to be immediately freed ("Oh,
> > this isn't a JFS filesystem after all") or it's going to hang around
> > indefinitely. There's no particular need to keep it in any kind of
> > cache (buffer or page).
>
> And the bdev cache would not be able to know before hand that's the case.
Right, nobody knows until it's been read and examined.
> > Except ... we want to probe a dozen different
> > filesystems, and half of them keep their superblock at the same offset
> > from the start of the block device. So we do want to keep it cached.
> > That's arguing for using the page cache, at least to read it.
>
> Do we currently share anything from the bdev cache with the fs for this?
> Let's say that first block device blocksize in memory.
sb_bread() is used by most filesystems, and the buffer cache aliases
into the page cache.
> > Now, do we want userspace to be able to dd a new superblock into place
> > and have the mounted filesystem see it?
>
> Not sure I follow this. dd a new super block?
In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
I can overwrite the superblock. Do we want filesystems to see that
kind of vandalism, or do we want the mounted filesystem to have its
own copy of the data and overwrite what userspace wrote the next time it
updates the superblock?
(the trick is that this may not be vandalism, it might be the sysadmin
updating the uuid or running some fsck-ish program or trying to update
the superblock to support fabulous-new-feature on next mount. does this
change the answer?)
> > I suspect that confuses just
> > about every filesystem out there. So I think the right answer is to read
> > the page into the bdev's page cache and then copy it into a kmalloc'ed
> > buffer which the filesystem is then responsible for freeing. It's also
> > responsible for writing it back (so that's another API we need), and for
> > a journalled filesystem, it needs to fit into the journalling scheme.
> > Also, we may need to write back multiple copies of the superblock,
> > possibly with slight modifications.
>
> Are you considering these as extentions to the bdev cache?
I'm trying to suggest some of the considerations that need to go into
a replacement for sb_bread().
On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote:
> On Sat, Apr 15, 2023 at 06:28:11PM -0700, Luis Chamberlain wrote:
> > On Sat, Apr 15, 2023 at 06:09:12PM +0100, Matthew Wilcox wrote:
> > > On Sat, Apr 15, 2023 at 03:14:33PM +0200, Hannes Reinecke wrote:
> > > > On 4/15/23 05:44, Matthew Wilcox wrote:
> > > > We _could_ upgrade to always do full page I/O; there's a good
> > > > chance we'll be using the entire page anyway eventually.
> >
> > *Iff* doing away with buffer head 512 granularity could help block sizes
> > greater than page size where physical and logical block size > PAGE_SIZE
> > we whould also be able to see it on 4kn drives (logical and physical
> > block size == 4k). A projection could be made after.
> >
> > In so far as experimenting with this, if you already have some
> > effort on IOMAP for bdev aops one possibility for pure experimentation
> > for now would be to peg a new set of aops could be set in the path
> > of __alloc_disk_node() --> bdev_alloc() but that's a wee-bit too early
> > for us to know if the device is has (lbs = pbs) > 512. For NVMe for
> > instance this would be nvme_alloc_ns() --> blk_mq_alloc_disk(). We
> > put together and set the logical and phyiscal block size on NVMe on
> > nvme_update_ns_info() --> nvme_update_disk_info(), right before we
> > call device_add_disk(). The only way to override the aops then would
> > be right before device_add_disk(), or as part of a new device_add_disk_aops()
> > or whatever.
>
> I think you're making this harder than you need to.
> For LBA size > PAGE_SIZE, there is always only going to be
> one BH per folio-that-is-LBA-size, so all the problems with
> more-than-8-BHs-per-4k-page don't actually exist.
Oh! Then yes, sorry!
> I don't think we
> should be overriding the aops, and if we narrow the scope of large folio
> support in blockdev t only supporting folio_size == LBA size, it becomes
> much more feasible.
I'm trying to think of the possible use cases where folio_size != LBA size
and I cannot immediately think of some. Yes there are cases where a
filesystem may use a different block for say meta data than data, but that
I believe is side issue, ie, read/writes for small metadata would have
to be accepted. At least for NVMe we have metadata size as part of the
LBA format, but from what I understand no Linux filesystem yet uses that.
> > > > And with storage bandwidth getting larger and larger we might even
> > > > get a performance boost there.
> > >
> > > I think we need to look at this from the filesystem side.
> >
> > Before that let's recap the bdev cache current issues.
>
> Ooh, yes, this is good! I was totally neglecting the partition
> scanning code.
>
> > Today by just adding the disk we move on to partition scanning
> > immediately unless your block driver has a flag that says otherwise. The
> > current crash we're evaluating with brd and that we also hit with NVMe
> > is due to this part.
> >
> > device_add_disk() -->
> > disk_scan_partitions() -->
> > blkdev_get_whole() -->
> > bdev_disk_changed() -->
> > filemap_read_folio() --> filler()
> >
> > The filler is from aops.
>
> Right, so you missed a step in that callchain, which is
> read_mapping_folio(). That ends up in do_read_cache_folio(), which
> contains the deadly:
>
> folio = filemap_alloc_folio(gfp, 0);
Right and before this we have:
if (!filler)
filler = mapping->a_ops->read_folio;
The folio is order 0 and after filemap_alloc_folio() its added to the
page cache, then filemap_read_folio(file, filler, folio)
uses the filler blkdev_read_folio() --> fs/buffer.c block_read_full_folio()
Just for posterity:
fs/buffer.c
int block_read_full_folio(struct folio *folio, get_block_t *get_block)
{
...
struct buffer_head *bh, *head,...;
...
head = create_page_buffers(&folio->page, inode, 0);
...
}
static struct buffer_head *create_page_buffers(struct page *page, struct inode *inode, unsigned int b_state)
{
BUG_ON(!PageLocked(page));
if (!page_has_buffers(page))
create_empty_buffers(page, 1 << READ_ONCE(inode->i_blkbits),
b_state);
return page_buffers(page);
}
void create_empty_buffers(struct page *page,
unsigned long blocksize, unsigned long b_state)
{
struct buffer_head *bh, *head, *tail;
head = alloc_page_buffers(page, blocksize, true);
bh = head;
do {
bh->b_state |= b_state; -----> CRASH HERE head is NULL
tail = bh;
bh = bh->b_this_page;
} while (bh);
tail->b_this_page = head;
}
Why is it NULL? The blocksize passed to alloc_page_buffers() is larger
than PAGE_SIZE and below offset is PAGE_SIZE. PAGE_SIZE - blocksize
gives a negative number and so the loop is not traversed.
struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
bool retry)
{
struct buffer_head *bh, *head;
gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
long offset;
struct mem_cgroup *memcg, *old_memcg;
if (retry)
gfp |= __GFP_NOFAIL;
/* The page lock pins the memcg */
memcg = page_memcg(page);
old_memcg = set_active_memcg(memcg);
head = NULL;
offset = PAGE_SIZE;
while ((offset -= size) >= 0) {
bh = alloc_buffer_head(gfp);
if (!bh)
goto no_grow;
bh->b_this_page = head;
bh->b_blocknr = -1;
head = bh;
bh->b_size = size;
/* Link the buffer to its page */
set_bh_page(bh, page, offset);
}
out:
set_active_memcg(old_memcg);
return head;
...
}
I see now what you say about the buffer head being of the block size
bh->b_size = size above.
> so that needs to be changed to get the minimum folio order from the
> mapping, and then it should work.
>
> > > What do filesystems actually want to do?
> >
> > So you are suggesting that the early reads of the block device by the
> > block cache and its use of the page cache cache should be aligned /
> > perhaps redesigned to assist more clearly with what modern filesystems
> > might actually would want today?
>
> Perhaps? I'm just saying the needs of the block device are not the
> only consideration here. I'd like an API that makes sense for the fs
> author.
Makes sense!
> > > The first thing is they want to read
> > > the superblock. That's either going to be immediately freed ("Oh,
> > > this isn't a JFS filesystem after all") or it's going to hang around
> > > indefinitely. There's no particular need to keep it in any kind of
> > > cache (buffer or page).
> >
> > And the bdev cache would not be able to know before hand that's the case.
>
> Right, nobody knows until it's been read and examined.
>
> > > Except ... we want to probe a dozen different
> > > filesystems, and half of them keep their superblock at the same offset
> > > from the start of the block device. So we do want to keep it cached.
> > > That's arguing for using the page cache, at least to read it.
> >
> > Do we currently share anything from the bdev cache with the fs for this?
> > Let's say that first block device blocksize in memory.
>
> sb_bread() is used by most filesystems, and the buffer cache aliases
> into the page cache.
I see thanks. I checked what xfs does and its xfs_readsb() uses its own
xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and
xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why
they did that.
> > > Now, do we want userspace to be able to dd a new superblock into place
> > > and have the mounted filesystem see it?
> >
> > Not sure I follow this. dd a new super block?
>
> In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
> I can overwrite the superblock. Do we want filesystems to see that
> kind of vandalism, or do we want the mounted filesystem to have its
> own copy of the data and overwrite what userspace wrote the next time it
> updates the superblock?
Oh, what happens today?
> (the trick is that this may not be vandalism, it might be the sysadmin
> updating the uuid or running some fsck-ish program or trying to update
> the superblock to support fabulous-new-feature on next mount. does this
> change the answer?)
>
> > > I suspect that confuses just
> > > about every filesystem out there. So I think the right answer is to read
> > > the page into the bdev's page cache and then copy it into a kmalloc'ed
> > > buffer which the filesystem is then responsible for freeing. It's also
> > > responsible for writing it back (so that's another API we need), and for
> > > a journalled filesystem, it needs to fit into the journalling scheme.
> > > Also, we may need to write back multiple copies of the superblock,
> > > possibly with slight modifications.
> >
> > Are you considering these as extentions to the bdev cache?
>
> I'm trying to suggest some of the considerations that need to go into
> a replacement for sb_bread().
I see! That would also help EOL buffer heads for that purpose.
Luis
On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote:
> On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote:
> > I don't think we
> > should be overriding the aops, and if we narrow the scope of large folio
> > support in blockdev t only supporting folio_size == LBA size, it becomes
> > much more feasible.
>
> I'm trying to think of the possible use cases where folio_size != LBA size
> and I cannot immediately think of some. Yes there are cases where a
> filesystem may use a different block for say meta data than data, but that
> I believe is side issue, ie, read/writes for small metadata would have
> to be accepted. At least for NVMe we have metadata size as part of the
> LBA format, but from what I understand no Linux filesystem yet uses that.
NVMe metadata is per-block metadata -- a CRC or similar. Filesystem
metadata is things like directories, inode tables, free space bitmaps,
etc.
> struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
> bool retry)
> {
[...]
> head = NULL;
> offset = PAGE_SIZE;
> while ((offset -= size) >= 0) {
>
> I see now what you say about the buffer head being of the block size
> bh->b_size = size above.
Yes, just changing that to 'offset = page_size(page);' will do the trick.
> > sb_bread() is used by most filesystems, and the buffer cache aliases
> > into the page cache.
>
> I see thanks. I checked what xfs does and its xfs_readsb() uses its own
> xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and
> xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why
> they did that.
IRIX didn't have an sb_bread() ;-)
> > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
> > I can overwrite the superblock. Do we want filesystems to see that
> > kind of vandalism, or do we want the mounted filesystem to have its
> > own copy of the data and overwrite what userspace wrote the next time it
> > updates the superblock?
>
> Oh, what happens today?
Depends on the filesystem, I think? Not really sure, to be honest.
On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote:
> > > > Except ... we want to probe a dozen different
> > > > filesystems, and half of them keep their superblock at the same offset
> > > > from the start of the block device. So we do want to keep it cached.
> > > > That's arguing for using the page cache, at least to read it.
> > >
> > > Do we currently share anything from the bdev cache with the fs for this?
> > > Let's say that first block device blocksize in memory.
> >
> > sb_bread() is used by most filesystems, and the buffer cache aliases
> > into the page cache.
>
> I see thanks. I checked what xfs does and its xfs_readsb() uses its own
> xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and
> xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why
> they did that.
XFS has it's own metadata address space for caching - it does not
use the block device page cache at all. This is not new, it never
has.
The xfs_buf buffer cache does not use the page cache, either. It
does it's own thing, has it's own indexing, locking, shrinkers, etc.
IOWs, it does not use the iomap infrastructure at all - iomap is
used by XFS exclusively for data IO.
As for why we use an uncached buffer for the superblock? That's
largely historic because prior to 2007 every modification that did
allocation/free needed to lock and modify the superblock at
transaction commit. Hence it's always needed in memory but a
critical fast path, so it is always directly available without
needing to do a cache lookup to callers that need it.
In 2007, lazy superblock counters got rid of the requirement to lock
the superblock buffer in every transaction commit, so the uncached
buffer optimisation hasn't really been needed for the past decade.
But if it ain't broke, don't try to fix it....
> > > > Now, do we want userspace to be able to dd a new superblock into place
> > > > and have the mounted filesystem see it?
> > >
> > > Not sure I follow this. dd a new super block?
> >
> > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
> > I can overwrite the superblock. Do we want filesystems to see that
> > kind of vandalism, or do we want the mounted filesystem to have its
> > own copy of the data and overwrite what userspace wrote the next time it
> > updates the superblock?
>
> Oh, what happens today?
In XFS, it will completely ignore the fact the the superblock got
trashed like this. When the fs goes idle, or the sb modified for
some other reason, it will relog the in-memory superblock and write
it back to disk, thereby fixing the corruption. i.e. while the
filesystem is mounted, the superblock is _write-only_...
> > (the trick is that this may not be vandalism, it might be the sysadmin
> > updating the uuid or running some fsck-ish program or trying to update
> > the superblock to support fabulous-new-feature on next mount. does this
> > change the answer?)
If you need to change anything in the superblock while the XFS fs is
mounted, then you have to use ioctls to modify the superblock
contents through the running transaction subsystem. Editting the
block device directly breaks the security model of filesystems that
assume they have exclusive access to the block device whilst the
filesystem is mounted....
-Dave.
--
Dave Chinner
[email protected]
On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote:
> @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio,
> get_block_t *get_block)
> if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
> limit = inode->i_sb->s_maxbytes;
>
> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
> head = create_folio_buffers(folio, inode, 0);
> blocksize = head->b_size;
> bbits = block_size_bits(blocksize);
>
> - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> + if (WARN_ON(PAGE_SHIFT < bbits)) {
> + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT);
> + } else {
> + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> + }
> lblock = (limit+blocksize-1) >> bbits;
> bh = head;
> nr = 0;
BTW I See this pattern in:
fs/mpage.c: do_mpage_readpage()
fs/mpage.c: __mpage_writepage()
A helper might be in order.
Luis
On 4/17/23 04:27, Luis Chamberlain wrote:
> On Fri, Apr 14, 2023 at 03:47:13PM +0200, Hannes Reinecke wrote:
>> @@ -2333,13 +2395,15 @@ int block_read_full_folio(struct folio *folio,
>> get_block_t *get_block)
>> if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode))
>> limit = inode->i_sb->s_maxbytes;
>>
>> - VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> -
>> head = create_folio_buffers(folio, inode, 0);
>> blocksize = head->b_size;
>> bbits = block_size_bits(blocksize);
>>
>> - iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
>> + if (WARN_ON(PAGE_SHIFT < bbits)) {
>> + iblock = (sector_t)folio->index >> (bbits - PAGE_SHIFT);
>> + } else {
>> + iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
>> + }
>> lblock = (limit+blocksize-1) >> bbits;
>> bh = head;
>> nr = 0;
>
> BTW I See this pattern in:
>
> fs/mpage.c: do_mpage_readpage()
> fs/mpage.c: __mpage_writepage()
>
> A helper might be in order.
>
But only _iff_ we decide to stick with buffer_heads and upgrade the
buffer_head code to handle multi-page block sizes.
The idea was to move over to iomap, hence I'm not sure into which
lengths we should go keeping buffer_heads alive ...
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman
On Sun, Apr 16, 2023 at 03:07:33PM +0100, Matthew Wilcox wrote:
> On Sat, Apr 15, 2023 at 10:26:42PM -0700, Luis Chamberlain wrote:
> > On Sun, Apr 16, 2023 at 04:40:06AM +0100, Matthew Wilcox wrote:
> > > I don't think we
> > > should be overriding the aops, and if we narrow the scope of large folio
> > > support in blockdev t only supporting folio_size == LBA size, it becomes
> > > much more feasible.
> >
> > I'm trying to think of the possible use cases where folio_size != LBA size
> > and I cannot immediately think of some. Yes there are cases where a
> > filesystem may use a different block for say meta data than data, but that
> > I believe is side issue, ie, read/writes for small metadata would have
> > to be accepted. At least for NVMe we have metadata size as part of the
> > LBA format, but from what I understand no Linux filesystem yet uses that.
>
> NVMe metadata is per-block metadata -- a CRC or similar. Filesystem
> metadata is things like directories, inode tables, free space bitmaps,
> etc.
>
> > struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
> > bool retry)
> > {
> [...]
> > head = NULL;
> > offset = PAGE_SIZE;
> > while ((offset -= size) >= 0) {
> >
> > I see now what you say about the buffer head being of the block size
> > bh->b_size = size above.
>
> Yes, just changing that to 'offset = page_size(page);' will do the trick.
>
> > > sb_bread() is used by most filesystems, and the buffer cache aliases
> > > into the page cache.
> >
> > I see thanks. I checked what xfs does and its xfs_readsb() uses its own
> > xfs_buf_read_uncached(). It ends up calling xfs_buf_submit() and
> > xfs_buf_ioapply_map() does it's own submit_bio(). So I'm curious why
> > they did that.
>
> IRIX didn't have an sb_bread() ;-)
>
> > > In userspace, if I run 'dd if=blah of=/dev/sda1 bs=512 count=1 seek=N',
> > > I can overwrite the superblock. Do we want filesystems to see that
> > > kind of vandalism, or do we want the mounted filesystem to have its
> > > own copy of the data and overwrite what userspace wrote the next time it
> > > updates the superblock?
> >
> > Oh, what happens today?
>
> Depends on the filesystem, I think? Not really sure, to be honest.
The filesystem driver sees the vandalism, and can very well crash as a
result[1]. In that case it was corrupted journal contents being
replayed, but the same thing would happen if you wrote a malicious
userspace program to set the metadata_csum feature flag in the ondisk
superblock after mounting the fs.
https://bugzilla.kernel.org/show_bug.cgi?id=82201#c4
I've tried to prevent people from writing to mounted block devices in
the past, but did not succeed. If you try to prevent programs from
opening such devices with O_RDWR/O_WRONLY you then break lvm tools which
require that ability even though they don't actually write anything to
the block device. If you make the block device write_iter function
fail, then old e2fsprogs breaks and you get shouted at for breaking
userspace.
Hence I decided to let security researchers find these bugs and control
the design discussion via CVE. That's not correct and it's not smart,
but it preserves some of my sanity.
--D