2009-08-26 20:00:21

by Jan Kara

Subject: Buffer state bits

Hello,

When working on my page_mkwrite() improvements for blocksize < pagesize,
I've put down a description of buffer state bits (because I was thinking
whether I could use some of them for my purpose). Below is what I've ended
up with - suggestions for improvements or even contributions are welcome. I
plan to put this somewhere in Documentation/ once it gets reasonably
complete...
There are some questions / suggestions for cleanups in there marked with
XXX so opinions on that are also welcome...

Honza


State bits in buffer heads
==========================

BH_Req
XXX: Not really used?

BH_Dirty
- Ideally, this bit should mean "buffer has data that have to be written". But
it is not quite true. The problem happens when someone calls set_page_dirty()
on the page to which buffers are attached or similarly when buffers are
attached to a dirty page. Then all buffers attached to the page are marked
dirty - even those that are beyond end of file which obviously should not
be written.

When a buffer is dirty, the page has to be dirty as well (mark_buffer_dirty()
takes care of that). It is not necessarily the other way around, and the buffer
dirty bit is what ultimately decides whether the buffer goes to disk or not.
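
The relationship between the two dirty bits can be sketched as a toy
userspace model. This is not kernel code: the struct layout and all model_*
names are assumptions made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, not kernel code. All names are hypothetical. */
#define MODEL_BUFS_PER_PAGE 4

struct model_buffer {
    bool dirty;
};

struct model_page {
    bool dirty;
    struct model_buffer bufs[MODEL_BUFS_PER_PAGE];
};

/* Dirtying a buffer also dirties its page. */
static void model_mark_buffer_dirty(struct model_page *page, size_t idx)
{
    assert(idx < MODEL_BUFS_PER_PAGE);
    page->bufs[idx].dirty = true;
    page->dirty = true;
}

/*
 * Dirtying the page marks every attached buffer dirty. There is no
 * file-size check here, so buffers beyond EOF get dirtied as well.
 */
static void model_set_page_dirty(struct model_page *page)
{
    for (size_t i = 0; i < MODEL_BUFS_PER_PAGE; i++)
        page->bufs[i].dirty = true;
    page->dirty = true;
}

/* Invariant from the text: a dirty buffer implies a dirty page. */
static bool model_invariant_holds(const struct model_page *page)
{
    for (size_t i = 0; i < MODEL_BUFS_PER_PAGE; i++)
        if (page->bufs[i].dirty && !page->dirty)
            return false;
    return true;
}

/* Writeout is decided per buffer: count the buffers that would be written. */
static size_t model_buffers_to_write(const struct model_page *page)
{
    size_t n = 0;

    for (size_t i = 0; i < MODEL_BUFS_PER_PAGE; i++)
        if (page->bufs[i].dirty)
            n++;
    return n;
}
```

In this model, dirtying one buffer yields one buffer to write, while
dirtying the page yields all four, regardless of where the file ends.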

BH_Lock
- Used as bit spinlock. Buffer is locked when it is submitted for IO and unlocked
when the IO finishes. It is used by other places to protect against IO happening
on the buffer (e.g. when copying new data into the buffer etc.).
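
A lock kept in one bit of a state word can be sketched in userspace with C11
atomics. This only illustrates the idea of a bit lock. The lockbit_* names
and the bit number are assumptions, and only a trylock is shown, so nothing
here says how the kernel waits for a contended buffer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit number, chosen only for this sketch. */
enum { LOCKBIT_NR = 0 };

struct lockbit_bh {
    atomic_ulong state;
};

/* Try to set the lock bit. Returns true if this caller acquired the lock. */
static bool lockbit_trylock(struct lockbit_bh *bh)
{
    unsigned long mask = 1UL << LOCKBIT_NR;
    unsigned long old = atomic_fetch_or_explicit(&bh->state, mask,
                                                 memory_order_acquire);

    return (old & mask) == 0;
}

/* Clear the lock bit. Unlocking a buffer that is not locked is a bug. */
static void lockbit_unlock(struct lockbit_bh *bh)
{
    unsigned long mask = 1UL << LOCKBIT_NR;
    unsigned long old = atomic_fetch_and_explicit(&bh->state, ~mask,
                                                  memory_order_release);

    assert(old & mask);
    (void)old;
}

/* Report whether the lock bit is currently set. */
static bool lockbit_is_locked(struct lockbit_bh *bh)
{
    return (atomic_load(&bh->state) & (1UL << LOCKBIT_NR)) != 0;
}
```

Code that copies new data into a buffer would take the lock first, so that it
cannot overlap with IO that holds the same bit until completion.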

BH_Uptodate
- Buffer contains data that can be trusted. Generally, this flag means that
what is stored in memory is at least as new as what is stored on disk in the
corresponding block (if it has already been allocated). For buffers that are
covering a hole to which the user has not yet written, the flag means the buffer
is correctly filled with zeros. Buffers beyond the end of file are the only
ones where the contents actually cannot be trusted even though BH_Uptodate bit
is set. User can mmap the last page of the file and write even to buffers
beyond EOF attached to this page. So these buffers can contain anything
although one might expect them to contain zeros.

The flag is set in end_io handlers (under buffer lock) and in other places
copying data into the buffer / page (under a page lock for data buffers and
buffer lock for metadata buffers). The bit is cleared in end_io handlers when
the IO failed. The problem with this is that when the failing IO was a write,
the resulting buffer state is not accurate since the buffer holds newer data
than is on disk. Long term, we want to get rid of clearing the uptodate bit on
a failed write, so use BH_Write_EIO for write error detection in new code.
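
The long-term direction described above can be sketched as a userspace model
of the completion handlers. This models the proposed behaviour, not what the
kernel currently does, and the upd_* names are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one buffer. All names are hypothetical. */
struct upd_bh {
    bool locked;
    bool uptodate;
    bool write_eio;
};

/* Submitting IO locks the buffer until the completion handler runs. */
static void upd_submit(struct upd_bh *bh)
{
    assert(!bh->locked);
    bh->locked = true;
}

/* Read completion: memory can be trusted only if the read succeeded. */
static void upd_end_read(struct upd_bh *bh, bool io_ok)
{
    assert(bh->locked);
    bh->uptodate = io_ok;
    bh->locked = false;
}

/*
 * Write completion in the proposed scheme: a failed write is recorded in
 * write_eio, and uptodate is left alone because memory still holds the
 * newest copy of the data.
 */
static void upd_end_write(struct upd_bh *bh, bool io_ok)
{
    assert(bh->locked);
    if (!io_ok)
        bh->write_eio = true;
    bh->locked = false;
}
```

With this split, a caller checks write_eio to detect a failed write and can
keep trusting the in-memory contents.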

BH_Mapped
- Buffer has a physical block backing it stored in b_bdev + b_blocknr. This bit
is set by filesystem's get_block() function (or by VFS itself for block device
mappings).

XXX: Some filesystems set BH_Mapped even for buffers that do not really
have the backing block (like buffers for delayed allocation). I think
we should get rid of it...

BH_New
- Buffer is freshly allocated. This flag is usually set by filesystem's
get_block function when it freshly allocates block backing the buffer.
VFS then takes care of calling unmap_underlying_metadata on the buffer
and zeroing out the buffer. When all is done, the flag is cleared. So
this flag should not be seen set after we drop a page lock.

Note that because of unmap_underlying_metadata call, buffer has to be
mapped when BH_New is set. That is part of the reason why some filesystems
map delayed-allocated buffer to some bogus block - they want VFS to do the
zeroing but do not have a real block to map the buffer to yet.
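
The life cycle of the flag can be sketched as a toy userspace model. The
newbh_* names are assumptions, the unmap_underlying_metadata step is not
modelled, and the model zeroes the whole buffer for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NEWBH_BLOCK_SIZE 16

/* Toy model of one buffer. All names are hypothetical. */
struct newbh {
    bool mapped;
    bool is_new;
    unsigned long blocknr;
    unsigned char data[NEWBH_BLOCK_SIZE];
};

/* Filesystem side: a fresh allocation maps the buffer and marks it new. */
static void newbh_get_block(struct newbh *bh, unsigned long blocknr)
{
    bh->blocknr = blocknr;
    bh->mapped = true;
    bh->is_new = true;
}

/*
 * VFS side: a new buffer must be mapped. Zero it so stale disk contents
 * are never exposed, then clear the flag before the page lock is dropped.
 */
static void newbh_vfs_handle_new(struct newbh *bh)
{
    if (!bh->is_new)
        return;
    assert(bh->mapped);
    memset(bh->data, 0, sizeof(bh->data));
    bh->is_new = false;
}
```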

BH_Delay
- Allocation of physical block backing the buffer is delayed. This flag is set
by filesystem's get_block function to mark that filesystem knows that this
buffer needs to get written (usually space is reserved for the buffer) but
it does not have physical block assigned yet - that usually happens when
memory management decides to write out dirty data or we have to write out
the page for other reason (like if fsync has been called).

XXX: Currently, the handling of delayed buffers in VFS is kind of convoluted
because delayed buffers are mapped. If they wouldn't be, VFS wouldn't need
to care about this bit at all.

BH_Unwritten
- Used by a filesystem to mark that although buffer is not dirty, it contains
data different from those on disk. This is usually used by a filesystem to
mark buffers whose backing blocks have not been initialized to zeros, so that
VFS does not load the junk from disk.

XXX: Do we need this flag at all? If filesystem's get_block function just
marked the buffer as uptodate and
a) zeroed it out in the read case
b) marked it as new in the write case (we could zero out the buffer here
as well, which would be cleaner but it would be unnecessary for buffers
to which data will be written immediately afterwards),
it would have exactly the same effect as the BH_Unwritten flag has.

BH_Async_Read
- Buffer is being read from disk. This is used by async reading code. When a
page should be read from disk, all mapped buffers in it are marked with this
flag. When IO on the buffer finishes, end_io handler (end_buffer_async_read)
clears the flag and checks whether all the buffers in the page have the flag
cleared. If so, it marks the page as uptodate and unlocks it.
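
The completion logic can be sketched as a single-threaded userspace model.
The ar_* names are assumptions. The model also assumes that unmapped buffers
are holes that count as uptodate, and that the page becomes uptodate only if
no buffer failed. The real handlers need a lock around the scan, which a
single-threaded model leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define AR_BUFS_PER_PAGE 4

/* Toy model. All names are hypothetical. */
struct ar_buffer {
    bool mapped;
    bool async_read;
    bool uptodate;
};

struct ar_page {
    bool locked;
    bool uptodate;
    struct ar_buffer bufs[AR_BUFS_PER_PAGE];
};

/* No buffer is in flight any more: finish the page and unlock it. */
static void ar_finish_page(struct ar_page *page)
{
    bool all_ok = true;

    for (size_t i = 0; i < AR_BUFS_PER_PAGE; i++)
        if (!page->bufs[i].uptodate)
            all_ok = false;
    if (all_ok)
        page->uptodate = true;
    page->locked = false;
}

/* Lock the page and flag every mapped buffer. Returns the number submitted. */
static size_t ar_start_read(struct ar_page *page)
{
    size_t submitted = 0;

    page->locked = true;
    for (size_t i = 0; i < AR_BUFS_PER_PAGE; i++) {
        struct ar_buffer *bh = &page->bufs[i];

        if (bh->uptodate)
            continue;
        if (bh->mapped) {
            bh->async_read = true;
            submitted++;
        } else {
            bh->uptodate = true; /* assumption: hole, treated as zeros */
        }
    }
    if (submitted == 0)
        ar_finish_page(page);
    return submitted;
}

/* Per-buffer completion: clear the flag, finish the page if we were last. */
static void ar_end_buffer_read(struct ar_page *page, size_t idx, bool io_ok)
{
    assert(idx < AR_BUFS_PER_PAGE);
    assert(page->bufs[idx].async_read);
    page->bufs[idx].uptodate = io_ok;
    page->bufs[idx].async_read = false;
    for (size_t i = 0; i < AR_BUFS_PER_PAGE; i++)
        if (page->bufs[i].async_read)
            return; /* other reads are still in flight */
    ar_finish_page(page);
}
```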

BH_Async_Write
- Buffer is being written to disk. This is used by async writing code. When a
page should be written to disk, all buffers to be written are marked with
this flag. When IO on the buffer finishes, end_io handler (usually
end_buffer_async_write) clears the flag and checks whether all the buffers in
the page have the flag cleared. If so, it ends writeback on the page.

BH_Uptodate_Lock
- Used as bit spinlock by end_buffer_async_read and end_buffer_async_write to
synchronize checking of BH_Async_Read and BH_Async_Write flags.

BH_Boundary
- Set by the filesystem to indicate that the next block on the media is probably
going to contain metadata. The flag is used by code in __mpage_writepage() to
submit the next block on the media for write (if it is dirty) to optimize
writeout pattern in a common case when the layout on disk looks like:
D|D|D|M|D|D|D (where D is a data block and M a block containing metadata
needed to access further data).

BH_Write_EIO
- IO error happened when we tried to write the buffer. This flag is set when
write of the buffer fails. The flag is cleared each time we submit the buffer
for write. The flag is used mainly to pass down the information to the
filesystem. When the buffer with this flag set should be dropped from memory,
we set AS_EIO flag on the mapping this buffer belongs to or on b_assoc_map if
set.
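
The life cycle of the flag can be sketched as a toy userspace model. The
eio_* names are assumptions, and the choice of mapping follows one reading of
the sentence above: use the associated mapping if it is set, otherwise the
mapping the buffer belongs to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model. All names are hypothetical. */
struct eio_mapping {
    bool as_eio;
};

struct eio_bh {
    bool write_eio;
    struct eio_mapping *mapping;   /* mapping the buffer belongs to */
    struct eio_mapping *assoc_map; /* associated mapping, may be NULL */
};

/* Each submission for write starts with a clean error state. */
static void eio_submit_write(struct eio_bh *bh)
{
    bh->write_eio = false;
}

/* A failed write is recorded on the buffer. */
static void eio_end_write(struct eio_bh *bh, bool io_ok)
{
    if (!io_ok)
        bh->write_eio = true;
}

/*
 * Before the buffer is dropped from memory, move the error to a mapping
 * so that it is not lost together with the buffer.
 */
static void eio_drop_buffer(struct eio_bh *bh)
{
    struct eio_mapping *m;

    if (!bh->write_eio)
        return;
    m = bh->assoc_map ? bh->assoc_map : bh->mapping;
    if (m != NULL)
        m->as_eio = true;
}
```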

BH_Ordered
- Buffer is an IO barrier (see Documentation/block/barrier.txt)

BH_Eopnotsupp
- Set when the IO request ended with EOPNOTSUPP. Currently this only happens
when the buffer has been submitted with BH_Ordered bit set and the underlying
device does not support IO barriers. This flag is used to pass the information
down to the filesystems so that they can somehow handle the situation.

BH_Quiet
- Do not print an error message when an error happens. Set when BIO_QUIET bit was set.
XXX: Never cleared?!?
--
Jan Kara <[email protected]>
SUSE Labs, CR


2009-08-26 21:27:00

by Jamie Lokier

Subject: Re: Buffer state bits

Jan Kara wrote:
> BH_Dirty
> - Ideally, this bit should mean "buffer has data that have to be
> written". But it is not quite true. The problem happens when
> someone calls set_page_dirty() on the page to which buffers are
> attached or similarly when buffers are attached to a dirty
> page. Then all buffers attached to the page are marked dirty -
> even those that are beyond end of file which obviously should not
> be written.
>
> When buffer is dirty, the page has to be dirty as well (mark
> buffer dirty takes care of that). It is not necessarily the other
> way around and buffer dirty bit is what ultimately decides whether
> the buffer goes to disk or not.

That last sentence implies page can be dirty while a buffer in the
page is not dirty.

In that case, do buffers beyond the end of file need to be set dirty
by set_page_dirty()? If yes, perhaps the text could explain why.

-- Jamie

2009-08-27 11:14:03

by Jan Kara

Subject: Re: Buffer state bits

On Wed 26-08-09 22:27:00, Jamie Lokier wrote:
> Jan Kara wrote:
> > BH_Dirty
> > - Ideally, this bit should mean "buffer has data that have to be
> > written". But it is not quite true. The problem happens when
> > someone calls set_page_dirty() on the page to which buffers are
> > attached or similarly when buffers are attached to a dirty
> > page. Then all buffers attached to the page are marked dirty -
> > even those that are beyond end of file which obviously should not
> > be written.
> >
> > When buffer is dirty, the page has to be dirty as well (mark
> > buffer dirty takes care of that). It is not necessarily the other
> > way around and buffer dirty bit is what ultimately decides whether
> > the buffer goes to disk or not.
>
> That last sentence implies page can be dirty while a buffer in the
> page is not dirty.
Yes, that happens.

> In that case, do buffers beyond the end of file need to be set dirty
> by set_page_dirty()? If yes, perhaps the text could explain why.
No, they need not. But it's racy to check i_size in set_page_dirty
because we don't hold i_mutex... I'll add some explanation to the
paragraph.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-08-27 12:33:23

by Aneesh Kumar K.V

Subject: Re: Buffer state bits

>
> BH_Mapped
> - Buffer has a physical block backing it stored in b_bdev + b_blocknr. This bit
> is set by filesystem's get_block() function (or by VFS itself for block device
> mappings).
>
> XXX: Some filesystems set BH_Mapped even for buffers that do no really
> have the backing block (like buffers for delayed allocation). I think
> we should get rid of it...
>

Also we don't want get_block to be called multiple times for the same file offset.
__block_prepare_write decides whether to call get_block by looking at the
BH_Mapped flag. That is one of the reasons delay and unwritten buffer_heads
are also marked mapped.


-aneesh

2009-08-27 14:32:15

by Jan Kara

Subject: Re: Buffer state bits

On Thu 27-08-09 18:03:18, Aneesh Kumar K.V wrote:
> >
> > BH_Mapped
> > - Buffer has a physical block backing it stored in b_bdev + b_blocknr. This bit
> > is set by filesystem's get_block() function (or by VFS itself for block device
> > mappings).
> >
> > XXX: Some filesystems set BH_Mapped even for buffers that do no really
> > have the backing block (like buffers for delayed allocation). I think
> > we should get rid of it...
> >
>
> Also we don't want get_block to be called multiple times for the same file offset.
> __block_prepare_write does get_block looking at the BH_Mapped flag. ie one of reason
> delay and unwritten buffer_heads are also marked mapped.
Unwritten buffers should be mapped as they really have the backing block.
So that's fine.
Delayed buffers should not be mapped. We could change the check in
__block_prepare_write to "!buffer_mapped(bh) && !buffer_delay(bh)". But I'd
rather avoid it because what I'd like to do in ext3 is to delay-allocate
mmapped blocks and allocate blocks written via write(2) normally. So I want
ext3_get_block() to be called in block_prepare_write() even for delay
buffers.
If ext4 doesn't want to do anything for delay buffers in get_block() called
from block_prepare_write(), it can just return from the beginning of
ext4_get_block() when it finds the buffer is delay...
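
The two checks and the early return can be sketched as a toy userspace model.
The gb_* names are assumptions, and gb_fs_get_block is a stand-in for a
filesystem callback, not the real ext3 or ext4 code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two state bits involved. All names are hypothetical. */
struct gb_bh {
    bool mapped;
    bool delay;
};

/* The existing check: call get_block for every buffer that is not mapped. */
static bool gb_needs_get_block_current(const struct gb_bh *bh)
{
    return !bh->mapped;
}

/*
 * The alternative check mentioned above: also skip delayed buffers. It
 * would stop a filesystem from allocating for them in the write(2) path.
 */
static bool gb_needs_get_block_alt(const struct gb_bh *bh)
{
    return !bh->mapped && !bh->delay;
}

/*
 * Stand-in filesystem callback that opts out on its own: it returns at
 * the beginning for delayed buffers and allocates for everything else.
 */
static int gb_fs_get_block(struct gb_bh *bh, int *allocations)
{
    if (bh->delay)
        return 0;
    (*allocations)++;
    bh->mapped = true;
    return 0;
}
```

Keeping the existing check leaves the decision to each filesystem: one can
allocate for delayed buffers in get_block, another can return early.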

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR