The block layer checks whether the device write cache is enabled and
suppresses flush bios if it is not, so the journal barrier does not take
effect even if the user specifies the 'barrier=1' mount option. This is
dangerous if the reported write cache state is a false negative, and we
cannot easily detect such a case. So print an info message and add a
query interface to let the sysadmin know that the barrier is suppressed
because the write cache is not enabled.

Signed-off-by: Zhang Yi <[email protected]>
---
 fs/ext4/super.c |  3 +++
 fs/ext4/sysfs.c | 19 +++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7cdd2138c897..916f756ebbca 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5920,6 +5920,9 @@ static int ext4_load_journal(struct super_block *sb,
 
 	if (!(journal->j_flags & JBD2_BARRIER))
 		ext4_msg(sb, KERN_INFO, "barriers disabled");
+	else if (!bdev_write_cache(journal->j_dev))
+		ext4_msg(sb, KERN_INFO, "journal device write cache disabled, "
+			 "barriers suppressed");
 
 	if (!ext4_has_feature_journal_needs_recovery(sb))
 		err = jbd2_journal_wipe(journal, !really_read_only);
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index d233c24ea342..67f619c1202e 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -37,6 +37,7 @@ typedef enum {
 	attr_pointer_string,
 	attr_pointer_atomic,
 	attr_journal_task,
+	attr_journal_barrier,
 } attr_id_t;
 
 typedef enum {
@@ -135,6 +136,20 @@ static ssize_t journal_task_show(struct ext4_sb_info *sbi, char *buf)
 			task_pid_vnr(sbi->s_journal->j_task));
 }
 
+static ssize_t journal_barrier_show(struct ext4_sb_info *sbi, char *buf)
+{
+	journal_t *journal = sbi->s_journal;
+
+	if (!journal)
+		return sysfs_emit(buf, "none\n");
+
+	if (!(journal->j_flags & JBD2_BARRIER))
+		return sysfs_emit(buf, "disabled\n");
+	if (!bdev_write_cache(sbi->s_journal->j_dev))
+		return sysfs_emit(buf, "suppressed\n");
+	return sysfs_emit(buf, "enabled\n");
+}
+
 #define EXT4_ATTR(_name,_mode,_id)					\
 static struct ext4_attr ext4_attr_##_name = {				\
 	.attr = {.name = __stringify(_name), .mode = _mode },		\
@@ -243,6 +258,7 @@ EXT4_RO_ATTR_ES_STRING(last_error_func, s_last_error_func, 32);
 EXT4_ATTR(first_error_time, 0444, first_error_time);
 EXT4_ATTR(last_error_time, 0444, last_error_time);
 EXT4_ATTR(journal_task, 0444, journal_task);
+EXT4_ATTR(journal_barrier, 0444, journal_barrier);
 EXT4_RW_ATTR_SBI_UI(mb_prefetch, s_mb_prefetch);
 EXT4_RW_ATTR_SBI_UI(mb_prefetch_limit, s_mb_prefetch_limit);
 EXT4_RW_ATTR_SBI_UL(last_trim_minblks, s_last_trim_minblks);
@@ -291,6 +307,7 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(first_error_time),
 	ATTR_LIST(last_error_time),
 	ATTR_LIST(journal_task),
+	ATTR_LIST(journal_barrier),
 #ifdef CONFIG_EXT4_DEBUG
 	ATTR_LIST(simulate_fail),
 #endif
@@ -438,6 +455,8 @@ static ssize_t ext4_attr_show(struct kobject *kobj,
 		return print_tstamp(buf, sbi->s_es, s_last_error_time);
 	case attr_journal_task:
 		return journal_task_show(sbi, buf);
+	case attr_journal_barrier:
+		return journal_barrier_show(sbi, buf);
 	}
 
 	return 0;
--
2.31.1
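
The order of checks in journal_barrier_show() can be modelled in userspace
with a minimal sketch. The struct barrier_inputs type and the barrier_state()
function are hypothetical stand-ins introduced only for illustration; they are
not kernel API, and the three booleans merely mirror the conditions the patch
tests (sbi->s_journal, JBD2_BARRIER and bdev_write_cache()).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace stand-ins for the kernel state the patch inspects. */
struct barrier_inputs {
	bool has_journal;	/* models: sbi->s_journal != NULL */
	bool barrier_flag;	/* models: journal->j_flags & JBD2_BARRIER */
	bool write_cache;	/* models: bdev_write_cache(journal->j_dev) */
};

/*
 * Return the string the proposed sysfs file would report, following the
 * same order of checks as journal_barrier_show() in the patch.
 */
const char *barrier_state(const struct barrier_inputs *in)
{
	assert(in != NULL);

	if (!in->has_journal)
		return "none";
	if (!in->barrier_flag)
		return "disabled";
	if (!in->write_cache)
		return "suppressed";
	return "enabled";
}
```

The ordering matters: "suppressed" is only reported when barriers were
requested but the journal device reports no write cache, which is the case the
commit message is concerned about.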
On Thu 24-11-22 21:57:44, Zhang Yi wrote:
> The block layer checks whether the device write cache is enabled and
> suppresses flush bios if it is not, so the journal barrier does not take
> effect even if the user specifies the 'barrier=1' mount option. This is
> dangerous if the reported write cache state is a false negative, and we
> cannot easily detect such a case. So print an info message and add a
> query interface to let the sysadmin know that the barrier is suppressed
> because the write cache is not enabled.
>
> Signed-off-by: Zhang Yi <[email protected]>

Hum, so have you seen a situation when write cache information is incorrect
in the block layer? Does it happen often enough that it warrants extra
sysfs file?

After all you should be able to query what the block layer thinks about the
write cache - you definitely can for SCSI devices, I'm not sure about
others. So you can have a look there. Providing this info in the filesystem
seems like doing it in the wrong layer - I don't see anything jbd2/ext4
specific here...

								Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR
On 2022/11/28 18:11, Jan Kara wrote:
> Hum, so have you seen a situation when write cache information is incorrect
> in the block layer? Does it happen often enough that it warrants extra
> sysfs file?
>

Thanks for the response. Yes, it often happens on some SCSI devices behind a
RAID card: the disks below the RAID card have their write cache enabled, but
the RAID driver declares the write cache disabled when probing, and the RAID
card apparently cannot guarantee that data is written back to the disk medium
on power failure. So the ext4 filesystem will probably be corrupted at the
next startup. It's difficult to tell whether it's a hardware or a software
problem.

I am not familiar with the RAID card, so I don't know why the cache state
is incorrect (maybe it is misconfigured, or there is a firmware bug).

> After all you should be able to query what the block layer thinks about the
> write cache - you definitely can for SCSI devices, I'm not sure about
> others. So you can have a look there. Providing this info in the filesystem
> seems like doing it in the wrong layer - I don't see anything jbd2/ext4
> specific here...
>

Yes, the best way is to figure out the RAID card problem.

This patch is not meant to fix something in ext4. The reason I want to add
this in ext4 is just to give a hint from the fs barrier's point of view: it
shows the barrier's running state at mount time, which could help us narrow
down the cache problem more easily when we find ext4 corruption after a power
failure. Before this patch, we could do that through the SCSI probing info and
/sys/block/sda/queue/write_cache (maybe some others?), but it's not quite
clear.

[ 2.520176] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

[root@localhost ~]# cat /sys/block/sda/queue/write_cache
write back

Besides, the running state info looks harmless. :)

Thanks,
Yi.

On Mon 28-11-22 21:01:07, Zhang Yi wrote:
> Thanks for the response. Yes, it often happens on some SCSI devices behind a
> RAID card: the disks below the RAID card have their write cache enabled, but
> the RAID driver declares the write cache disabled when probing, and the RAID
> card apparently cannot guarantee that data is written back to the disk medium
> on power failure. So the ext4 filesystem will probably be corrupted at the
> next startup. It's difficult to tell whether it's a hardware or a software
> problem.
>
> I am not familiar with the RAID card, so I don't know why the cache state
> is incorrect (maybe it is misconfigured, or there is a firmware bug).

OK, thanks for the info. I believe usually you're expected to disable write
cache on the disks themselves and leave caching to the RAID card. But I'm
not an expert here and it's a bit beside the point anyway ;)

> > After all you should be able to query what the block layer thinks about the
> > write cache - you definitely can for SCSI devices, I'm not sure about
> > others. So you can have a look there. Providing this info in the filesystem
> > seems like doing it in the wrong layer - I don't see anything jbd2/ext4
> > specific here...
> >
>
> Yes, the best way is to figure out the RAID card problem.
>
> This patch is not meant to fix something in ext4. The reason I want to add
> this in ext4 is just to give a hint from the fs barrier's point of view: it
> shows the barrier's running state at mount time, which could help us narrow
> down the cache problem more easily when we find ext4 corruption after a power
> failure. Before this patch, we could do that through the SCSI probing info and
> /sys/block/sda/queue/write_cache (maybe some others?), but it's not quite
> clear.
>
> [ 2.520176] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
> [root@localhost ~]# cat /sys/block/sda/queue/write_cache
> write back

Yes. /sys/block/<device>/queue/write_cache is what you should query to find
whether barriers will be ignored or not. My point is - you need this for
ext4, now if you start using XFS filesystem you'd need similar patch for
XFS and then if you transition to btrfs you'd need this for btrfs as well
and all this duplication is there because you are querying through the
filesystem a property of the underlying block device. So why not ask the
block device directly?

I understand it may be more *convenient* to grab the information from the
filesystem given the infrastructure you have for gathering filesystem
information. But carrying around various sysfs files has its cost as well.

								Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
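
Asking the block device directly, as suggested above, could look roughly like
the following userspace sketch. It assumes the queue/write_cache attribute
reports either "write back" or "write through", as in the output quoted in
this thread; the names wc_mode, parse_write_cache() and query_write_cache()
are hypothetical helpers introduced for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum wc_mode { WC_UNKNOWN, WC_WRITE_THROUGH, WC_WRITE_BACK };

/* Classify the text read from a queue/write_cache sysfs attribute. */
enum wc_mode parse_write_cache(const char *text)
{
	assert(text != NULL);

	if (strncmp(text, "write back", 10) == 0)
		return WC_WRITE_BACK;
	if (strncmp(text, "write through", 13) == 0)
		return WC_WRITE_THROUGH;
	return WC_UNKNOWN;
}

/*
 * Read /sys/block/<disk>/queue/write_cache for the given disk name
 * (for example "sda"). Returns WC_UNKNOWN if the file cannot be read.
 */
enum wc_mode query_write_cache(const char *disk)
{
	char path[256];
	char line[64] = "";
	FILE *fp;
	int n;

	assert(disk != NULL);

	n = snprintf(path, sizeof(path), "/sys/block/%s/queue/write_cache", disk);
	if (n < 0 || (size_t)n >= sizeof(path))
		return WC_UNKNOWN;

	fp = fopen(path, "r");
	if (!fp)
		return WC_UNKNOWN;
	if (!fgets(line, sizeof(line), fp))
		line[0] = '\0';
	fclose(fp);

	return parse_write_cache(line);
}
```

A result of WC_WRITE_THROUGH corresponds to the case discussed in this thread
where the block layer sees no volatile write cache and flushes are dropped,
regardless of which filesystem sits on top of the device.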
On 2022/11/28 23:15, Jan Kara wrote:
> Yes. /sys/block/<device>/queue/write_cache is what you should query to find
> whether barriers will be ignored or not. My point is - you need this for
> ext4, now if you start using XFS filesystem you'd need similar patch for
> XFS and then if you transition to btrfs you'd need this for btrfs as well
> and all this duplication is there because you are querying through the
> filesystem a property of the underlying block device. So why not ask the
> block device directly?
>
> I understand it may be more *convenient* to grab the information from the
> filesystem given the infrastructure you have for gathering filesystem
> information. But carrying around various sysfs files has its cost as well.
>

OK, it's fine, let's keep querying the block layer.

Thanks,
Yi.