2021-09-11 08:51:48

by Ye Bin

[permalink] [raw]
Subject: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time

kmmpd:
...
diff = jiffies - last_update_time;
if (diff > mmp_check_interval * HZ) {
...
As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
than 'mmp_update_interval', so there will never trigger detection.
Introduce last_check_time record previous check time.

Signed-off-by: Ye Bin <[email protected]>
---
fs/ext4/mmp.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
index 12af6dc8457b..c781b09a78c9 100644
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -152,6 +152,7 @@ static int kmmpd(void *data)
int mmp_update_interval = le16_to_cpu(es->s_mmp_update_interval);
unsigned mmp_check_interval;
unsigned long last_update_time;
+ unsigned long last_check_time;
unsigned long diff;
int retval = 0;

@@ -170,6 +171,7 @@ static int kmmpd(void *data)

memcpy(mmp->mmp_nodename, init_utsname()->nodename,
sizeof(mmp->mmp_nodename));
+ last_check_time = jiffies;

while (!kthread_should_stop() && !sb_rdonly(sb)) {
if (!ext4_has_feature_mmp(sb)) {
@@ -198,17 +200,18 @@ static int kmmpd(void *data)
}

diff = jiffies - last_update_time;
- if (diff < mmp_update_interval * HZ)
+ if (diff < mmp_update_interval * HZ) {
schedule_timeout_interruptible(mmp_update_interval *
HZ - diff);
+ diff = jiffies - last_update_time;
+ }

/*
* We need to make sure that more than mmp_check_interval
- * seconds have not passed since writing. If that has happened
- * we need to check if the MMP block is as we left it.
+ * seconds have not passed since check. If that has happened
+ * we need to check if the MMP block is as we write it.
*/
- diff = jiffies - last_update_time;
- if (diff > mmp_check_interval * HZ) {
+ if (jiffies - last_check_time > mmp_check_interval * HZ) {
struct buffer_head *bh_check = NULL;
struct mmp_struct *mmp_check;

@@ -234,6 +237,7 @@ static int kmmpd(void *data)
goto wait_to_exit;
}
put_bh(bh_check);
+ last_check_time = jiffies;
}

/*
--
2.31.1


2021-10-07 12:31:36

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time

On Sat 11-09-21 17:00:55, Ye Bin wrote:
> kmmpd:
> ...
> diff = jiffies - last_update_time;
> if (diff > mmp_check_interval * HZ) {
> ...
> As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
> than 'mmp_update_interval', so there will never trigger detection.
> Introduce last_check_time record previous check time.
>
> Signed-off-by: Ye Bin <[email protected]>

I think the check is there only for the case where write_mmp_block() +
sleep took longer than mmp_check_interval. I agree that should rarely
happen but on a really busy system it is possible and in that case we would
miss updating mmp block for too long and so another node could have started
using the filesystem. I actually don't see a reason why kmmpd should be
checking the block each mmp_check_interval as you do - mmp_check_interval
is just for ext4_multi_mount_protect() to know how long it should wait
before considering mmp block stale... Am I missing something?

Honza

> ---
> fs/ext4/mmp.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
> index 12af6dc8457b..c781b09a78c9 100644
> --- a/fs/ext4/mmp.c
> +++ b/fs/ext4/mmp.c
> @@ -152,6 +152,7 @@ static int kmmpd(void *data)
> int mmp_update_interval = le16_to_cpu(es->s_mmp_update_interval);
> unsigned mmp_check_interval;
> unsigned long last_update_time;
> + unsigned long last_check_time;
> unsigned long diff;
> int retval = 0;
>
> @@ -170,6 +171,7 @@ static int kmmpd(void *data)
>
> memcpy(mmp->mmp_nodename, init_utsname()->nodename,
> sizeof(mmp->mmp_nodename));
> + last_check_time = jiffies;
>
> while (!kthread_should_stop() && !sb_rdonly(sb)) {
> if (!ext4_has_feature_mmp(sb)) {
> @@ -198,17 +200,18 @@ static int kmmpd(void *data)
> }
>
> diff = jiffies - last_update_time;
> - if (diff < mmp_update_interval * HZ)
> + if (diff < mmp_update_interval * HZ) {
> schedule_timeout_interruptible(mmp_update_interval *
> HZ - diff);
> + diff = jiffies - last_update_time;
> + }
>
> /*
> * We need to make sure that more than mmp_check_interval
> - * seconds have not passed since writing. If that has happened
> - * we need to check if the MMP block is as we left it.
> + * seconds have not passed since check. If that has happened
> + * we need to check if the MMP block is as we write it.
> */
> - diff = jiffies - last_update_time;
> - if (diff > mmp_check_interval * HZ) {
> + if (jiffies - last_check_time > mmp_check_interval * HZ) {
> struct buffer_head *bh_check = NULL;
> struct mmp_struct *mmp_check;
>
> @@ -234,6 +237,7 @@ static int kmmpd(void *data)
> goto wait_to_exit;
> }
> put_bh(bh_check);
> + last_check_time = jiffies;
> }
>
> /*
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-10-08 01:57:12

by Ye Bin

[permalink] [raw]
Subject: Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time



On 2021/10/7 20:31, Jan Kara wrote:
> On Sat 11-09-21 17:00:55, Ye Bin wrote:
>> kmmpd:
>> ...
>> diff = jiffies - last_update_time;
>> if (diff > mmp_check_interval * HZ) {
>> ...
>> As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
>> than 'mmp_update_interval', so there will never trigger detection.
>> Introduce last_check_time record previous check time.
>>
>> Signed-off-by: Ye Bin <[email protected]>
> I think the check is there only for the case where write_mmp_block() +
> sleep took longer than mmp_check_interval. I agree that should rarely
> happen but on a really busy system it is possible and in that case we would
> miss updating mmp block for too long and so another node could have started
> using the filesystem. I actually don't see a reason why kmmpd should be
> checking the block each mmp_check_interval as you do - mmp_check_interval
> is just for ext4_multi_mount_protect() to know how long it should wait
> before considering mmp block stale... Am I missing something?
>
> Honza
I'm sorry, I didn't understand the detection mechanism here before. Now
I understand
the detection mechanism here.
As you said, it's just an abnormal protection. There's really no problem.

>> ---
>> fs/ext4/mmp.c | 14 +++++++++-----
>> 1 file changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
>> index 12af6dc8457b..c781b09a78c9 100644
>> --- a/fs/ext4/mmp.c
>> +++ b/fs/ext4/mmp.c
>> @@ -152,6 +152,7 @@ static int kmmpd(void *data)
>> int mmp_update_interval = le16_to_cpu(es->s_mmp_update_interval);
>> unsigned mmp_check_interval;
>> unsigned long last_update_time;
>> + unsigned long last_check_time;
>> unsigned long diff;
>> int retval = 0;
>>
>> @@ -170,6 +171,7 @@ static int kmmpd(void *data)
>>
>> memcpy(mmp->mmp_nodename, init_utsname()->nodename,
>> sizeof(mmp->mmp_nodename));
>> + last_check_time = jiffies;
>>
>> while (!kthread_should_stop() && !sb_rdonly(sb)) {
>> if (!ext4_has_feature_mmp(sb)) {
>> @@ -198,17 +200,18 @@ static int kmmpd(void *data)
>> }
>>
>> diff = jiffies - last_update_time;
>> - if (diff < mmp_update_interval * HZ)
>> + if (diff < mmp_update_interval * HZ) {
>> schedule_timeout_interruptible(mmp_update_interval *
>> HZ - diff);
>> + diff = jiffies - last_update_time;
>> + }
>>
>> /*
>> * We need to make sure that more than mmp_check_interval
>> - * seconds have not passed since writing. If that has happened
>> - * we need to check if the MMP block is as we left it.
>> + * seconds have not passed since check. If that has happened
>> + * we need to check if the MMP block is as we write it.
>> */
>> - diff = jiffies - last_update_time;
>> - if (diff > mmp_check_interval * HZ) {
>> + if (jiffies - last_check_time > mmp_check_interval * HZ) {
>> struct buffer_head *bh_check = NULL;
>> struct mmp_struct *mmp_check;
>>
>> @@ -234,6 +237,7 @@ static int kmmpd(void *data)
>> goto wait_to_exit;
>> }
>> put_bh(bh_check);
>> + last_check_time = jiffies;
>> }
>>
>> /*
>> --
>> 2.31.1
>>

2021-10-08 02:44:05

by Ye Bin

[permalink] [raw]
Subject: Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time



On 2021/10/8 9:56, yebin wrote:
>
>
> On 2021/10/7 20:31, Jan Kara wrote:
>> On Sat 11-09-21 17:00:55, Ye Bin wrote:
>>> kmmpd:
>>> ...
>>> diff = jiffies - last_update_time;
>>> if (diff > mmp_check_interval * HZ) {
>>> ...
>>> As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
>>> than 'mmp_update_interval', so there will never trigger detection.
>>> Introduce last_check_time record previous check time.
>>>
>>> Signed-off-by: Ye Bin <[email protected]>
>> I think the check is there only for the case where write_mmp_block() +
>> sleep took longer than mmp_check_interval. I agree that should rarely
>> happen but on a really busy system it is possible and in that case we
>> would
>> miss updating mmp block for too long and so another node could have
>> started
>> using the filesystem. I actually don't see a reason why kmmpd should be
>> checking the block each mmp_check_interval as you do -
>> mmp_check_interval
>> is just for ext4_multi_mount_protect() to know how long it should wait
>> before considering mmp block stale... Am I missing something?
>>
>> Honza
> I'm sorry, I didn't understand the detection mechanism here before.
> Now I understand
> the detection mechanism here.
> As you said, it's just an abnormal protection. There's really no problem.
>
Yeah, i did test as following steps
hostA hostB
mount
ext4_multi_mount_protect -> seq == EXT4_MMP_SEQ_CLEAN
delay 5s after label "skip" so hostB will see seq is
EXT4_MMP_SEQ_CLEAN
mount
ext4_multi_mount_protect -> seq ==
EXT4_MMP_SEQ_CLEAN
run kmmpd
run kmmpd

Actually,in this situation kmmpd will not detect confliction.
In ext4_multi_mount_protect function we write mmp data fisrt and wait
'wait_time * HZ' seconds,
read mmp data do check.Most of the time, If 'wait_time' is zero, it can
pass check.
>>> ---
>>> fs/ext4/mmp.c | 14 +++++++++-----
>>> 1 file changed, 9 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
>>> index 12af6dc8457b..c781b09a78c9 100644
>>> --- a/fs/ext4/mmp.c
>>> +++ b/fs/ext4/mmp.c
>>> @@ -152,6 +152,7 @@ static int kmmpd(void *data)
>>> int mmp_update_interval = le16_to_cpu(es->s_mmp_update_interval);
>>> unsigned mmp_check_interval;
>>> unsigned long last_update_time;
>>> + unsigned long last_check_time;
>>> unsigned long diff;
>>> int retval = 0;
>>> @@ -170,6 +171,7 @@ static int kmmpd(void *data)
>>> memcpy(mmp->mmp_nodename, init_utsname()->nodename,
>>> sizeof(mmp->mmp_nodename));
>>> + last_check_time = jiffies;
>>> while (!kthread_should_stop() && !sb_rdonly(sb)) {
>>> if (!ext4_has_feature_mmp(sb)) {
>>> @@ -198,17 +200,18 @@ static int kmmpd(void *data)
>>> }
>>> diff = jiffies - last_update_time;
>>> - if (diff < mmp_update_interval * HZ)
>>> + if (diff < mmp_update_interval * HZ) {
>>> schedule_timeout_interruptible(mmp_update_interval *
>>> HZ - diff);
>>> + diff = jiffies - last_update_time;
>>> + }
>>> /*
>>> * We need to make sure that more than mmp_check_interval
>>> - * seconds have not passed since writing. If that has happened
>>> - * we need to check if the MMP block is as we left it.
>>> + * seconds have not passed since check. If that has happened
>>> + * we need to check if the MMP block is as we write it.
>>> */
>>> - diff = jiffies - last_update_time;
>>> - if (diff > mmp_check_interval * HZ) {
>>> + if (jiffies - last_check_time > mmp_check_interval * HZ) {
>>> struct buffer_head *bh_check = NULL;
>>> struct mmp_struct *mmp_check;
>>> @@ -234,6 +237,7 @@ static int kmmpd(void *data)
>>> goto wait_to_exit;
>>> }
>>> put_bh(bh_check);
>>> + last_check_time = jiffies;
>>> }
>>> /*
>>> --
>>> 2.31.1
>>>
>

2021-10-12 08:49:18

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time

On Fri 08-10-21 10:38:31, yebin wrote:
> On 2021/10/8 9:56, yebin wrote:
> > On 2021/10/7 20:31, Jan Kara wrote:
> > > On Sat 11-09-21 17:00:55, Ye Bin wrote:
> > > > kmmpd:
> > > > ...
> > > > diff = jiffies - last_update_time;
> > > > if (diff > mmp_check_interval * HZ) {
> > > > ...
> > > > As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
> > > > than 'mmp_update_interval', so there will never trigger detection.
> > > > Introduce last_check_time record previous check time.
> > > >
> > > > Signed-off-by: Ye Bin <[email protected]>
> > > I think the check is there only for the case where write_mmp_block() +
> > > sleep took longer than mmp_check_interval. I agree that should rarely
> > > happen but on a really busy system it is possible and in that case
> > > we would
> > > miss updating mmp block for too long and so another node could have
> > > started
> > > using the filesystem. I actually don't see a reason why kmmpd should be
> > > checking the block each mmp_check_interval as you do -
> > > mmp_check_interval
> > > is just for ext4_multi_mount_protect() to know how long it should wait
> > > before considering mmp block stale... Am I missing something?
> > >
> > > Honza
> > I'm sorry, I didn't understand the detection mechanism here before. Now
> > I understand
> > the detection mechanism here.
> > As you said, it's just an abnormal protection. There's really no problem.
> >
> Yeah, i did test as following steps
> hostA hostB
> mount
> ext4_multi_mount_protect -> seq == EXT4_MMP_SEQ_CLEAN
> delay 5s after label "skip" so hostB will see seq is
> EXT4_MMP_SEQ_CLEAN
> mount
> ext4_multi_mount_protect -> seq == EXT4_MMP_SEQ_CLEAN
> run kmmpd
> run kmmpd
>
> Actually,in this situation kmmpd will not detect confliction.
> In ext4_multi_mount_protect function we write mmp data first and wait
> 'wait_time * HZ' seconds,
> read mmp data do check. Most of the time, If 'wait_time' is zero, it can pass
> check.

But how can be wait_time zero? As far as I'm reading the code, wait_time
must be at least EXT4_MMP_MIN_CHECK_INTERVAL...

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-10-12 11:47:20

by Ye Bin

[permalink] [raw]
Subject: Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time



On 2021/10/12 16:47, Jan Kara wrote:
> On Fri 08-10-21 10:38:31, yebin wrote:
>> On 2021/10/8 9:56, yebin wrote:
>>> On 2021/10/7 20:31, Jan Kara wrote:
>>>> On Sat 11-09-21 17:00:55, Ye Bin wrote:
>>>>> kmmpd:
>>>>> ...
>>>>> diff = jiffies - last_update_time;
>>>>> if (diff > mmp_check_interval * HZ) {
>>>>> ...
>>>>> As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
>>>>> than 'mmp_update_interval', so there will never trigger detection.
>>>>> Introduce last_check_time record previous check time.
>>>>>
>>>>> Signed-off-by: Ye Bin <[email protected]>
>>>> I think the check is there only for the case where write_mmp_block() +
>>>> sleep took longer than mmp_check_interval. I agree that should rarely
>>>> happen but on a really busy system it is possible and in that case
>>>> we would
>>>> miss updating mmp block for too long and so another node could have
>>>> started
>>>> using the filesystem. I actually don't see a reason why kmmpd should be
>>>> checking the block each mmp_check_interval as you do -
>>>> mmp_check_interval
>>>> is just for ext4_multi_mount_protect() to know how long it should wait
>>>> before considering mmp block stale... Am I missing something?
>>>>
>>>> Honza
>>> I'm sorry, I didn't understand the detection mechanism here before. Now
>>> I understand
>>> the detection mechanism here.
>>> As you said, it's just an abnormal protection. There's really no problem.
>>>
>> Yeah, i did test as following steps
>> hostA hostB
>> mount
>> ext4_multi_mount_protect -> seq == EXT4_MMP_SEQ_CLEAN
>> delay 5s after label "skip" so hostB will see seq is
>> EXT4_MMP_SEQ_CLEAN
>> mount
>> ext4_multi_mount_protect -> seq == EXT4_MMP_SEQ_CLEAN
>> run kmmpd
>> run kmmpd
>>
>> Actually,in this situation kmmpd will not detect confliction.
>> In ext4_multi_mount_protect function we write mmp data first and wait
>> 'wait_time * HZ' seconds,
>> read mmp data do check. Most of the time, If 'wait_time' is zero, it can pass
>> check.
> But how can be wait_time zero? As far as I'm reading the code, wait_time
> must be at least EXT4_MMP_MIN_CHECK_INTERVAL...
>
> Honza
int ext4_multi_mount_protect(struct super_block *sb,
ext4_fsblk_t mmp_block)
{
struct ext4_super_block *es = EXT4_SB(sb)->s_es;
struct buffer_head *bh = NULL;
struct mmp_struct *mmp = NULL;
u32 seq;
unsigned int mmp_check_interval =
le16_to_cpu(es->s_mmp_update_interval);
unsigned int wait_time = 0; --> wait_time
is equal with zero
int retval;

if (mmp_block < le32_to_cpu(es->s_first_data_block) ||
mmp_block >= ext4_blocks_count(es)) {
ext4_warning(sb, "Invalid MMP block in superblock");
goto failed;
}

retval = read_mmp_block(sb, &bh, mmp_block);
if (retval)
goto failed;

mmp = (struct mmp_struct *)(bh->b_data);

if (mmp_check_interval < EXT4_MMP_MIN_CHECK_INTERVAL)
mmp_check_interval = EXT4_MMP_MIN_CHECK_INTERVAL;

/*
* If check_interval in MMP block is larger, use that instead of
* update_interval from the superblock.
*/
if (le16_to_cpu(mmp->mmp_check_interval) > mmp_check_interval)
mmp_check_interval = le16_to_cpu(mmp->mmp_check_interval);

seq = le32_to_cpu(mmp->mmp_seq);
if (seq == EXT4_MMP_SEQ_CLEAN) --> If hostA and hostB mount
the same block device at the same time,
--> HostA and hostB maybe get 'seq' with the same value
EXT4_MMP_SEQ_CLEAN.
goto skip;
...
skip:
/*
* write a new random sequence number.
*/
seq = mmp_new_seq();
mmp->mmp_seq = cpu_to_le32(seq);

retval = write_mmp_block(sb, bh);
if (retval)
goto failed;

/*
* wait for MMP interval and check mmp_seq.
*/
if (schedule_timeout_interruptible(HZ * wait_time) != 0)
{ --> If seq is equal with EXT4_MMP_SEQ_CLEAN, wait_time is zero.
ext4_warning(sb, "MMP startup interrupted, failing mount");
goto failed;
}

retval = read_mmp_block(sb, &bh, mmp_block); -->We may get the
same data with which we wrote, so we can't detect conflict at here.
if (retval)
goto failed;
mmp = (struct mmp_struct *)(bh->b_data);
if (seq != le32_to_cpu(mmp->mmp_seq)) {
dump_mmp_msg(sb, mmp,
"Device is already active on another node.");
goto failed;
}
...
}

2021-10-13 09:39:15

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time

On Tue 12-10-21 19:46:24, yebin wrote:
> On 2021/10/12 16:47, Jan Kara wrote:
> > On Fri 08-10-21 10:38:31, yebin wrote:
> > > On 2021/10/8 9:56, yebin wrote:
> > > > On 2021/10/7 20:31, Jan Kara wrote:
> > > > > On Sat 11-09-21 17:00:55, Ye Bin wrote:
> > > > > > kmmpd:
> > > > > > ...
> > > > > > diff = jiffies - last_update_time;
> > > > > > if (diff > mmp_check_interval * HZ) {
> > > > > > ...
> > > > > > As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
> > > > > > than 'mmp_update_interval', so there will never trigger detection.
> > > > > > Introduce last_check_time record previous check time.
> > > > > >
> > > > > > Signed-off-by: Ye Bin <[email protected]>
> > > > > I think the check is there only for the case where write_mmp_block() +
> > > > > sleep took longer than mmp_check_interval. I agree that should rarely
> > > > > happen but on a really busy system it is possible and in that case
> > > > > we would
> > > > > miss updating mmp block for too long and so another node could have
> > > > > started
> > > > > using the filesystem. I actually don't see a reason why kmmpd should be
> > > > > checking the block each mmp_check_interval as you do -
> > > > > mmp_check_interval
> > > > > is just for ext4_multi_mount_protect() to know how long it should wait
> > > > > before considering mmp block stale... Am I missing something?
> > > > >
> > > > > Honza
> > > > I'm sorry, I didn't understand the detection mechanism here before. Now
> > > > I understand
> > > > the detection mechanism here.
> > > > As you said, it's just an abnormal protection. There's really no problem.
> > > >
> > > Yeah, i did test as following steps
> > > hostA hostB
> > > mount
> > > ext4_multi_mount_protect -> seq == EXT4_MMP_SEQ_CLEAN
> > > delay 5s after label "skip" so hostB will see seq is
> > > EXT4_MMP_SEQ_CLEAN
> > > mount
> > > ext4_multi_mount_protect -> seq == EXT4_MMP_SEQ_CLEAN
> > > run kmmpd
> > > run kmmpd
> > >
> > > Actually,in this situation kmmpd will not detect confliction.
> > > In ext4_multi_mount_protect function we write mmp data first and wait
> > > 'wait_time * HZ' seconds,
> > > read mmp data do check. Most of the time, If 'wait_time' is zero, it can pass
> > > check.
> > But how can be wait_time zero? As far as I'm reading the code, wait_time
> > must be at least EXT4_MMP_MIN_CHECK_INTERVAL...
> >
> > Honza
> int ext4_multi_mount_protect(struct super_block *sb,
> ext4_fsblk_t mmp_block)
> {
> struct ext4_super_block *es = EXT4_SB(sb)->s_es;
> struct buffer_head *bh = NULL;
> struct mmp_struct *mmp = NULL;
> u32 seq;
> unsigned int mmp_check_interval =
> le16_to_cpu(es->s_mmp_update_interval);
> unsigned int wait_time = 0; --> wait_time is
> equal with zero
> int retval;
>
> if (mmp_block < le32_to_cpu(es->s_first_data_block) ||
> mmp_block >= ext4_blocks_count(es)) {
> ext4_warning(sb, "Invalid MMP block in superblock");
> goto failed;
> }
>
> retval = read_mmp_block(sb, &bh, mmp_block);
> if (retval)
> goto failed;
>
> mmp = (struct mmp_struct *)(bh->b_data);
>
> if (mmp_check_interval < EXT4_MMP_MIN_CHECK_INTERVAL)
> mmp_check_interval = EXT4_MMP_MIN_CHECK_INTERVAL;
>
> /*
> * If check_interval in MMP block is larger, use that instead of
> * update_interval from the superblock.
> */
> if (le16_to_cpu(mmp->mmp_check_interval) > mmp_check_interval)
> mmp_check_interval = le16_to_cpu(mmp->mmp_check_interval);
>
> seq = le32_to_cpu(mmp->mmp_seq);
> if (seq == EXT4_MMP_SEQ_CLEAN) --> If hostA and hostB mount the
> same block device at the same time,
> --> HostA and hostB maybe get 'seq' with the same value EXT4_MMP_SEQ_CLEAN.
> goto skip;

Oh, I see. Thanks for explanation.

> ...
> skip:
> /*
> * write a new random sequence number.
> */
> seq = mmp_new_seq();
> mmp->mmp_seq = cpu_to_le32(seq);
>
> retval = write_mmp_block(sb, bh);
> if (retval)
> goto failed;
>
> /*
> * wait for MMP interval and check mmp_seq.
> */
> if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
> --> If seq is equal with EXT4_MMP_SEQ_CLEAN, wait_time is zero.
> ext4_warning(sb, "MMP startup interrupted, failing mount");
> goto failed;
> }
>
> retval = read_mmp_block(sb, &bh, mmp_block); -->We may get the same
> data with which we wrote, so we can't detect conflict at here.

OK, I see. So the race in ext4_multi_mount_protect() goes like:

hostA hostB

read_mmp_block() read_mmp_block()
- sees EXT4_MMP_SEQ_CLEAN - sees EXT4_MMP_SEQ_CLEAN
write_mmp_block()
wait_time == 0 -> no wait
read_mmp_block()
- all OK, mount
write_mmp_block()
wait_time == 0 -> no wait
read_mmp_block()
- all OK, mount

Do I get it right? Actually, if we passed seq we wrote in
ext4_multi_mount_protect() to kmmpd (probably in sb), then kmmpd would
notice the conflict on its first invocation but still that would be a bit
late because there would be a time window where hostA and hostB would be
both using the fs.

We could reduce the likelyhood of this race by always waiting in
ext4_multi_mount_protect() between write & read but I guess that is
undesirable as it would slow down all clean mounts. Ted?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-10-13 12:34:20

by Ye Bin

[permalink] [raw]
Subject: Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time



On 2021/10/13 17:38, Jan Kara wrote:
> On Tue 12-10-21 19:46:24, yebin wrote:
>> On 2021/10/12 16:47, Jan Kara wrote:
>>> On Fri 08-10-21 10:38:31, yebin wrote:
>>>> On 2021/10/8 9:56, yebin wrote:
>>>>> On 2021/10/7 20:31, Jan Kara wrote:
>>>>>> On Sat 11-09-21 17:00:55, Ye Bin wrote:
>>>>>>> kmmpd:
>>>>>>> ...
>>>>>>> diff = jiffies - last_update_time;
>>>>>>> if (diff > mmp_check_interval * HZ) {
>>>>>>> ...
>>>>>>> As "mmp_check_interval = 2 * mmp_update_interval", 'diff' always little
>>>>>>> than 'mmp_update_interval', so there will never trigger detection.
>>>>>>> Introduce last_check_time record previous check time.
>>>>>>>
>>>>>>> Signed-off-by: Ye Bin <[email protected]>
>>>>>> I think the check is there only for the case where write_mmp_block() +
>>>>>> sleep took longer than mmp_check_interval. I agree that should rarely
>>>>>> happen but on a really busy system it is possible and in that case
>>>>>> we would
>>>>>> miss updating mmp block for too long and so another node could have
>>>>>> started
>>>>>> using the filesystem. I actually don't see a reason why kmmpd should be
>>>>>> checking the block each mmp_check_interval as you do -
>>>>>> mmp_check_interval
>>>>>> is just for ext4_multi_mount_protect() to know how long it should wait
>>>>>> before considering mmp block stale... Am I missing something?
>>>>>>
>>>>>> Honza
>>>>> I'm sorry, I didn't understand the detection mechanism here before. Now
>>>>> I understand
>>>>> the detection mechanism here.
>>>>> As you said, it's just an abnormal protection. There's really no problem.
>>>>>
>>>> Yeah, i did test as following steps
>>>> hostA hostB
>>>> mount
>>>> ext4_multi_mount_protect -> seq == EXT4_MMP_SEQ_CLEAN
>>>> delay 5s after label "skip" so hostB will see seq is
>>>> EXT4_MMP_SEQ_CLEAN
>>>> mount
>>>> ext4_multi_mount_protect -> seq == EXT4_MMP_SEQ_CLEAN
>>>> run kmmpd
>>>> run kmmpd
>>>>
>>>> Actually,in this situation kmmpd will not detect confliction.
>>>> In ext4_multi_mount_protect function we write mmp data first and wait
>>>> 'wait_time * HZ' seconds,
>>>> read mmp data do check. Most of the time, If 'wait_time' is zero, it can pass
>>>> check.
>>> But how can be wait_time zero? As far as I'm reading the code, wait_time
>>> must be at least EXT4_MMP_MIN_CHECK_INTERVAL...
>>>
>>> Honza
>> int ext4_multi_mount_protect(struct super_block *sb,
>> ext4_fsblk_t mmp_block)
>> {
>> struct ext4_super_block *es = EXT4_SB(sb)->s_es;
>> struct buffer_head *bh = NULL;
>> struct mmp_struct *mmp = NULL;
>> u32 seq;
>> unsigned int mmp_check_interval =
>> le16_to_cpu(es->s_mmp_update_interval);
>> unsigned int wait_time = 0; --> wait_time is
>> equal with zero
>> int retval;
>>
>> if (mmp_block < le32_to_cpu(es->s_first_data_block) ||
>> mmp_block >= ext4_blocks_count(es)) {
>> ext4_warning(sb, "Invalid MMP block in superblock");
>> goto failed;
>> }
>>
>> retval = read_mmp_block(sb, &bh, mmp_block);
>> if (retval)
>> goto failed;
>>
>> mmp = (struct mmp_struct *)(bh->b_data);
>>
>> if (mmp_check_interval < EXT4_MMP_MIN_CHECK_INTERVAL)
>> mmp_check_interval = EXT4_MMP_MIN_CHECK_INTERVAL;
>>
>> /*
>> * If check_interval in MMP block is larger, use that instead of
>> * update_interval from the superblock.
>> */
>> if (le16_to_cpu(mmp->mmp_check_interval) > mmp_check_interval)
>> mmp_check_interval = le16_to_cpu(mmp->mmp_check_interval);
>>
>> seq = le32_to_cpu(mmp->mmp_seq);
>> if (seq == EXT4_MMP_SEQ_CLEAN) --> If hostA and hostB mount the
>> same block device at the same time,
>> --> HostA and hostB maybe get 'seq' with the same value EXT4_MMP_SEQ_CLEAN.
>> goto skip;
> Oh, I see. Thanks for explanation.
>
>> ...
>> skip:
>> /*
>> * write a new random sequence number.
>> */
>> seq = mmp_new_seq();
>> mmp->mmp_seq = cpu_to_le32(seq);
>>
>> retval = write_mmp_block(sb, bh);
>> if (retval)
>> goto failed;
>>
>> /*
>> * wait for MMP interval and check mmp_seq.
>> */
>> if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
>> --> If seq is equal with EXT4_MMP_SEQ_CLEAN, wait_time is zero.
>> ext4_warning(sb, "MMP startup interrupted, failing mount");
>> goto failed;
>> }
>>
>> retval = read_mmp_block(sb, &bh, mmp_block); -->We may get the same
>> data with which we wrote, so we can't detect conflict at here.
> OK, I see. So the race in ext4_multi_mount_protect() goes like:
>
> hostA hostB
>
> read_mmp_block() read_mmp_block()
> - sees EXT4_MMP_SEQ_CLEAN - sees EXT4_MMP_SEQ_CLEAN
> write_mmp_block()
> wait_time == 0 -> no wait
> read_mmp_block()
> - all OK, mount
> write_mmp_block()
> wait_time == 0 -> no wait
> read_mmp_block()
> - all OK, mount
Yes, that's what i mean.
>
> Do I get it right? Actually, if we passed seq we wrote in
> ext4_multi_mount_protect() to kmmpd (probably in sb), then kmmpd would
> notice the conflict on its first invocation but still that would be a bit
> late because there would be a time window where hostA and hostB would be
> both using the fs.
>
> We could reduce the likelyhood of this race by always waiting in
> ext4_multi_mount_protect() between write & read but I guess that is
> undesirable as it would slow down all clean mounts. Ted?
>
> Honza

2021-10-13 21:42:02

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time

On Wed, Oct 13, 2021 at 11:38:47AM +0200, Jan Kara wrote:
>
> OK, I see. So the race in ext4_multi_mount_protect() goes like:
>
> hostA hostB
>
> read_mmp_block() read_mmp_block()
> - sees EXT4_MMP_SEQ_CLEAN - sees EXT4_MMP_SEQ_CLEAN
> write_mmp_block()
> wait_time == 0 -> no wait
> read_mmp_block()
> - all OK, mount
> write_mmp_block()
> wait_time == 0 -> no wait
> read_mmp_block()
> - all OK, mount
>
> Do I get it right? Actually, if we passed seq we wrote in
> ext4_multi_mount_protect() to kmmpd (probably in sb), then kmmpd would
> notice the conflict on its first invocation but still that would be a bit
> late because there would be a time window where hostA and hostB would be
> both using the fs.
>
> We could reduce the likelyhood of this race by always waiting in
> ext4_multi_mount_protect() between write & read but I guess that is
> undesirable as it would slow down all clean mounts. Ted?

I'd like Andreas to comment here. My understanding is that MMP
originally intended as a safety mechanism which would be used as part
of a primary/backup high availability system, but not as the *primary*
system where you might try to have two servers simultaneously try to
mount the file system and use MMP as the "election" mechanism to
decide which server is going to be the primary system, and which would
be the backup system.

The cost of being able to handle this particular race is it would slow
down the mounts of cleanly unmounted systems.

There *are* better systems to implement leader elections[1] than using
MMP. Most of these more efficient leader elections assume that you
have a working IP network, and so if you have a separate storage
network (including a shared SCSI bus) from your standard IP network,
then MMP is a useful failsafe in the face of a network partition of
your IP network. The question is whether MMP should be useful for
more than that. And if it isn't, then we should probably document
what MMP is and isn't good for, and give advice in the form of an
application note for how MMP should be used in the context of a larger
system.

[1] https://en.wikipedia.org/wiki/Leader_election

- Ted

2021-10-15 08:17:47

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time

On Oct 13, 2021, at 3:41 PM, Theodore Ts'o <[email protected]> wrote:
>
> On Wed, Oct 13, 2021 at 11:38:47AM +0200, Jan Kara wrote:
>>
>> OK, I see. So the race in ext4_multi_mount_protect() goes like:
>>
>> hostA hostB
>>
>> read_mmp_block() read_mmp_block()
>> - sees EXT4_MMP_SEQ_CLEAN - sees EXT4_MMP_SEQ_CLEAN
>> write_mmp_block()
>> wait_time == 0 -> no wait
>> read_mmp_block()
>> - all OK, mount
>> write_mmp_block()
>> wait_time == 0 -> no wait
>> read_mmp_block()
>> - all OK, mount
>>
>> Do I get it right? Actually, if we passed seq we wrote in
>> ext4_multi_mount_protect() to kmmpd (probably in sb), then kmmpd would
>> notice the conflict on its first invocation but still that would be a bit
>> late because there would be a time window where hostA and hostB would be
>> both using the fs.

It would be enough to have even a short delay between write and read to
detect this case. I _thought_ there should be a delay in this case,
but maybe it was removed after the patch was originally submitted?

>> We could reduce the likelyhood of this race by always waiting in
>> ext4_multi_mount_protect() between write & read but I guess that is
>> undesirable as it would slow down all clean mounts. Ted?
>
> I'd like Andreas to comment here. My understanding is that MMP
> originally intended as a safety mechanism which would be used as part
> of a primary/backup high availability system, but not as the *primary*
> system where you might try to have two servers simultaneously try to
> mount the file system and use MMP as the "election" mechanism to
> decide which server is going to be the primary system, and which would
> be the backup system.
>
> The cost of being able to handle this particular race is it would slow
> down the mounts of cleanly unmounted systems.

Ted's understanding is correct - MMP is intended to be a backup mechanism
to prevent filesystem corruption in the case where external HA methods
do the wrong thing. This has avoided problems countless times on systems
with multi-port access to the same storage, and can also be useful in the
case of shared VM images accessed over the network, and similar.

When MMP was implemented for ZFS, a slightly different mechanism was used.
Rather than having the delay to detect concurrent mounts, it instead writes
to multiple different blocks in a random order, and then reads them all.
If two nodes try to mount the filesystem concurrently, they would pick
different block orders, and the chance of them having the same order (and
one clobbering all of the blocks of the other) would be 1/2^num_blocks.
The drawback is that this would consume more space in the filesystem, but
it wouldn't be a huge deal these days.

> There *are* better systems to implement leader elections[1] than using
> MMP. Most of these more efficient leader elections assume that you
> have a working IP network, and so if you have a separate storage
> network (including a shared SCSI bus) from your standard IP network,
> then MMP is a useful failsafe in the face of a network partition of
> your IP network. The question is whether MMP should be useful for
> more than that. And if it isn't, then we should probably document
> what MMP is and isn't good for, and give advice in the form of an
> application note for how MMP should be used in the context of a larger
> system.

One of the existing failure cases with HA that MMP detects is loss of
network connection, so I wouldn't want to depend on that.

Cheers, Andreas






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP

2021-10-15 08:17:48

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH -next v2 2/6] ext4: introduce last_check_time record previous check time

On Oct 13, 2021, at 3:41 PM, Theodore Ts'o <[email protected]> wrote:
>
> On Wed, Oct 13, 2021 at 11:38:47AM +0200, Jan Kara wrote:
>>
>> OK, I see. So the race in ext4_multi_mount_protect() goes like:
>>
>> hostA hostB
>>
>> read_mmp_block() read_mmp_block()
>> - sees EXT4_MMP_SEQ_CLEAN - sees EXT4_MMP_SEQ_CLEAN
>> write_mmp_block()
>> wait_time == 0 -> no wait
>> read_mmp_block()
>> - all OK, mount
>> write_mmp_block()
>> wait_time == 0 -> no wait
>> read_mmp_block()
>> - all OK, mount
>>
>> Do I get it right? Actually, if we passed seq we wrote in
>> ext4_multi_mount_protect() to kmmpd (probably in sb), then kmmpd would
>> notice the conflict on its first invocation but still that would be a bit
>> late because there would be a time window where hostA and hostB would be
>> both using the fs.

It would be enough to have even a short delay between write and read to
detect this case. I _thought_ there should be a delay in this case,
but maybe it was removed after the patch was originally submitted?

>> We could reduce the likelyhood of this race by always waiting in
>> ext4_multi_mount_protect() between write & read but I guess that is
>> undesirable as it would slow down all clean mounts. Ted?
>
> I'd like Andreas to comment here. My understanding is that MMP
> originally intended as a safety mechanism which would be used as part
> of a primary/backup high availability system, but not as the *primary*
> system where you might try to have two servers simultaneously try to
> mount the file system and use MMP as the "election" mechanism to
> decide which server is going to be the primary system, and which would
> be the backup system.
>
> The cost of being able to handle this particular race is it would slow
> down the mounts of cleanly unmounted systems.

Ted's understanding is correct - MMP is intended to be a backup mechanism
to prevent filesystem corruption in the case where external HA methods
do the wrong thing. This has avoided problems countless times on systems
with multi-port access to the same storage, and can also be useful in the
case of shared VM images accessed over the network, and similar.

When MMP was implemented for ZFS, a slightly different mechanism was used.
Rather than having the delay to detect concurrent mounts, it instead writes
to multiple different blocks in a random order, and then reads them all.
If two nodes try to mount the filesystem concurrently, they would pick
different block orders, and the chance of them having the same order (and
one clobbering all of the blocks of the other) would be 1/2^num_blocks.
The drawback is that this would consume more space in the filesystem, but
it wouldn't be a huge deal these days.

> There *are* better systems to implement leader elections[1] than using
> MMP. Most of these more efficient leader elections assume that you
> have a working IP network, and so if you have a separate storage
> network (including a shared SCSI bus) from your standard IP network,
> then MMP is a useful failsafe in the face of a network partition of
> your IP network. The question is whether MMP should be useful for
> more than that. And if it isn't, then we should probably document
> what MMP is and isn't good for, and give advice in the form of an
> application note for how MMP should be used in the context of a larger
> system.

One of the existing failure cases with HA that MMP detects is loss of
network connection, so I wouldn't want to depend on that.

Cheers, Andreas






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP