2019-02-22 00:58:43

by NeilBrown

Subject: Another OPEN / OPEN_DOWNGRADE race


Hi,
I have a report of an NFSv4.1 state management problem in our 4.4
kernel which appears to be caused by a race between OPEN and
OPEN_DOWNGRADE, and I don't think the race is fixed in mainline.

The test program creates multiple threads which open a file O_RDWR and
write to it, then open it O_RDONLY and read from it.
A network trace showing the problem suggests that, at a point in time
where there are some O_RDONLY opens and one O_RDWR open, a new
O_RDWR open is requested just as the existing O_RDWR open is being
closed.

The close of the O_RDWR open decrements state->n_rdwr (to zero) early,
so can_open_cached() fails for the new O_RDWR open and it has to go to
the server. That open doesn't increment n_rdwr until after it
succeeds, so nfs4_close_prepare() sees
n_rdwr == 0
n_rdonly > 0
NFS_O_RDWR_STATE and NFS_O_RDONLY_STATE set
which causes it to choose an OPEN_DOWNGRADE.
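
For reference, the decision in nfs4_close_prepare() boils down to
something like the following - a simplified sketch with approximate
names, not the actual kernel code (the real function also takes the
state owner's lock and handles a few more corner cases):

  #include <stdbool.h>

  struct toy_nfs4_state {
      int n_rdonly, n_wronly, n_rdwr;              /* opens per mode */
      bool rdonly_state, wronly_state, rdwr_state; /* NFS_O_*_STATE bits */
  };

  enum close_op { DO_NOTHING, DO_CLOSE, DO_OPEN_DOWNGRADE };

  static enum close_op choose_close_op(const struct toy_nfs4_state *s)
  {
      int fmode = 0;                    /* share modes still in use */

      if (s->n_rdonly || s->n_rdwr)
          fmode |= 1;                   /* FMODE_READ  */
      if (s->n_wronly || s->n_rdwr)
          fmode |= 2;                   /* FMODE_WRITE */

      if (fmode == 0)
          /* nothing left open locally; CLOSE if the server still
           * holds any share mode for us */
          return (s->rdonly_state || s->wronly_state || s->rdwr_state)
                  ? DO_CLOSE : DO_NOTHING;

      /* a mode the server still knows about (state bit set) is no
       * longer backed by any local open: shed it via OPEN_DOWNGRADE */
      if ((s->rdwr_state && s->n_rdwr == 0) ||
          (s->wronly_state && !(fmode & 2)) ||
          (s->rdonly_state && !(fmode & 1)))
          return DO_OPEN_DOWNGRADE;

      return DO_NOTHING;                /* server already matches */
  }

With n_rdwr == 0, n_rdonly > 0 and both state bits set, this picks
DO_OPEN_DOWNGRADE to share-read, which is exactly what goes out on the
wire while the new OPEN is already in flight.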

What we see is an OPEN/share-all and an OPEN_DOWNGRADE/share-read
request sent one after the other without waiting for a reply.
The OPEN is processed first, then the OPEN_DOWNGRADE, resulting in a
state that only allows reads. A WRITE is then attempted, which fails,
and the client enters an infinite loop alternating between two steps:
- a WRITE gets NFS4ERR_OPENMODE
- a TEST_STATEID succeeds

Once an OPEN/share-all request has been sent, it isn't really correct
to send an OPEN_DOWNGRADE/share-read request. However, the fact that
the OPEN has been sent isn't visible to nfs4_close_prepare().

There is an asymmetry between open and close w.r.t. updating the
n_[mode] counter and setting the NFS_O_mode_STATE bits.

For close, the counter is decremented, then the server is told, then
the state bits are cleared.
For open, the counter is only incremented, and the state bits only set,
after the server has replied.

I understand that this is probably because the OPEN could fail, and
incrementing a counter before we are sure of success seems unwise. But
incrementing it earlier would allow us to avoid the incorrect
OPEN_DOWNGRADE.
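
To make the window concrete, here is a toy illustration of the two
orderings (plain userspace C, nothing to do with the real data
structures; the send_rpc_*() stubs stand in for the round trip to the
server):

  struct toy_state { int n_rdwr; int rdwr_state_bit; };

  static void send_rpc_open(struct toy_state *s)               { (void)s; }
  static void send_rpc_close_or_downgrade(struct toy_state *s) { (void)s; }

  static void toy_close_rdwr(struct toy_state *s)
  {
      s->n_rdwr--;                      /* counter dropped BEFORE the RPC;
                                         * the CLOSE-vs-DOWNGRADE decision
                                         * is made from these counters */
      send_rpc_close_or_downgrade(s);
      s->rdwr_state_bit = 0;            /* bit cleared after the reply */
  }

  static void toy_open_rdwr(struct toy_state *s)
  {
      send_rpc_open(s);                 /* the OPEN could fail, so ... */
      s->n_rdwr++;                      /* ... counter raised, and bit set,
                                         * only AFTER a successful reply */
      s->rdwr_state_bit = 1;
  }

If toy_open_rdwr() starts just before toy_close_rdwr() runs, the closer
sees n_rdwr == 0 and chooses OPEN_DOWNGRADE even though an
OPEN/share-all is already on the wire.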

Any suggestions on what a good solution would look like? Does it ever
make sense for an OPEN request to be concurrent with a CLOSE or
OPEN_DOWNGRADE? Maybe they should be serialized with each other
(perhaps not as fully as in NFSv4.0, but more than they currently are
in NFSv4.1).

Thanks,
NeilBrown



2019-02-22 04:06:51

by Trond Myklebust

Subject: Re: Another OPEN / OPEN_DOWNGRADE race

Hi Neil

On Fri, 2019-02-22 at 11:58 +1100, NeilBrown wrote:
> Hi,
> I have a report of an NFSv4.1 state management problem in our 4.4
> kernel which appears to be caused by a race between OPEN and
> OPEN_DOWNGRADE, and I don't think the race is fixed in mainline.
>
> The test program creates multiple threads which open a file O_RDWR
> and
> write to it, then opens for O_RDONLY and reads from it.
> A network trace which shows the problem suggests that at a point in
> time where there are some O_RDONLY opens and one O_RDWR open, a new
> O_RDWR open is requested just as the O_RDWR open is being closed.
>
> The close of the O_RDWR clears state->n_rdwr early so
> can_open_cached()
> fails for the O_RDWR open, and it needs to go to the server.
> The open for O_RDWR doesn't increment n_rdwr until after the open
> succeeds, so nfs4_close_prepare sees
> n_rdwr == 0
> n_rdonly > 0
> NFS_O_RDWR_STATE and NFS_O_RDONLY_STATE set
> which causes it to choose an OPEN_DOWNGRADE.
>
> What we see is a OPEN/share-all and an OPEN_DOWNGRADE/share-read
> request are sent one after the other without waiting for a reply.
> The OPEN is processed first, then the OPEN_DOWNGRADE, resulting in a
> state that only allows reads. Then a WRITE is attempted which
> fails.
> This enters a infinite loop with 2 steps:
> - a WRITE gets NFS4ERR_OPENMODE
> - a TEST_STATEID succeeds
>
> Once an OPEN/share-all request has been sent, it isn't really
> correct
> to send an OPEN_DOWNGRADE/share-read request. However the fact that
> the OPEN has been sent isn't visible to nfs4_close_prepare().
>
> There is an asymmetry between open and close w.r.t. updating the
> n_[mode] counter and setting the NFS_O_mode_STATE bits.
>
> For close, the counter is decremented, then the server is told, then
> the state bits are cleared.
> For open, the counter and state bits are both cleared after the
> server
> is asked.
>
> I understand that this is probably because the OPEN could fail, and
> incrementing a counter before we are sure of success seems
> unwise. But
> doing so would allow us to avoid the incorrect OPEN_DOWNGRADE.
>
> Any suggestions on what a good solution would look like? Does it
> ever
> make sense for an OPEN request to be concurrent with a CLOSE or
> OPEN_DOWNGRADE ?? Maybe they should be serialized with each other
> (maybe not as fully serialized as NFSv4.0, but more than they
> currently
> are in NFSv4.1)

If the CLOSE/OPEN_DOWNGRADE is processed by the server _after_ the
OPEN, then the stateid presented by the client will have an incorrect
seqid, and so the server should reject the operation with a
NFS4ERR_OLD_STATEID.

If, on the other hand, the CLOSE/OPEN_DOWNGRADE is processed before the
OPEN, then the client should either see a new stateid with a new
'other' field (CLOSE) or it will see the same stateid but with a seqid
that does not match what it expects (OPEN_DOWNGRADE). In the current
mainline code, the client will then attempt to wait for any "missing"
seqids to get processed (see nfs_set_open_stateid_locked()).
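
For reference, the stateid being compared looks roughly like this
(layout per RFC 5661, section 8.2; the helpers are sketches of the kind
of checks the client makes, in host byte order for simplicity, not the
exact mainline functions):

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  /* stateid4 (RFC 5661): a 32-bit seqid plus a 12-byte "other" field
   * that identifies the piece of state itself. */
  struct stateid4 {
      uint32_t seqid;     /* bumped by the server on every OPEN,
                           * OPEN_DOWNGRADE or CLOSE affecting this state */
      uint8_t  other[12]; /* constant for the lifetime of the state */
  };

  static bool same_state(const struct stateid4 *a, const struct stateid4 *b)
  {
      return memcmp(a->other, b->other, sizeof(a->other)) == 0;
  }

  /* Is 'a' strictly newer than 'b'?  Signed subtraction keeps the
   * comparison correct across seqid wraparound (the same idea as the
   * client's nfs4_stateid_is_newer()). */
  static bool seqid_newer(const struct stateid4 *a, const struct stateid4 *b)
  {
      return (int32_t)(a->seqid - b->seqid) > 0;
  }

So a CLOSE or OPEN_DOWNGRADE carrying seqid 3 after the server has
already handed out 4 and 5 for the same 'other' field should trip the
NFS4ERR_OLD_STATEID check.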

Cheers
Trond

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2019-02-22 05:02:33

by NeilBrown

Subject: Re: Another OPEN / OPEN_DOWNGRADE race

On Fri, Feb 22 2019, Trond Myklebust wrote:

> Hi Neil
>
> On Fri, 2019-02-22 at 11:58 +1100, NeilBrown wrote:
>> Hi,
>> I have a report of an NFSv4.1 state management problem in our 4.4
>> kernel which appears to be caused by a race between OPEN and
>> OPEN_DOWNGRADE, and I don't think the race is fixed in mainline.
>>
>> The test program creates multiple threads which open a file O_RDWR
>> and
>> write to it, then opens for O_RDONLY and reads from it.
>> A network trace which shows the problem suggests that at a point in
>> time where there are some O_RDONLY opens and one O_RDWR open, a new
>> O_RDWR open is requested just as the O_RDWR open is being closed.
>>
>> The close of the O_RDWR clears state->n_rdwr early so
>> can_open_cached()
>> fails for the O_RDWR open, and it needs to go to the server.
>> The open for O_RDWR doesn't increment n_rdwr until after the open
>> succeeds, so nfs4_close_prepare sees
>> n_rdwr == 0
>> n_rdonly > 0
>> NFS_O_RDWR_STATE and NFS_O_RDONLY_STATE set
>> which causes it to choose an OPEN_DOWNGRADE.
>>
>> What we see is a OPEN/share-all and an OPEN_DOWNGRADE/share-read
>> request are sent one after the other without waiting for a reply.
>> The OPEN is processed first, then the OPEN_DOWNGRADE, resulting in a
>> state that only allows reads. Then a WRITE is attempted which
>> fails.
>> This enters a infinite loop with 2 steps:
>> - a WRITE gets NFS4ERR_OPENMODE
>> - a TEST_STATEID succeeds
>>
>> Once an OPEN/share-all request has been sent, it isn't really
>> correct
>> to send an OPEN_DOWNGRADE/share-read request. However the fact that
>> the OPEN has been sent isn't visible to nfs4_close_prepare().
>>
>> There is an asymmetry between open and close w.r.t. updating the
>> n_[mode] counter and setting the NFS_O_mode_STATE bits.
>>
>> For close, the counter is decremented, then the server is told, then
>> the state bits are cleared.
>> For open, the counter and state bits are both cleared after the
>> server
>> is asked.
>>
>> I understand that this is probably because the OPEN could fail, and
>> incrementing a counter before we are sure of success seems
>> unwise. But
>> doing so would allow us to avoid the incorrect OPEN_DOWNGRADE.
>>
>> Any suggestions on what a good solution would look like? Does it
>> ever
>> make sense for an OPEN request to be concurrent with a CLOSE or
>> OPEN_DOWNGRADE ?? Maybe they should be serialized with each other
>> (maybe not as fully serialized as NFSv4.0, but more than they
>> currently
>> are in NFSv4.1)

Hi Trond,
thanks for the reply.

>
> If the CLOSE/OPEN_DOWNGRADE is processed by the server _after_ the
> OPEN, then the stateid presented by the client will have an incorrect
> seqid, and so the server should reject the operation with a
> NFS4ERR_OLD_STATEID.

According to RFC 5661, section 18.18.3, the DESCRIPTION of
OPEN_DOWNGRADE says:

   The seqid argument is not used in NFSv4.1, MAY be any value, and
   MUST be ignored by the server.

So even though the stateid passed (3) is older than the stateids
already returned for some OPEN operations (4 and 5), the server MUST
still process the OPEN_DOWNGRADE, not return an error.


>
> If, on the other hand, the CLOSE/OPEN_DOWNGRADE is processed before the
> OPEN, then the client should either see a new stateid with a new
> 'other' field (CLOSE) or it will see the same stateid but with a seqid
> that does not match what it expects (OPEN_DOWNGRADE). In the current
> mainline code, the client will then attempt to wait for any "missing"
> seqids to get processed (see nfs_set_open_stateid_locked()).

Hmm... I don't have the waiting in my 4.4 code - I should probably make
sure I understand that. However, the tcpdump trace shows that this
scenario isn't what is happening here, so it isn't immediately relevant.

Thanks,
NeilBrown

>
> Cheers
> Trond
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]



2019-02-22 13:31:46

by Trond Myklebust

Subject: Re: Another OPEN / OPEN_DOWNGRADE race

Hi Neil,

On Fri, 2019-02-22 at 16:02 +1100, NeilBrown wrote:
> On Fri, Feb 22 2019, Trond Myklebust wrote:
>
> > Hi Neil
> >
> > On Fri, 2019-02-22 at 11:58 +1100, NeilBrown wrote:
> > > Hi,
> > > I have a report of an NFSv4.1 state management problem in our
> > > 4.4
> > > kernel which appears to be caused by a race between OPEN and
> > > OPEN_DOWNGRADE, and I don't think the race is fixed in mainline.
> > >
> > > The test program creates multiple threads which open a file
> > > O_RDWR
> > > and
> > > write to it, then opens for O_RDONLY and reads from it.
> > > A network trace which shows the problem suggests that at a point
> > > in
> > > time where there are some O_RDONLY opens and one O_RDWR open, a
> > > new
> > > O_RDWR open is requested just as the O_RDWR open is being
> > > closed.
> > >
> > > The close of the O_RDWR clears state->n_rdwr early so
> > > can_open_cached()
> > > fails for the O_RDWR open, and it needs to go to the server.
> > > The open for O_RDWR doesn't increment n_rdwr until after the
> > > open
> > > succeeds, so nfs4_close_prepare sees
> > > n_rdwr == 0
> > > n_rdonly > 0
> > > NFS_O_RDWR_STATE and NFS_O_RDONLY_STATE set
> > > which causes it to choose an OPEN_DOWNGRADE.
> > >
> > > What we see is a OPEN/share-all and an OPEN_DOWNGRADE/share-read
> > > request are sent one after the other without waiting for a
> > > reply.
> > > The OPEN is processed first, then the OPEN_DOWNGRADE, resulting
> > > in a
> > > state that only allows reads. Then a WRITE is attempted which
> > > fails.
> > > This enters a infinite loop with 2 steps:
> > > - a WRITE gets NFS4ERR_OPENMODE
> > > - a TEST_STATEID succeeds
> > >
> > > Once an OPEN/share-all request has been sent, it isn't really
> > > correct
> > > to send an OPEN_DOWNGRADE/share-read request. However the fact
> > > that
> > > the OPEN has been sent isn't visible to nfs4_close_prepare().
> > >
> > > There is an asymmetry between open and close w.r.t. updating the
> > > n_[mode] counter and setting the NFS_O_mode_STATE bits.
> > >
> > > For close, the counter is decremented, then the server is told,
> > > then
> > > the state bits are cleared.
> > > For open, the counter and state bits are both cleared after the
> > > server
> > > is asked.
> > >
> > > I understand that this is probably because the OPEN could fail,
> > > and
> > > incrementing a counter before we are sure of success seems
> > > unwise. But
> > > doing so would allow us to avoid the incorrect OPEN_DOWNGRADE.
> > >
> > > Any suggestions on what a good solution would look like? Does
> > > it
> > > ever
> > > make sense for an OPEN request to be concurrent with a CLOSE or
> > > OPEN_DOWNGRADE ?? Maybe they should be serialized with each
> > > other
> > > (maybe not as fully serialized as NFSv4.0, but more than they
> > > currently
> > > are in NFSv4.1)
>
> Hi Trond,
> thanks for the reply.
>
> > If the CLOSE/OPEN_DOWNGRADE is processed by the server _after_ the
> > OPEN, then the stateid presented by the client will have an
> > incorrect
> > seqid, and so the server should reject the operation with a
> > NFS4ERR_OLD_STATEID.
>
> According to RFC5661, section 18.18.3, the DESCRIPTION of
> OPEN_DOWNGRADE,
>
> The seqid argument is not used in NFSv4.1, MAY be any value, and
> MUST be ignored by the server.
>
> So the fact that the stateid passed (3) is less than the stateids
> already returned for some OPEN operations (4 and 5), the server MUST
> still process the OPEN_DOWNGRADE, not give an error.

That refers to the explicit "seqid" parameter in struct
OPEN_DOWNGRADE4args. I'm referring to the seqid that constitutes the
first 4 bytes of the stateid.

In other words, I'm referring to the mechanism described here:
https://tools.ietf.org/html/rfc5661#section-8.2.2
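
To spell it out, the operation carries both: the seqid embedded in the
open_stateid (the one I mean) and the separate seqid argument (the one
that 18.18.3 says the server must ignore in v4.1). From the XDR in RFC
5661, section 18.18.1 (lightly annotated):

  struct stateid4 {
          uint32_t        seqid;          /* <- the seqid that matters here */
          opaque          other[12];
  };

  struct OPEN_DOWNGRADE4args {
          /* CURRENT_FH: opened file */
          stateid4        open_stateid;
          seqid4          seqid;          /* <- the one 18.18.3 ignores */
          uint32_t        share_access;
          uint32_t        share_deny;
  };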

Note that the Linux client doesn't ever set the stateid seqid to zero
on CLOSE or OPEN_DOWNGRADE precisely because we rely on the
NFS4ERR_OLD_STATEID mechanism to detect races with OPEN.

>
> > If, on the other hand, the CLOSE/OPEN_DOWNGRADE is processed before
> > the
> > OPEN, then the client should either see a new stateid with a new
> > 'other' field (CLOSE) or it will see the same stateid but with a
> > seqid
> > that does not match what it expects (OPEN_DOWNGRADE). In the
> > current
> > mainline code, the client will then attempt to wait for any
> > "missing"
> > seqids to get processed (see nfs_set_open_stateid_locked()).
>
> Hmm.. I don't have the waiting in my 4.4 code - I should probably
> make
> sure I understand that. However the tcpdump trace shows that this
> scenario isn't what is happening, so it isn't immediately relevant.
>

True. The waiting mechanism was introduced by commit c9399f21c215
("NFSv4: Fix OPEN / CLOSE race") in v4.15. We should probably have
kicked that down the stable patch mechanism.
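
Very roughly, the idea is that if a reply carries a stateid seqid more
than one ahead of what we have recorded, we wait a little for the
intervening replies to be applied first. A much-simplified userspace
sketch of that idea (made-up names, not the code around
nfs_set_open_stateid_locked()):

  #include <pthread.h>
  #include <stdint.h>
  #include <time.h>

  struct toy_open_state {
      pthread_mutex_t lock;          /* PTHREAD_MUTEX_INITIALIZER */
      pthread_cond_t  seqid_waiters; /* PTHREAD_COND_INITIALIZER  */
      uint32_t        cur_seqid;     /* seqid we have recorded    */
  };

  /* Apply an OPEN/OPEN_DOWNGRADE/CLOSE reply carrying 'new_seqid'.  If
   * it is more than one ahead of what we hold, an earlier reply has not
   * been processed yet, so wait briefly for it before updating. */
  static void toy_update_stateid(struct toy_open_state *st, uint32_t new_seqid)
  {
      struct timespec deadline;

      clock_gettime(CLOCK_REALTIME, &deadline);
      deadline.tv_sec += 5;                   /* cap the wait in this toy */

      pthread_mutex_lock(&st->lock);
      while ((int32_t)(new_seqid - st->cur_seqid) > 1) {
          if (pthread_cond_timedwait(&st->seqid_waiters,
                                     &st->lock, &deadline) != 0)
              break;                          /* timed out: apply anyway */
      }
      if ((int32_t)(new_seqid - st->cur_seqid) > 0)
          st->cur_seqid = new_seqid;          /* only ever move forwards */
      pthread_mutex_unlock(&st->lock);
      pthread_cond_broadcast(&st->seqid_waiters);
  }

The real code ties the wait to the nfs4_state itself; the hard 5 second
cap above is only there so the toy can't hang forever.
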
Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2019-02-23 01:22:32

by NeilBrown

Subject: Re: Another OPEN / OPEN_DOWNGRADE race

On Fri, Feb 22 2019, Trond Myklebust wrote:

> Hi Neil,
>
> On Fri, 2019-02-22 at 16:02 +1100, NeilBrown wrote:
>> On Fri, Feb 22 2019, Trond Myklebust wrote:
>>
>> > Hi Neil
>> >
>> > On Fri, 2019-02-22 at 11:58 +1100, NeilBrown wrote:
>> > > Hi,
>> > > I have a report of an NFSv4.1 state management problem in our
>> > > 4.4
>> > > kernel which appears to be caused by a race between OPEN and
>> > > OPEN_DOWNGRADE, and I don't think the race is fixed in mainline.
>> > >
>> > > The test program creates multiple threads which open a file
>> > > O_RDWR
>> > > and
>> > > write to it, then opens for O_RDONLY and reads from it.
>> > > A network trace which shows the problem suggests that at a point
>> > > in
>> > > time where there are some O_RDONLY opens and one O_RDWR open, a
>> > > new
>> > > O_RDWR open is requested just as the O_RDWR open is being
>> > > closed.
>> > >
>> > > The close of the O_RDWR clears state->n_rdwr early so
>> > > can_open_cached()
>> > > fails for the O_RDWR open, and it needs to go to the server.
>> > > The open for O_RDWR doesn't increment n_rdwr until after the
>> > > open
>> > > succeeds, so nfs4_close_prepare sees
>> > > n_rdwr == 0
>> > > n_rdonly > 0
>> > > NFS_O_RDWR_STATE and NFS_O_RDONLY_STATE set
>> > > which causes it to choose an OPEN_DOWNGRADE.
>> > >
>> > > What we see is a OPEN/share-all and an OPEN_DOWNGRADE/share-read
>> > > request are sent one after the other without waiting for a
>> > > reply.
>> > > The OPEN is processed first, then the OPEN_DOWNGRADE, resulting
>> > > in a
>> > > state that only allows reads. Then a WRITE is attempted which
>> > > fails.
>> > > This enters a infinite loop with 2 steps:
>> > > - a WRITE gets NFS4ERR_OPENMODE
>> > > - a TEST_STATEID succeeds
>> > >
>> > > Once an OPEN/share-all request has been sent, it isn't really
>> > > correct
>> > > to send an OPEN_DOWNGRADE/share-read request. However the fact
>> > > that
>> > > the OPEN has been sent isn't visible to nfs4_close_prepare().
>> > >
>> > > There is an asymmetry between open and close w.r.t. updating the
>> > > n_[mode] counter and setting the NFS_O_mode_STATE bits.
>> > >
>> > > For close, the counter is decremented, then the server is told,
>> > > then
>> > > the state bits are cleared.
>> > > For open, the counter and state bits are both cleared after the
>> > > server
>> > > is asked.
>> > >
>> > > I understand that this is probably because the OPEN could fail,
>> > > and
>> > > incrementing a counter before we are sure of success seems
>> > > unwise. But
>> > > doing so would allow us to avoid the incorrect OPEN_DOWNGRADE.
>> > >
>> > > Any suggestions on what a good solution would look like? Does
>> > > it
>> > > ever
>> > > make sense for an OPEN request to be concurrent with a CLOSE or
>> > > OPEN_DOWNGRADE ?? Maybe they should be serialized with each
>> > > other
>> > > (maybe not as fully serialized as NFSv4.0, but more than they
>> > > currently
>> > > are in NFSv4.1)
>>
>> Hi Trond,
>> thanks for the reply.
>>
>> > If the CLOSE/OPEN_DOWNGRADE is processed by the server _after_ the
>> > OPEN, then the stateid presented by the client will have an
>> > incorrect
>> > seqid, and so the server should reject the operation with a
>> > NFS4ERR_OLD_STATEID.
>>
>> According to RFC5661, section 18.18.3, the DESCRIPTION of
>> OPEN_DOWNGRADE,
>>
>> The seqid argument is not used in NFSv4.1, MAY be any value, and
>> MUST be ignored by the server.
>>
>> So the fact that the stateid passed (3) is less than the stateids
>> already returned for some OPEN operations (4 and 5), the server MUST
>> still process the OPEN_DOWNGRADE, not give an error.
>
> That refers to the explicit "seqid" parameter in struct
> OPEN_DOWNGRADE4args. I'm referring to the seqid that constitutes the
> first 4 bytes of the stateid.
>
> In other words, I'm referring to the mechanism described here:
> https://tools.ietf.org/html/rfc5661#section-8.2.2
>
> Note that the Linux client doesn't ever set the stateid seqid to zero
> on CLOSE or OPEN_DOWNGRADE precisely because we rely on the
> NFS4ERR_OLD_STATEID mechanism to detect races with OPEN.

Ahhh, I had missed that there were two seqids in OPEN_DOWNGRADE, one in
the stateid and one not.
It makes sense now.

With that perspective, it is pretty clear that the server is
misbehaving.
The OPEN_DOWNGRADE presents a stateid with seqid 3, yet after having
replied to two OPEN requests with seqids 4 and 5, the server processes
that OPEN_DOWNGRADE without error and returns a seqid of 6.

Thanks a lot,

NeilBrown


>
>>
>> > If, on the other hand, the CLOSE/OPEN_DOWNGRADE is processed before
>> > the
>> > OPEN, then the client should either see a new stateid with a new
>> > 'other' field (CLOSE) or it will see the same stateid but with a
>> > seqid
>> > that does not match what it expects (OPEN_DOWNGRADE). In the
>> > current
>> > mainline code, the client will then attempt to wait for any
>> > "missing"
>> > seqids to get processed (see nfs_set_open_stateid_locked()).
>>
>> Hmm.. I don't have the waiting in my 4.4 code - I should probably
>> make
>> sure I understand that. However the tcpdump trace shows that this
>> scenario isn't what is happening, so it isn't immediately relevant.
>>
>
> True. The waiting mechanism was introduced by commit c9399f21c215
> ("NFSv4: Fix OPEN / CLOSE race") in v4.15. We should probably have
> kicked that down the stable patch mechanism.
> Cheers
> Trond
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]

