2008-08-14 11:08:20

by NAKANO Hiroaki

[permalink] [raw]
Subject: [PATCH]lockd: fix handling of grace period after long periods of inactivity

lockd uses time_before() to determine whether the grace period has
expired. This would seem to be enough to avoid timer wrap-around issues,
but, unfortunately, that is not the case. The time_* family of
comparison functions can be safely used to compare jiffies relatively
close in time, but they stop working after approximately LONG_MAX/2
ticks. nfsd can suffer this problem because the time_before() comparison
in lockd() is not performed until the first request comes in, which
means that if there is no lockd traffic for more than LONG_MAX/2 ticks
we are screwed.

The implication of this is that once time_before() starts misbehaving
any attempt from a NFS client to execute fcntl() will be received with a
NLM_LCK_DENIED_GRACE_PERIOD message for 25 days (assuming HZ=1000). In
other words, the 50 seconds grace period could turn into a grace period
of 50 days or more.

This patch corrects this behavior by implementing grace period with a
(retriggerable) timer.

Note: This bug was analyzed independently by Oda-san <[email protected]>
and myself.

Signed-off-by: Nakano Hiroaki <[email protected]>
---

diff -Nrup linux-2.6.27-rc3/fs/lockd/svc.c b/fs/lockd/svc.c
--- linux-2.6.27-rc3/fs/lockd/svc.c 2008-08-14 16:54:24.000000000 +0900
+++ b/fs/lockd/svc.c 2008-08-14 17:04:15.000000000 +0900
@@ -53,6 +53,7 @@ static struct task_struct *nlmsvc_task;
static struct svc_rqst *nlmsvc_rqst;
int nlmsvc_grace_period;
unsigned long nlmsvc_timeout;
+static struct timer_list nlm_period_timer;

/*
* These can be set at insmod time (useful for NFS as root filesystem),
@@ -141,6 +142,12 @@ lockd(void *vrqstp)

grace_period_expire = set_grace_period();

+ init_timer(&nlm_period_timer);
+ nlm_period_timer.function = clear_grace_period;
+ nlm_period_timer.expires = grace_period_expire;
+
+ add_timer(&nlm_period_timer);
+
/*
* The main request loop. We don't terminate until the last
* NFS mount or NFS daemon has gone away.
@@ -154,6 +161,7 @@ lockd(void *vrqstp)
if (nlmsvc_ops) {
nlmsvc_invalidate_all();
grace_period_expire = set_grace_period();
+ mod_timer(&nlm_period_timer, grace_period_expire);
}
continue;
}
@@ -164,10 +172,8 @@ lockd(void *vrqstp)
* (Theoretically, there shouldn't even be blocked locks
* during grace period).
*/
- if (!nlmsvc_grace_period) {
+ if (!nlmsvc_grace_period)
timeout = nlmsvc_retry_blocked();
- } else if (time_before(grace_period_expire, jiffies))
- clear_grace_period();

/*
* Find a socket with data available and call its
@@ -195,6 +201,7 @@ lockd(void *vrqstp)
svc_process(rqstp);
}
flush_signals(current);
+ del_timer(&nlm_period_timer);
if (nlmsvc_ops)
nlmsvc_invalidate_all();
nlm_shutdown_hosts();

--
(NAKANO, Hiroaki) <[email protected]>



Subject: Re: [PATCH]lockd: fix handling of grace period after long periods of inactivity

On Tue, 2008-08-19 at 18:12 -0400, J. Bruce Fields wrote:
> On Fri, Aug 15, 2008 at 10:32:25AM +0900, Fernando Luis V=C3=A1zquez =
Cao wrote:
> > Hi Bruce!
> >=20
> > On Thu, 2008-08-14 at 15:06 -0400, J. Bruce Fields wrote:
> > > On Thu, Aug 14, 2008 at 08:08:16PM +0900, NAKANO Hiroaki wrote:
> > > > lockd uses time_before() to determine whether the grace period =
has
> > > > expired. This would seem to be enough to avoid timer wrap-aroun=
d issues,
> > > > but, unfortunately, that is not the case. The time_* family of
> > > > comparison functions can be safely used to compare jiffies rela=
tively
> > > > close in time, but they stop working after approximately LONG_M=
AX/2
> > > > ticks. nfsd can suffer this problem because the time_before() c=
omparison
> > > > in lockd() is not performed until the first request comes in, w=
hich
> > > > means that if there is no lockd traffic for more than LONG_MAX/=
2 ticks
> > > > we are screwed.
> > > >=20
> > > > The implication of this is that once time_before() starts misbe=
having
> > > > any attempt from a NFS client to execute fcntl() will be receiv=
ed with a
> > > > NLM_LCK_DENIED_GRACE_PERIOD message for 25 days (assuming HZ=3D=
1000). In
> > > > other words, the 50 seconds grace period could turn into a grac=
e period
> > > > of 50 days or more.
> > > >=20
> > > > This patch corrects this behavior by implementing grace period =
with a
> > > > (retriggerable) timer.
> > > >=20
> > > > Note: This bug was analyzed independently by Oda-san <oda@valin=
ux.co.jp>
> > > > and myself.
> > >=20
> > > Good catch! Did you actually run across this in practice? I wou=
ld've
> > > thought it relatively unusual to have a lockd that didn't receive=
its
> > > first lock request until 25 days after startup.
> > Yes, we did find this problem in production. More often than one wo=
uld
> > wish, installing new software in a system that has been running wit=
hout
> > a hiccup for weeks or months is the only thing you will need to bri=
ng
> > mayhem.
> >=20
> > > I still have a mild preference for a work struct just in case we =
end up
> > > wanting to do something slightly more complicated to end the grac=
e
> > > period, but I don't really have anything in mind.
> > For simplicity I think we could we get Nakano-san's patch merged fi=
rst.
> > If needed, moving to a work-based solution should be relatively eas=
ily.
>=20
> There's no real difference; patches I've been planning on submitting =
for
> 2.6.28 follow.
Yes, both approaches are essentially the same.

> patches I've been planning on submitting for 2.6.28 follow.
Thanks!

> (We could submit a patch for 2.6.27, since it's a bugfix, but this is=
n't
> a new regression, so existing users at least won't be made any worse
> off, and this doesn't crash the server, or anything similarly drastic=
=2E
> It's still serious, just not quite serious enough to submit at this
> point in the release cycle, as I understand the rules....)
My only concern is that even though the server will not crash we may ge=
t
seemingly hanged processes on the client side (the reason being that th=
e
client will be receiving only NLM_LCK_DENIED_GRACE_PERIOD messages
during the whole "extended" grace period). It might be worth getting it
merged before 2.6.27 comes out and distros start using it.

- Fernando


2008-08-14 19:07:05

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH]lockd: fix handling of grace period after long periods of inactivity

On Thu, Aug 14, 2008 at 08:08:16PM +0900, NAKANO Hiroaki wrote:
> lockd uses time_before() to determine whether the grace period has
> expired. This would seem to be enough to avoid timer wrap-around issues,
> but, unfortunately, that is not the case. The time_* family of
> comparison functions can be safely used to compare jiffies relatively
> close in time, but they stop working after approximately LONG_MAX/2
> ticks. nfsd can suffer this problem because the time_before() comparison
> in lockd() is not performed until the first request comes in, which
> means that if there is no lockd traffic for more than LONG_MAX/2 ticks
> we are screwed.
>
> The implication of this is that once time_before() starts misbehaving
> any attempt from a NFS client to execute fcntl() will be received with a
> NLM_LCK_DENIED_GRACE_PERIOD message for 25 days (assuming HZ=1000). In
> other words, the 50 seconds grace period could turn into a grace period
> of 50 days or more.
>
> This patch corrects this behavior by implementing grace period with a
> (retriggerable) timer.
>
> Note: This bug was analyzed independently by Oda-san <[email protected]>
> and myself.

Good catch! Did you actually run across this in practice? I would've
thought it relatively unusual to have a lockd that didn't receive its
first lock request until 25 days after startup.

I've actually had a patch that does roughly the same thing for a while
at:

git://linux-nfs.org/~bfields/linux.git failover

3ff893a7.. "lockd: don't depend on lockd main loop to end grace" but
hadn't submitted it since I didn't see the bug you found. (I had other
reasons I wanted to do this). Difference that I can see off-hand:

- I used schedule_delayed_work instead of a timer.
- I forgot to delete the thing before exiting!

So I think my solution has a bug that yours doesn't. (I don't see what
would stop the module being removed before my work gets scheduled.)

I still have a mild preference for a work struct just in case we end up
wanting to do something slightly more complicated to end the grace
period, but I don't really have anything in mind.

--b.

>
> Signed-off-by: Nakano Hiroaki <[email protected]>
> ---
>
> diff -Nrup linux-2.6.27-rc3/fs/lockd/svc.c b/fs/lockd/svc.c
> --- linux-2.6.27-rc3/fs/lockd/svc.c 2008-08-14 16:54:24.000000000 +0900
> +++ b/fs/lockd/svc.c 2008-08-14 17:04:15.000000000 +0900
> @@ -53,6 +53,7 @@ static struct task_struct *nlmsvc_task;
> static struct svc_rqst *nlmsvc_rqst;
> int nlmsvc_grace_period;
> unsigned long nlmsvc_timeout;
> +static struct timer_list nlm_period_timer;
>
> /*
> * These can be set at insmod time (useful for NFS as root filesystem),
> @@ -141,6 +142,12 @@ lockd(void *vrqstp)
>
> grace_period_expire = set_grace_period();
>
> + init_timer(&nlm_period_timer);
> + nlm_period_timer.function = clear_grace_period;
> + nlm_period_timer.expires = grace_period_expire;
> +
> + add_timer(&nlm_period_timer);
> +
> /*
> * The main request loop. We don't terminate until the last
> * NFS mount or NFS daemon has gone away.
> @@ -154,6 +161,7 @@ lockd(void *vrqstp)
> if (nlmsvc_ops) {
> nlmsvc_invalidate_all();
> grace_period_expire = set_grace_period();
> + mod_timer(&nlm_period_timer, grace_period_expire);
> }
> continue;
> }
> @@ -164,10 +172,8 @@ lockd(void *vrqstp)
> * (Theoretically, there shouldn't even be blocked locks
> * during grace period).
> */
> - if (!nlmsvc_grace_period) {
> + if (!nlmsvc_grace_period)
> timeout = nlmsvc_retry_blocked();
> - } else if (time_before(grace_period_expire, jiffies))
> - clear_grace_period();
>
> /*
> * Find a socket with data available and call its
> @@ -195,6 +201,7 @@ lockd(void *vrqstp)
> svc_process(rqstp);
> }
> flush_signals(current);
> + del_timer(&nlm_period_timer);
> if (nlmsvc_ops)
> nlmsvc_invalidate_all();
> nlm_shutdown_hosts();
>
> --
> (NAKANO, Hiroaki) <[email protected]>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Subject: Re: [PATCH]lockd: fix handling of grace period after long periods of inactivity

Hi Bruce!

On Thu, 2008-08-14 at 15:06 -0400, J. Bruce Fields wrote:
> On Thu, Aug 14, 2008 at 08:08:16PM +0900, NAKANO Hiroaki wrote:
> > lockd uses time_before() to determine whether the grace period has
> > expired. This would seem to be enough to avoid timer wrap-around issues,
> > but, unfortunately, that is not the case. The time_* family of
> > comparison functions can be safely used to compare jiffies relatively
> > close in time, but they stop working after approximately LONG_MAX/2
> > ticks. nfsd can suffer this problem because the time_before() comparison
> > in lockd() is not performed until the first request comes in, which
> > means that if there is no lockd traffic for more than LONG_MAX/2 ticks
> > we are screwed.
> >
> > The implication of this is that once time_before() starts misbehaving
> > any attempt from a NFS client to execute fcntl() will be received with a
> > NLM_LCK_DENIED_GRACE_PERIOD message for 25 days (assuming HZ=1000). In
> > other words, the 50 seconds grace period could turn into a grace period
> > of 50 days or more.
> >
> > This patch corrects this behavior by implementing grace period with a
> > (retriggerable) timer.
> >
> > Note: This bug was analyzed independently by Oda-san <[email protected]>
> > and myself.
>
> Good catch! Did you actually run across this in practice? I would've
> thought it relatively unusual to have a lockd that didn't receive its
> first lock request until 25 days after startup.
Yes, we did find this problem in production. More often than one would
wish, installing new software in a system that has been running without
a hiccup for weeks or months is the only thing you will need to bring
mayhem.

> I still have a mild preference for a work struct just in case we end up
> wanting to do something slightly more complicated to end the grace
> period, but I don't really have anything in mind.
For simplicity I think we could we get Nakano-san's patch merged first.
If needed, moving to a work-based solution should be relatively easily.

Thank you for you comments!

- Fernando


2008-08-19 22:12:14

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH]lockd: fix handling of grace period after long periods of inactivity

On Fri, Aug 15, 2008 at 10:32:25AM +0900, Fernando Luis V=C3=A1zquez Ca=
o wrote:
> Hi Bruce!
>=20
> On Thu, 2008-08-14 at 15:06 -0400, J. Bruce Fields wrote:
> > On Thu, Aug 14, 2008 at 08:08:16PM +0900, NAKANO Hiroaki wrote:
> > > lockd uses time_before() to determine whether the grace period ha=
s
> > > expired. This would seem to be enough to avoid timer wrap-around =
issues,
> > > but, unfortunately, that is not the case. The time_* family of
> > > comparison functions can be safely used to compare jiffies relati=
vely
> > > close in time, but they stop working after approximately LONG_MAX=
/2
> > > ticks. nfsd can suffer this problem because the time_before() com=
parison
> > > in lockd() is not performed until the first request comes in, whi=
ch
> > > means that if there is no lockd traffic for more than LONG_MAX/2 =
ticks
> > > we are screwed.
> > >=20
> > > The implication of this is that once time_before() starts misbeha=
ving
> > > any attempt from a NFS client to execute fcntl() will be received=
with a
> > > NLM_LCK_DENIED_GRACE_PERIOD message for 25 days (assuming HZ=3D10=
00). In
> > > other words, the 50 seconds grace period could turn into a grace =
period
> > > of 50 days or more.
> > >=20
> > > This patch corrects this behavior by implementing grace period wi=
th a
> > > (retriggerable) timer.
> > >=20
> > > Note: This bug was analyzed independently by Oda-san <oda@valinux=
=2Eco.jp>
> > > and myself.
> >=20
> > Good catch! Did you actually run across this in practice? I would=
've
> > thought it relatively unusual to have a lockd that didn't receive i=
ts
> > first lock request until 25 days after startup.
> Yes, we did find this problem in production. More often than one woul=
d
> wish, installing new software in a system that has been running witho=
ut
> a hiccup for weeks or months is the only thing you will need to bring
> mayhem.
>=20
> > I still have a mild preference for a work struct just in case we en=
d up
> > wanting to do something slightly more complicated to end the grace
> > period, but I don't really have anything in mind.
> For simplicity I think we could we get Nakano-san's patch merged firs=
t.
> If needed, moving to a work-based solution should be relatively easil=
y.

There's no real difference; patches I've been planning on submitting fo=
r
2.6.28 follow.

(We could submit a patch for 2.6.27, since it's a bugfix, but this isn'=
t
a new regression, so existing users at least won't be made any worse
off, and this doesn't crash the server, or anything similarly drastic.
It's still serious, just not quite serious enough to submit at this
point in the release cycle, as I understand the rules....)

--b.

2008-08-19 22:12:44

by J.Bruce Fields

[permalink] [raw]
Subject: [PATCH 1/2] locks: allow lockd to process blocked locks during grace period

The check here is currently harmless but unnecessary, since, as the
comment notes, there aren't any blocked-lock callbacks to process
during the grace period anyway.

And eventually we want to allow multiple grace periods that come and go
for different filesystems over the course of the lifetime of lockd, at
which point this check is just going to get in the way.

Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/lockd/svc.c | 12 +++---------
1 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 1553fec..bdc607b 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -158,15 +158,9 @@ lockd(void *vrqstp)
continue;
}

- /*
- * Retry any blocked locks that have been notified by
- * the VFS. Don't do this during grace period.
- * (Theoretically, there shouldn't even be blocked locks
- * during grace period).
- */
- if (!nlmsvc_grace_period) {
- timeout = nlmsvc_retry_blocked();
- } else if (time_before(grace_period_expire, jiffies))
+ timeout = nlmsvc_retry_blocked();
+
+ if (time_before(grace_period_expire, jiffies))
clear_grace_period();

/*
--
1.5.5.rc1


2008-08-19 22:12:46

by J.Bruce Fields

[permalink] [raw]
Subject: [PATCH 2/2] lockd: don't depend on lockd main loop to end grace

End lockd's grace period using schedule_delayed_work() instead of a
check on every pass through the main loop.

After a later patch, we'll depend on lockd to end its grace period even
if it's not currently handling requests; so it shouldn't depend on being
woken up from the main loop to do so.

Also, Nakano Hiroaki (who independently produced a similar patch)
noticed that the current behavior is buggy in the face of jiffies
wraparound:

"lockd uses time_before() to determine whether the grace period
has expired. This would seem to be enough to avoid timer
wrap-around issues, but, unfortunately, that is not the case.
The time_* family of comparison functions can be safely used to
compare jiffies relatively close in time, but they stop working
after approximately LONG_MAX/2 ticks. nfsd can suffer this
problem because the time_before() comparison in lockd() is not
performed until the first request comes in, which means that if
there is no lockd traffic for more than LONG_MAX/2 ticks we are
screwed.

"The implication of this is that once time_before() starts
misbehaving any attempt from a NFS client to execute fcntl()
will be received with a NLM_LCK_DENIED_GRACE_PERIOD message for
25 days (assuming HZ=1000). In other words, the 50 seconds grace
period could turn into a grace period of 50 days or more.

"Note: This bug was analyzed independently by Oda-san
<[email protected]> and myself."

Signed-off-by: J. Bruce Fields <[email protected]>
Cc: Nakano Hiroaki <[email protected]>
Cc: Itsuro Oda <[email protected]>
---
fs/lockd/svc.c | 23 ++++++++++++-----------
1 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index bdc607b..450a3fa 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -97,15 +97,19 @@ unsigned long get_nfs_grace_period(void)
}
EXPORT_SYMBOL(get_nfs_grace_period);

-static unsigned long set_grace_period(void)
+static void grace_ender(struct work_struct *not_used)
{
- nlmsvc_grace_period = 1;
- return get_nfs_grace_period() + jiffies;
+ nlmsvc_grace_period = 0;
}

-static inline void clear_grace_period(void)
+static DECLARE_DELAYED_WORK(grace_period_end, grace_ender);
+
+static void set_grace_period(void)
{
- nlmsvc_grace_period = 0;
+ unsigned long grace_period = get_nfs_grace_period() + jiffies;
+
+ nlmsvc_grace_period = 1;
+ schedule_delayed_work(&grace_period_end, grace_period);
}

/*
@@ -116,7 +120,6 @@ lockd(void *vrqstp)
{
int err = 0, preverr = 0;
struct svc_rqst *rqstp = vrqstp;
- unsigned long grace_period_expire;

/* try_to_freeze() is called from svc_recv() */
set_freezable();
@@ -139,7 +142,7 @@ lockd(void *vrqstp)
nlm_timeout = LOCKD_DFLT_TIMEO;
nlmsvc_timeout = nlm_timeout * HZ;

- grace_period_expire = set_grace_period();
+ set_grace_period();

/*
* The main request loop. We don't terminate until the last
@@ -153,16 +156,13 @@ lockd(void *vrqstp)
flush_signals(current);
if (nlmsvc_ops) {
nlmsvc_invalidate_all();
- grace_period_expire = set_grace_period();
+ set_grace_period();
}
continue;
}

timeout = nlmsvc_retry_blocked();

- if (time_before(grace_period_expire, jiffies))
- clear_grace_period();
-
/*
* Find a socket with data available and call its
* recvfrom routine.
@@ -189,6 +189,7 @@ lockd(void *vrqstp)
svc_process(rqstp);
}
flush_signals(current);
+ cancel_delayed_work_sync(&grace_period_end);
if (nlmsvc_ops)
nlmsvc_invalidate_all();
nlm_shutdown_hosts();
--
1.5.5.rc1