2013-05-03 17:26:00

by Jeff Layton

Subject: long delay when mounting due to SETCLIENTID AUTH_GSS attempts

I've noticed that when running a 3.10-pre kernel, if I try to mount an
NFSv4 filesystem it now takes ~15s for the mount to complete.

Here's a little rpcdebug output:

[ 3056.385078] svc: server ffff8800368fc000 waiting for data (to = 9223372036854775807)
[ 3056.392056] RPC: new task initialized, procpid 2471
[ 3056.392758] RPC: allocated task ffff88010cd90100
[ 3056.393303] RPC: 42 __rpc_execute flags=0x1280
[ 3056.393630] RPC: 42 call_start nfs4 proc SETCLIENTID (sync)
[ 3056.394056] RPC: 42 call_reserve (status 0)
[ 3056.394368] RPC: 42 reserved req ffff8801019f9600 xid 21ad6c40
[ 3056.394783] RPC: wake_up_first(ffff88010a989990 "xprt_sending")
[ 3056.395252] RPC: 42 call_reserveresult (status 0)
[ 3056.395595] RPC: 42 call_refresh (status 0)
[ 3056.395901] RPC: gss_create_cred for uid 0, flavor 390004
[ 3056.396361] RPC: gss_create_upcall for uid 0
[ 3071.396134] RPC: AUTH_GSS upcall timed out.
Please check user daemon is running.
[ 3071.397374] RPC: gss_create_upcall for uid 0 result -13
[ 3071.398192] RPC: 42 call_refreshresult (status -13)
[ 3071.398873] RPC: 42 call_refreshresult: refresh creds failed with error -13
[ 3071.399881] RPC: 42 return 0, status -13

The problem is that we're now trying to upcall for GSS creds to do the
SETCLIENTID call, but this host isn't running rpc.gssd. Not running
rpc.gssd is pretty common for people not using kerberized NFS. I think
we'll see a lot of complaints about this.

Is this expected? If so, what's the proposed remedy? Simply have
everyone run rpc.gssd even if they're not using kerberized NFS?
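
(For anyone hitting this in practice, the workaround that comes up later
in the thread is simply to run rpc.gssd. The unit and init-script names
below are distribution-dependent examples, nothing mandated by this
thread:)

    # systemd-based systems (the unit may be called rpc-gssd, rpcgssd or nfs-secure)
    systemctl start rpc-gssd.service
    # sysvinit-style systems
    service rpcgssd start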


--
Jeff Layton <[email protected]>


2013-05-05 22:49:26

by Chuck Lever III

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts


On May 3, 2013, at 3:39 PM, Jeff Layton <[email protected]> wrote:

> On Fri, 3 May 2013 15:26:09 -0400
> Chuck Lever <[email protected]> wrote:
>
>> I don't expect this issue to last for release after release. A moment ago you agreed that this shouldn't be intractable, so I fail to see the need to start wiring up long-term workarounds.
>>
>> Can't we just agree on a fix, and then get that into 3.10 as a regression fix?
>>
>
> I'm happy to help...I'm just not sure what you're proposing as the fix
> for the problem. What are you suggesting we do?

The kernel has historically had difficulty determining the status of GSS support in user space.

One way to remedy this would be to have "systemctl disable nfs-server.service" blacklist the kernel's GSS module, and "systemctl enable nfs-server.service" remove the blacklisting. This mechanism would have to be sensitive to both gssd and svcgssd, of course.
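
One way such a hook could be implemented is a modprobe drop-in that the
disable scriptlet writes and the enable scriptlet removes. This is only a
sketch of the idea; the file name is made up, and the module names shown
are the usual sunrpc GSS modules (a stronger variant would use an
"install <module> /bin/false" rule, since "blacklist" only prevents
automatic loading):

    # /etc/modprobe.d/nfs-gss-disabled.conf
    # written by "systemctl disable nfs-server.service",
    # removed by "systemctl enable nfs-server.service"
    blacklist auth_rpcgss
    blacklist rpcsec_gss_krb5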

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2013-05-03 19:39:44

by Jeff Layton

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts

On Fri, 3 May 2013 15:26:09 -0400
Chuck Lever <[email protected]> wrote:

> I don't expect this issue to last for release after release. A moment ago you agreed that this shouldn't be intractable, so I fail to see the need to start wiring up long-term workarounds.
>
> Can't we just agree on a fix, and then get that into 3.10 as a regression fix?
>

I'm happy to help...I'm just not sure what you're proposing as the fix
for the problem. What are you suggesting we do?

--
Jeff Layton <[email protected]>

2013-05-07 20:53:30

by J. Bruce Fields

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts

On Fri, May 03, 2013 at 05:50:49PM -0400, Chuck Lever wrote:
>
> On May 3, 2013, at 5:18 PM, "J. Bruce Fields" <[email protected]> wrote:
>
> > On Fri, May 03, 2013 at 02:48:59PM -0400, Chuck Lever wrote:
> >> We should always use krb5i if a GSS context can be established with
> >> our machine cred. As I said before, SETCLIENTID and
> >> GETATTR(fs_locations) really should use an integrity-protecting
> >> security flavor no matter what flavor is in effect on the mount points
> >> themselves.
> >
> > Can you give an example of a threat that could be avoided by this?
> >
> > My suspicion is that in most cases an attacker with the ability to
> > subvert auth_sys could *also* DOS gssd, and hence force the fallback to
> > auth_sys.
> >
> > krb5i plus a fallback to auth_sys on failure to authenticate doesn't
> > sound to me much more secure than just auth_sys.
>
> Our current situation is that the first mount of a server determines the flavor to use for SETCLIENTID. So if that mount happens to be "sec=sys" the SETCLIENTID is done with AUTH_SYS no matter what the subsequent mounts request.
>
> That's just about as secure in many cases as falling back.
>
> > If we really want much security benefit from krb5i on state operations,
> > I think we need to really *require* krb5i.
> >
> > So I'm inclined towards Jeff's solution: don't do this unless userspace
> > somehow affirmatively states that it requires krb5i on state operations.
>
> This would have to be on a per-server basis. Where would an admin specify such an option? I don't believe either a mount option (too fine) or a module parameter (too coarse) is appropriate.

Why do you think a mount option is too fine? They can use nfsmount.conf
to specify per-server or global defaults.
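
As an illustration, nfsmount.conf already has per-server sections, so a
new option for state-management security could ride the same mechanism.
The server name below is made up, and "Sec" is the existing option shown
only to demonstrate the section syntax:

    # /etc/nfsmount.conf
    [ NFSMount_Global_Options ]
    Sec=sys

    [ Server "nfs.example.com" ]
    Sec=krb5i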

> > I agree that we should default to this when we can. But the way to move
> > towards that default is then to get distributions to turn on the new
> > module parm (or whatever it is) by default. As they do so, they can
> > also ensure that e.g. gssd is started.
>
> gssd is not all that's required.

Isn't that all that's required to fix the delay problem? Or do you
still get a delay if you run gssd but don't create a keytab?

--b.

> A keytab must be provisioned on the client and server for it to work, and that's the main issue I'm trying to address: We need "sec=krb5" mounts to work when the client has no GSS machine credential.
>
> And I think we already have a problem with gssd not picking up kernel requests quickly, as Trond pointed out.
>
> I don't feel like any of this is new, or that any of this is a strong reason to revert. But it could be a reason to move forward from here.
>
> Does gssd distinguish between:
>
> - no local keytab
>
> - no local support for GSS
>
> - server doesn't grok GSS
>
> - some network problem occurred
>
> Can it communicate that distinction to the kernel? What if we fall back only in the "no keytab" and "client's kernel has no GSS support" cases? In the "server doesn't grok" case, do not fall back if "sec=krb5*" is specified on the mount point? "A network problem occurred" is an "always fail" case.
>
>
>
> >
> > --b.
> >
> >>> Instead of using AUTH_GSS for SETCLIENTID by default, would it make
> >>> sense to add a switch (module parm?) that turns it on so that it can be
> >>> an opt-in thing rather than doing this by default?
> >>
> >> Why add another tunable when we really should just fix the delay?
> >>
> >> Besides, if gssd is running and no keytab exists, then the fallback to AUTH_SYS should be fast. Is that not an effective workaround until we address the delay problem?
> >>
> >> --
> >> Chuck Lever
> >> chuck[dot]lever[at]oracle[dot]com
> >>
> >>
> >>
> >>
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>
>

2013-05-03 21:51:02

by Chuck Lever III

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts


On May 3, 2013, at 5:18 PM, "J. Bruce Fields" <[email protected]> wrote:

> On Fri, May 03, 2013 at 02:48:59PM -0400, Chuck Lever wrote:
>> We should always use krb5i if a GSS context can be established with
>> our machine cred. As I said before, SETCLIENTID and
>> GETATTR(fs_locations) really should use an integrity-protecting
>> security flavor no matter what flavor is in effect on the mount points
>> themselves.
>
> Can you give an example of a threat that could be avoided by this?
>
> My suspicion is that in most cases an attacker with the ability to
> subvert auth_sys could *also* DOS gssd, and hence force the fallback to
> auth_sys.
>
> krb5i plus a fallback to auth_sys on failure to authenticate doesn't
> sound to me much more secure than just auth_sys.

Our current situation is that the first mount of a server determines the flavor to use for SETCLIENTID. So if that mount happens to be "sec=sys" the SETCLIENTID is done with AUTH_SYS no matter what the subsequent mounts request.

That's just about as secure in many cases as falling back.

> If we really want much security benefit from krb5i on state operations,
> I think we need to really *require* krb5i.
>
> So I'm inclined towards Jeff's solution: don't do this unless userspace
> somehow affirmatively states that it requires krb5i on state operations.

This would have to be on a per-server basis. Where would an admin specify such an option? I don't believe either a mount option (too fine) or a module parameter (too coarse) is appropriate.

> I agree that we should default to this when we can. But the way to move
> towards that default is then to get distributions to turn on the new
> module parm (or whatever it is) by default. As they do so, they can
> also ensure that e.g. gssd is started.

gssd is not all that's required. A keytab must be provisioned on the client and server for it to work, and that's the main issue I'm trying to address: We need "sec=krb5" mounts to work when the client has no GSS machine credential.
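
"Provisioning a keytab" typically means something like the following on an
MIT Kerberos setup; this is a sketch only, and the principal and paths are
made-up examples:

    # on the Kerberos admin host
    kadmin -q "addprinc -randkey nfs/client.example.com"
    kadmin -q "ktadd -k /tmp/client.keytab nfs/client.example.com"
    # then install the result as /etc/krb5.keytab on the client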

And I think we already have a problem with gssd not picking up kernel requests quickly, as Trond pointed out.

I don't feel like any of this is new, or that any of this is a strong reason to revert. But it could be a reason to move forward from here.

Does gssd distinguish between:

- no local keytab

- no local support for GSS

- server doesn't grok GSS

- some network problem occurred

Can it communicate that distinction to the kernel? What if we fall back only in the "no keytab" and "client's kernel has no GSS support" cases? In the "server doesn't grok" case, do not fall back if "sec=krb5*" is specified on the mount point? "A network problem occurred" is an "always fail" case.



>
> --b.
>
>>> Instead of using AUTH_GSS for SETCLIENTID by default, would it make
>>> sense to add a switch (module parm?) that turns it on so that it can be
>>> an opt-in thing rather than doing this by default?
>>
>> Why add another tunable when we really should just fix the delay?
>>
>> Besides, if gssd is running and no keytab exists, then the fallback to AUTH_SYS should be fast. Is that not an effective workaround until we address the delay problem?
>>
>> --
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>>
>>
>>
>>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-05-07 21:36:51

by Myklebust, Trond

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts

On Tue, 2013-05-07 at 17:02 -0400, J. Bruce Fields wrote:
> I suppose if gssd is running then it should always hold open the parent
> (rpc_pipefs/nfs) directory. So if that isn't open it might be safe to
> assume we can fail immediately.

If rpc.idmapd is running, then it will do the same.

One possible solution might simply be to put up a 'gssd' pipe in
rpc_pipefs/nfs and use that as a metric. The problem is that IIRC,
rpc.gssd will release all pipes and then reopen them on getting a new
directory notification...

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2013-05-07 21:26:47

by J. Bruce Fields

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts

On Tue, May 07, 2013 at 05:21:11PM -0400, Chuck Lever wrote:
>
> On May 7, 2013, at 5:02 PM, "J. Bruce Fields" <[email protected]> wrote:
>
> > On Mon, May 06, 2013 at 06:33:48PM -0400, Chuck Lever wrote:
> >>
> >> On May 3, 2013, at 5:50 PM, Chuck Lever <[email protected]> wrote:
> >>
> >>>
> >>> On May 3, 2013, at 5:18 PM, "J. Bruce Fields" <[email protected]> wrote:
> >>>
> >>>> On Fri, May 03, 2013 at 02:48:59PM -0400, Chuck Lever wrote:
> >>>>> We should always use krb5i if a GSS context can be established with
> >>>>> our machine cred. As I said before, SETCLIENTID and
> >>>>> GETATTR(fs_locations) really should use an integrity-protecting
> >>>>> security flavor no matter what flavor is in effect on the mount points
> >>>>> themselves.
> >>>>
> >>>> Can you give an example of a threat that could be avoided by this?
> >>>>
> >>>> My suspicion is that in most cases an attacker with the ability to
> >>>> subvert auth_sys could *also* DOS gssd, and hence force the fallback to
> >>>> auth_sys.
> >>>>
> >>>> krb5i plus a fallback to auth_sys on failure to authenticate doesn't
> >>>> sound to me much more secure than just auth_sys.
> >>>
> >>> Our current situation is that the first mount of a server determines the flavor to use for SETCLIENTID. So if that mount happens to be "sec=sys" the SETCLIENTID is done with AUTH_SYS no matter what the subsequent mounts request.
> >>>
> >>> That's just about as secure in many cases as falling back.
> >>>
> >>>> If we really want much security benefit from krb5i on state operations,
> >>>> I think we need to really *require* krb5i.
> >>>>
> >>>> So I'm inclined towards Jeff's solution: don't do this unless userspace
> >>>> somehow affirmatively states that it requires krb5i on state operations.
> >>>
> >>> This would have to be on a per-server basis. Where would an admin specify such an option? I don't believe either a mount option (too fine) or a module parameter (too coarse) is appropriate.
> >>>
> >>>> I agree that we should default to this when we can. But the way to move
> >>>> towards that default is then to get distributions to turn on the new
> >>>> module parm (or whatever it is) by default. As they do so, they can
> >>>> also ensure that e.g. gssd is started.
> >>>
> >>> gssd is not all that's required. A keytab must be provisioned on the client and server for it to work, and that's the main issue I'm trying to address: We need "sec=krb5" mounts to work when the client has no GSS machine credential.
> >>>
> >>> And I think we already have a problem with gssd not picking up kernel requests quickly, as Trond pointed out.
> >>>
> >>> I don't feel like any of this is new, or that any of this is a strong reason to revert. But it could be a reason to move forward from here.
> >>>
> >>> Does gssd distinguish between:
> >>>
> >>> - no local keytab
> >>>
> >>> - no local support for GSS
> >>>
> >>> - server doesn't grok GSS
> >>>
> >>> - some network problem occurred
> >>>
> >>> Can it communicate that distinction to the kernel? What if we fall back only in the "no keytab" and "client's kernel has no GSS support" cases? In the "server doesn't grok" case, do not fall back if "sec=krb5*" is specified on the mount point? "A network problem occurred" is an "always fail" case.
> >>
> >> After some thought, here's an algorithm for selecting a flavor to use for state management operations:
> >>
> >> Start with krb5i.
> >>
> >> - If a GSS flavor is specified on the mount point, and if
> >> there is no local keytab, fall back to AUTH_SYS; otherwise
> >> if any other issue occurs, fail immediately. This
> >> modification should address your security concern.
> >>
> >> - If a non-GSS flavor is specified on the mount point, or no
> >> flavor is specified, and there is any problem with krb5i,
> >> fall back to AUTH_SYS. This is the current 3.10 behavior,
> >> and assumes there is a solution to Jeff's 15 second upcall
> >> delay issue.
> >
> > I'm having a hard time thinking of one....
>
> Apparently not... :-)
>
> > I suppose if gssd is running then it should always hold open the parent
> > (rpc_pipefs/nfs) directory. So if that isn't open it might be safe to
> > assume we can fail immediately.
>
> I can look into that, if no-one beats me to it. I'm pretty slammed this week, though.
>
> Note also: If rpcauth_gss.ko isn't loaded, the rpcauth_create() call should fail immediately if the kernel cannot load that module. Why not make that module unloadable if gssd isn't running?

OK by me, but that requires a userspace change, so it doesn't really
address the kernel regression.

> > Right now for most people the effect of an upgrade to 3.10 is a new
> > 15 second delay on mount? (I'm assuming distributions default to
> > not running gssd.) Seems painful.
>
> The effect is that people who don't run gssd will see a 15 second
> delay the first time they mount a server with NFSv4. Subsequent
> mounts of that server should not see that delay because the
> SETCLIENTID has already been done.
>
> But 3.10 is not final. We have an opportunity to fix this now. So
> making claims about how painful this is right now is academic. Are
> you guys just trying to give me a hard time about this? Because it's
> not helping anybody. Nobody is saying this is something that should
> not be fixed as quickly as we can.

OK, good, sorry to pester.

--b.

2013-05-03 18:33:55

by Myklebust, Trond

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts

On Fri, 2013-05-03 at 14:24 -0400, Jeff Layton wrote:
> On Fri, 3 May 2013 13:56:13 -0400
> Chuck Lever <[email protected]> wrote:
>
> >
> > On May 3, 2013, at 1:25 PM, Jeff Layton <[email protected]> wrote:
> >
> > > I've noticed that when running a 3.10-pre kernel that if I try to mount
> > > up a NFSv4 filesystem that it now takes ~15s for the mount to complete.
> > >
> > > Here's a little rpcdebug output:
> > >
> > > [ 3056.385078] svc: server ffff8800368fc000 waiting for data (to = 9223372036854775807)
> > > [ 3056.392056] RPC: new task initialized, procpid 2471
> > > [ 3056.392758] RPC: allocated task ffff88010cd90100
> > > [ 3056.393303] RPC: 42 __rpc_execute flags=0x1280
> > > [ 3056.393630] RPC: 42 call_start nfs4 proc SETCLIENTID (sync)
> > > [ 3056.394056] RPC: 42 call_reserve (status 0)
> > > [ 3056.394368] RPC: 42 reserved req ffff8801019f9600 xid 21ad6c40
> > > [ 3056.394783] RPC: wake_up_first(ffff88010a989990 "xprt_sending")
> > > [ 3056.395252] RPC: 42 call_reserveresult (status 0)
> > > [ 3056.395595] RPC: 42 call_refresh (status 0)
> > > [ 3056.395901] RPC: gss_create_cred for uid 0, flavor 390004
> > > [ 3056.396361] RPC: gss_create_upcall for uid 0
> > > [ 3071.396134] RPC: AUTH_GSS upcall timed out.
> > > Please check user daemon is running.
> > > [ 3071.397374] RPC: gss_create_upcall for uid 0 result -13
> > > [ 3071.398192] RPC: 42 call_refreshresult (status -13)
> > > [ 3071.398873] RPC: 42 call_refreshresult: refresh creds failed with error -13
> > > [ 3071.399881] RPC: 42 return 0, status -13
> > >
> > > The problem is that we're now trying to upcall for GSS creds to do the
> > > SETCLIENTID call, but this host isn't running rpc.gssd. Not running
> > > rpc.gssd is pretty common for people not using kerberized NFS. I think
> > > we'll see a lot of complaints about this.
> > >
> > > Is this expected?
> >
> > Yes.
> >
> > There are operations like SETCLIENTID and GETATTR(fs_locations) which should always use an integrity-checking security flavor, even if particular mount points use sec=sys.
> >
> > There are cases where GSS is not available, and we fall back to using AUTH_SYS. That should happen as quickly as possible, I agree.
> >
> > > If so, what's the proposed remedy?
> > > Simply have everyone run rpc.gssd even if they're not using kerberized NFS?
> >
> >
> > That's one possibility. Or we could shorten the upcall timeout. Or, add a mechanism by which rpc.gssd can provide a positive indication to the kernel that it is running.
> >
> > It doesn't seem like an intractable problem.
> >
>
> Nope, it's not intractable at all...
>
> Currently, the gssd upcall uses the RPC_PIPE_WAIT_FOR_OPEN flag to
> allow you to queue upcalls to be processed when the daemon isn't up
> yet. When the daemon starts, it processes that queue. The caller gives
> up after 15s (which is what's happening here), and the upcall
> eventually gets scraped out of the queue after 30s.
>
> We could stop using that flag on this rpc_pipe and simply require that
> the daemon be up and running before attempting any sort of AUTH_GSS
> rpc. That might be a little less friendly in the face of boot-time
> ordering problems, but it should presumably make this problem go away.

You probably don't want to do that... The main reason for the
RPC_PIPE_WAIT_FOR_OPEN is that even if the gssd daemon is running, it
takes it a moment or two to notice that a new client directory has been
created, and that there is a new 'krb' pipe to attach to.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2013-05-03 19:26:17

by Chuck Lever III

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts


On May 3, 2013, at 3:17 PM, Jeff Layton <[email protected]> wrote:

> On Fri, 3 May 2013 14:48:59 -0400
> Chuck Lever <[email protected]> wrote:
>
>>
>> On May 3, 2013, at 2:44 PM, Jeff Layton <[email protected]> wrote:
>>
>>> On Fri, 3 May 2013 18:33:54 +0000
>>> "Myklebust, Trond" <[email protected]> wrote:
>>>
>>>> On Fri, 2013-05-03 at 14:24 -0400, Jeff Layton wrote:
>>>>> On Fri, 3 May 2013 13:56:13 -0400
>>>>> Chuck Lever <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On May 3, 2013, at 1:25 PM, Jeff Layton <[email protected]> wrote:
>>>>>>
>>>>>>> I've noticed that when running a 3.10-pre kernel that if I try to mount
>>>>>>> up a NFSv4 filesystem that it now takes ~15s for the mount to complete.
>>>>>>>
>>>>>>> Here's a little rpcdebug output:
>>>>>>>
>>>>>>> [ 3056.385078] svc: server ffff8800368fc000 waiting for data (to = 9223372036854775807)
>>>>>>> [ 3056.392056] RPC: new task initialized, procpid 2471
>>>>>>> [ 3056.392758] RPC: allocated task ffff88010cd90100
>>>>>>> [ 3056.393303] RPC: 42 __rpc_execute flags=0x1280
>>>>>>> [ 3056.393630] RPC: 42 call_start nfs4 proc SETCLIENTID (sync)
>>>>>>> [ 3056.394056] RPC: 42 call_reserve (status 0)
>>>>>>> [ 3056.394368] RPC: 42 reserved req ffff8801019f9600 xid 21ad6c40
>>>>>>> [ 3056.394783] RPC: wake_up_first(ffff88010a989990 "xprt_sending")
>>>>>>> [ 3056.395252] RPC: 42 call_reserveresult (status 0)
>>>>>>> [ 3056.395595] RPC: 42 call_refresh (status 0)
>>>>>>> [ 3056.395901] RPC: gss_create_cred for uid 0, flavor 390004
>>>>>>> [ 3056.396361] RPC: gss_create_upcall for uid 0
>>>>>>> [ 3071.396134] RPC: AUTH_GSS upcall timed out.
>>>>>>> Please check user daemon is running.
>>>>>>> [ 3071.397374] RPC: gss_create_upcall for uid 0 result -13
>>>>>>> [ 3071.398192] RPC: 42 call_refreshresult (status -13)
>>>>>>> [ 3071.398873] RPC: 42 call_refreshresult: refresh creds failed with error -13
>>>>>>> [ 3071.399881] RPC: 42 return 0, status -13
>>>>>>>
>>>>>>> The problem is that we're now trying to upcall for GSS creds to do the
>>>>>>> SETCLIENTID call, but this host isn't running rpc.gssd. Not running
>>>>>>> rpc.gssd is pretty common for people not using kerberized NFS. I think
>>>>>>> we'll see a lot of complaints about this.
>>>>>>>
>>>>>>> Is this expected?
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>> There are operations like SETCLIENTID and GETATTR(fs_locations) which should always use an integrity-checking security flavor, even if particular mount points use sec=sys.
>>>>>>
>>>>>> There are cases where GSS is not available, and we fall back to using AUTH_SYS. That should happen as quickly as possible, I agree.
>>>>>>
>>>>>>> If so, what's the proposed remedy?
>>>>>>> Simply have everyone run rpc.gssd even if they're not using kerberized NFS?
>>>>>>
>>>>>>
>>>>>> That's one possibility. Or we could shorten the upcall timeout. Or, add a mechanism by which rpc.gssd can provide a positive indication to the kernel that it is running.
>>>>>>
>>>>>> It doesn't seem like an intractable problem.
>>>>>>
>>>>>
>>>>> Nope, it's not intractable at all...
>>>>>
>>>>> Currently, the gssd upcall uses the RPC_PIPE_WAIT_FOR_OPEN flag to
>>>>> allow you to queue upcalls to be processed when the daemon isn't up
>>>>> yet. When the daemon starts, it processes that queue. The caller gives
>>>>> up after 15s (which is what's happening here), and the upcall
>>>>> eventually gets scraped out of the queue after 30s.
>>>>>
>>>>> We could stop using that flag on this rpc_pipe and simply require that
>>>>> the daemon be up and running before attempting any sort of AUTH_GSS
>>>>> rpc. That might be a little less friendly in the face of boot-time
>>>>> ordering problems, but it should presumably make this problem go away.
>>>>
>>>> You probably don't want to do that... The main reason for the
>>>> RPC_PIPE_WAIT_FOR_OPEN is that even if the gssd daemon is running, it
>>>> takes it a moment or two to notice that a new client directory has been
>>>> created, and that there is a new 'krb' pipe to attach to.
>>>>
>>>
>>> Ok yeah, good point...
>>>
>>> Shortening the timeout will also suck -- that'll just reduce the pain
>>> somewhat but will still be a performance regression. It looks like even
>>> specifying '-o sec=sys' doesn't disable this behavior. Should it?
>>
>> Nope.
>>
>> We should always use krb5i if a GSS context can be established with our machine cred. As I said before, SETCLIENTID and GETATTR(fs_locations) really should use an integrity-protecting security flavor no matter what flavor is in effect on the mount points themselves.
>>
>>> Instead of using AUTH_GSS for SETCLIENTID by default, would it make
>>> sense to add a switch (module parm?) that turns it on so that it can be
>>> an opt-in thing rather than doing this by default?
>>
>> Why add another tunable when we really should just fix the delay?
>>
>
> Because just shortening the delay will still leave you with a delay.
> Fewer people might notice and complain if it's shorter, but it'll still
> be there. It'll be particularly annoying with autofs...
>
> You also run the risk of hitting the problem Trond mentioned if you
> shorten it too much (timing out the upcall before gssd's duty cycle has
> a chance to get to it).

So what about taking one of the other approaches I mentioned?

>
>> Besides, if gssd is running and no keytab exists, then the fallback to AUTH_SYS should be fast. Is that not an effective workaround until we address the delay problem?
>>
>
> Yep, no problem if gssd is running. I'm concerned about the common case
> where it isn't. The expectation in the past has always been that if you
> weren't running kerberized NFS that you didn't need to run gssd. That
> has now changed and if you don't want to suffer a delay when mounting
> (however short it eventually is) then you need to run it.

Why are you assuming this is a permanent change?

> Might it make sense to introduce this change more gradually? Somehow
> warn people who aren't running gssd that they ought to start turning it
> on before we do this by default?

I don't expect this issue to last for release after release. A moment ago you agreed that this shouldn't be intractable, so I fail to see the need to start wiring up long-term workarounds.

Can't we just agree on a fix, and then get that into 3.10 as a regression fix?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-05-07 21:24:06

by Jeff Layton

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts

On Tue, 7 May 2013 16:53:21 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Fri, May 03, 2013 at 05:50:49PM -0400, Chuck Lever wrote:
> >
> > On May 3, 2013, at 5:18 PM, "J. Bruce Fields" <[email protected]> wrote:
> >
> > > On Fri, May 03, 2013 at 02:48:59PM -0400, Chuck Lever wrote:
> > >> We should always use krb5i if a GSS context can be established with
> > >> our machine cred. As I said before, SETCLIENTID and
> > >> GETATTR(fs_locations) really should use an integrity-protecting
> > >> security flavor no matter what flavor is in effect on the mount points
> > >> themselves.
> > >
> > > Can you give an example of a threat that could be avoided by this?
> > >
> > > My suspicion is that in most cases an attacker with the ability to
> > > subvert auth_sys could *also* DOS gssd, and hence force the fallback to
> > > auth_sys.
> > >
> > > krb5i plus a fallback to auth_sys on failure to authenticate doesn't
> > > sound to me much more secure than just auth_sys.
> >
> > Our current situation is that the first mount of a server determines the flavor to use for SETCLIENTID. So if that mount happens to be "sec=sys" the SETCLIENTID is done with AUTH_SYS no matter what the subsequent mounts request.
> >
> > That's just about as secure in many cases as falling back.
> >
> > > If we really want much security benefit from krb5i on state operations,
> > > I think we need to really *require* krb5i.
> > >
> > > So I'm inclined towards Jeff's solution: don't do this unless userspace
> > > somehow affirmatively states that it requires krb5i on state operations.
> >
> > This would have to be on a per-server basis. Where would an admin specify such an option? I don't believe either a mount option (too fine) or a module parameter (too coarse) is appropriate.
>
> Why do you think a mount option is too fine? They can use nfsmount.conf
> to specify per-server or global defaults.
>
> > > I agree that we should default to this when we can. But the way to move
> > > towards that default is then to get distributions to turn on the new
> > > module parm (or whatever it is) by default. As they do so, they can
> > > also ensure that e.g. gssd is started.
> >
> > gssd is not all that's required.
>
> Isn't that all that's required to fix the delay problem? Or do you
> still get a delay if you run gssd but don't create a keytab?
>

It seems like running gssd is sufficient to make the delay go away
for me. The upcall gets a quick negative response and then the kernel
falls back to doing AUTH_SYS.

While that's a reasonable workaround for now, I'm not sure we want a
solution that requires having yet another daemon running if we can get
away with it.

--
Jeff Layton <[email protected]>

2013-05-03 21:18:14

by J. Bruce Fields

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts

On Fri, May 03, 2013 at 02:48:59PM -0400, Chuck Lever wrote:
> We should always use krb5i if a GSS context can be established with
> our machine cred. As I said before, SETCLIENTID and
> GETATTR(fs_locations) really should use an integrity-protecting
> security flavor no matter what flavor is in effect on the mount points
> themselves.

Can you give an example of a threat that could be avoided by this?

My suspicion is that in most cases an attacker with the ability to
subvert auth_sys could *also* DOS gssd, and hence force the fallback to
auth_sys.

krb5i plus a fallback to auth_sys on failure to authenticate doesn't
sound to me much more secure than just auth_sys.

If we really want much security benefit from krb5i on state operations,
I think we need to really *require* krb5i.

So I'm inclined towards Jeff's solution: don't do this unless userspace
somehow affirmatively states that it requires krb5i on state operations.

I agree that we should default to this when we can. But the way to move
towards that default is then to get distributions to turn on the new
module parm (or whatever it is) by default. As they do so, they can
also ensure that e.g. gssd is started.

--b.

> > Instead of using AUTH_GSS for SETCLIENTID by default, would it make
> > sense to add a switch (module parm?) that turns it on so that it can be
> > an opt-in thing rather than doing this by default?
>
> Why add another tunable when we really should just fix the delay?
>
> Besides, if gssd is running and no keytab exists, then the fallback to AUTH_SYS should be fast. Is that not an effective workaround until we address the delay problem?
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>
>

2013-05-03 17:56:20

by Chuck Lever III

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts


On May 3, 2013, at 1:25 PM, Jeff Layton <[email protected]> wrote:

> I've noticed that when running a 3.10-pre kernel that if I try to mount
> up a NFSv4 filesystem that it now takes ~15s for the mount to complete.
>
> Here's a little rpcdebug output:
>
> [ 3056.385078] svc: server ffff8800368fc000 waiting for data (to = 9223372036854775807)
> [ 3056.392056] RPC: new task initialized, procpid 2471
> [ 3056.392758] RPC: allocated task ffff88010cd90100
> [ 3056.393303] RPC: 42 __rpc_execute flags=0x1280
> [ 3056.393630] RPC: 42 call_start nfs4 proc SETCLIENTID (sync)
> [ 3056.394056] RPC: 42 call_reserve (status 0)
> [ 3056.394368] RPC: 42 reserved req ffff8801019f9600 xid 21ad6c40
> [ 3056.394783] RPC: wake_up_first(ffff88010a989990 "xprt_sending")
> [ 3056.395252] RPC: 42 call_reserveresult (status 0)
> [ 3056.395595] RPC: 42 call_refresh (status 0)
> [ 3056.395901] RPC: gss_create_cred for uid 0, flavor 390004
> [ 3056.396361] RPC: gss_create_upcall for uid 0
> [ 3071.396134] RPC: AUTH_GSS upcall timed out.
> Please check user daemon is running.
> [ 3071.397374] RPC: gss_create_upcall for uid 0 result -13
> [ 3071.398192] RPC: 42 call_refreshresult (status -13)
> [ 3071.398873] RPC: 42 call_refreshresult: refresh creds failed with error -13
> [ 3071.399881] RPC: 42 return 0, status -13
>
> The problem is that we're now trying to upcall for GSS creds to do the
> SETCLIENTID call, but this host isn't running rpc.gssd. Not running
> rpc.gssd is pretty common for people not using kerberized NFS. I think
> we'll see a lot of complaints about this.
>
> Is this expected?

Yes.

There are operations like SETCLIENTID and GETATTR(fs_locations) which should always use an integrity-checking security flavor, even if particular mount points use sec=sys.

There are cases where GSS is not available, and we fall back to using AUTH_SYS. That should happen as quickly as possible, I agree.

> If so, what's the proposed remedy?
> Simply have everyone run rpc.gssd even if they're not using kerberized NFS?


That's one possibility. Or we could shorten the upcall timeout. Or, add a mechanism by which rpc.gssd can provide a positive indication to the kernel that it is running.

It doesn't seem like an intractable problem.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-05-03 18:24:24

by Jeff Layton

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts

On Fri, 3 May 2013 13:56:13 -0400
Chuck Lever <[email protected]> wrote:

>
> On May 3, 2013, at 1:25 PM, Jeff Layton <[email protected]> wrote:
>
> > I've noticed that when running a 3.10-pre kernel that if I try to mount
> > up a NFSv4 filesystem that it now takes ~15s for the mount to complete.
> >
> > Here's a little rpcdebug output:
> >
> > [ 3056.385078] svc: server ffff8800368fc000 waiting for data (to = 9223372036854775807)
> > [ 3056.392056] RPC: new task initialized, procpid 2471
> > [ 3056.392758] RPC: allocated task ffff88010cd90100
> > [ 3056.393303] RPC: 42 __rpc_execute flags=0x1280
> > [ 3056.393630] RPC: 42 call_start nfs4 proc SETCLIENTID (sync)
> > [ 3056.394056] RPC: 42 call_reserve (status 0)
> > [ 3056.394368] RPC: 42 reserved req ffff8801019f9600 xid 21ad6c40
> > [ 3056.394783] RPC: wake_up_first(ffff88010a989990 "xprt_sending")
> > [ 3056.395252] RPC: 42 call_reserveresult (status 0)
> > [ 3056.395595] RPC: 42 call_refresh (status 0)
> > [ 3056.395901] RPC: gss_create_cred for uid 0, flavor 390004
> > [ 3056.396361] RPC: gss_create_upcall for uid 0
> > [ 3071.396134] RPC: AUTH_GSS upcall timed out.
> > Please check user daemon is running.
> > [ 3071.397374] RPC: gss_create_upcall for uid 0 result -13
> > [ 3071.398192] RPC: 42 call_refreshresult (status -13)
> > [ 3071.398873] RPC: 42 call_refreshresult: refresh creds failed with error -13
> > [ 3071.399881] RPC: 42 return 0, status -13
> >
> > The problem is that we're now trying to upcall for GSS creds to do the
> > SETCLIENTID call, but this host isn't running rpc.gssd. Not running
> > rpc.gssd is pretty common for people not using kerberized NFS. I think
> > we'll see a lot of complaints about this.
> >
> > Is this expected?
>
> Yes.
>
> There are operations like SETCLIENTID and GETATTR(fs_locations) which should always use an integrity-checking security flavor, even if particular mount points use sec=sys.
>
> There are cases where GSS is not available, and we fall back to using AUTH_SYS. That should happen as quickly as possible, I agree.
>
> > If so, what's the proposed remedy?
> > Simply have everyone run rpc.gssd even if they're not using kerberized NFS?
>
>
> That's one possibility. Or we could shorten the upcall timeout. Or, add a mechanism by which rpc.gssd can provide a positive indication to the kernel that it is running.
>
> It doesn't seem like an intractable problem.
>

Nope, it's not intractable at all...

Currently, the gssd upcall uses the RPC_PIPE_WAIT_FOR_OPEN flag to
allow you to queue upcalls to be processed when the daemon isn't up
yet. When the daemon starts, it processes that queue. The caller gives
up after 15s (which is what's happening here), and the upcall
eventually gets scraped out of the queue after 30s.

We could stop using that flag on this rpc_pipe and simply require that
the daemon be up and running before attempting any sort of AUTH_GSS
rpc. That might be a little less friendly in the face of boot-time
ordering problems, but it should presumably make this problem go away.

--
Jeff Layton <[email protected]>

2013-05-03 18:44:40

by Jeff Layton

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts

On Fri, 3 May 2013 18:33:54 +0000
"Myklebust, Trond" <[email protected]> wrote:

> On Fri, 2013-05-03 at 14:24 -0400, Jeff Layton wrote:
> > On Fri, 3 May 2013 13:56:13 -0400
> > Chuck Lever <[email protected]> wrote:
> >
> > >
> > > On May 3, 2013, at 1:25 PM, Jeff Layton <[email protected]> wrote:
> > >
> > > > I've noticed that when running a 3.10-pre kernel that if I try to mount
> > > > up a NFSv4 filesystem that it now takes ~15s for the mount to complete.
> > > >
> > > > Here's a little rpcdebug output:
> > > >
> > > > [ 3056.385078] svc: server ffff8800368fc000 waiting for data (to = 9223372036854775807)
> > > > [ 3056.392056] RPC: new task initialized, procpid 2471
> > > > [ 3056.392758] RPC: allocated task ffff88010cd90100
> > > > [ 3056.393303] RPC: 42 __rpc_execute flags=0x1280
> > > > [ 3056.393630] RPC: 42 call_start nfs4 proc SETCLIENTID (sync)
> > > > [ 3056.394056] RPC: 42 call_reserve (status 0)
> > > > [ 3056.394368] RPC: 42 reserved req ffff8801019f9600 xid 21ad6c40
> > > > [ 3056.394783] RPC: wake_up_first(ffff88010a989990 "xprt_sending")
> > > > [ 3056.395252] RPC: 42 call_reserveresult (status 0)
> > > > [ 3056.395595] RPC: 42 call_refresh (status 0)
> > > > [ 3056.395901] RPC: gss_create_cred for uid 0, flavor 390004
> > > > [ 3056.396361] RPC: gss_create_upcall for uid 0
> > > > [ 3071.396134] RPC: AUTH_GSS upcall timed out.
> > > > Please check user daemon is running.
> > > > [ 3071.397374] RPC: gss_create_upcall for uid 0 result -13
> > > > [ 3071.398192] RPC: 42 call_refreshresult (status -13)
> > > > [ 3071.398873] RPC: 42 call_refreshresult: refresh creds failed with error -13
> > > > [ 3071.399881] RPC: 42 return 0, status -13
> > > >
> > > > The problem is that we're now trying to upcall for GSS creds to do the
> > > > SETCLIENTID call, but this host isn't running rpc.gssd. Not running
> > > > rpc.gssd is pretty common for people not using kerberized NFS. I think
> > > > we'll see a lot of complaints about this.
> > > >
> > > > Is this expected?
> > >
> > > Yes.
> > >
> > > There are operations like SETCLIENTID and GETATTR(fs_locations) which should always use an integrity-checking security flavor, even if particular mount points use sec=sys.
> > >
> > > There are cases where GSS is not available, and we fall back to using AUTH_SYS. That should happen as quickly as possible, I agree.
> > >
> > > > If so, what's the proposed remedy?
> > > > Simply have everyone run rpc.gssd even if they're not using kerberized NFS?
> > >
> > >
> > > That's one possibility. Or we could shorten the upcall timeout. Or, add a mechanism by which rpc.gssd can provide a positive indication to the kernel that it is running.
> > >
> > > It doesn't seem like an intractable problem.
> > >
> >
> > Nope, it's not intractable at all...
> >
> > Currently, the gssd upcall uses the RPC_PIPE_WAIT_FOR_OPEN flag to
> > allow you to queue upcalls to be processed when the daemon isn't up
> > yet. When the daemon starts, it processes that queue. The caller gives
> > up after 15s (which is what's happening here), and the upcall
> > eventually gets scraped out of the queue after 30s.
> >
> > We could stop using that flag on this rpc_pipe and simply require that
> > the daemon be up and running before attempting any sort of AUTH_GSS
> > rpc. That might be a little less friendly in the face of boot-time
> > ordering problems, but it should presumably make this problem go away.
>
> You probably don't want to do that... The main reason for the
> RPC_PIPE_WAIT_FOR_OPEN is that even if the gssd daemon is running, it
> takes it a moment or two to notice that a new client directory has been
> created, and that there is a new 'krb' pipe to attach to.
>

Ok yeah, good point...

Shortening the timeout will also suck -- that'll just reduce the pain
somewhat but will still be a performance regression. It looks like even
specifying '-o sec=sys' doesn't disable this behavior. Should it?

Instead of using AUTH_GSS for SETCLIENTID by default, would it make
sense to add a switch (module parm?) that turns it on so that it can be
an opt-in thing rather than doing this by default?

--
Jeff Layton <[email protected]>

2013-05-06 22:34:05

by Chuck Lever III

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts


On May 3, 2013, at 5:50 PM, Chuck Lever <[email protected]> wrote:

>
> On May 3, 2013, at 5:18 PM, "J. Bruce Fields" <[email protected]> wrote:
>
>> On Fri, May 03, 2013 at 02:48:59PM -0400, Chuck Lever wrote:
>>> We should always use krb5i if a GSS context can be established with
>>> our machine cred. As I said before, SETCLIENTID and
>>> GETATTR(fs_locations) really should use an integrity-protecting
>>> security flavor no matter what flavor is in effect on the mount points
>>> themselves.
>>
>> Can you give an example of a threat that could be avoided by this?
>>
>> My suspicion is that in most cases an attacker with the ability to
>> subvert auth_sys could *also* DOS gssd, and hence force the fallback to
>> auth_sys.
>>
>> krb5i plus a fallback to auth_sys on failure to authenticate doesn't
>> sound to me much more secure than just auth_sys.
>
> Our current situation is that the first mount of a server determines the flavor to use for SETCLIENTID. So if that mount happens to be "sec=sys" the SETCLIENTID is done with AUTH_SYS no matter what the subsequent mounts request.
>
> That's just about as secure in many cases as falling back.
>
>> If we really want much security benefit from krb5i on state operations,
>> I think we need to really *require* krb5i.
>>
>> So I'm inclined towards Jeff's solution: don't do this unless userspace
>> somehow affirmatively states that it requires krb5i on state operations.
>
> This would have to be on a per-server basis. Where would an admin specify such an option? I don't believe either a mount option (too fine) or a module parameter (too coarse) is appropriate.
>
>> I agree that we should default to this when we can. But the way to move
>> towards that default is then to get distributions to turn on the new
>> module parm (or whatever it is) by default. As they do so, they can
>> also ensure that e.g. gssd is started.
>
> gssd is not all that's required. A keytab must be provisioned on the client and server for it to work, and that's the main issue I'm trying to address: We need "sec=krb5" mounts to work when the client has no GSS machine credential.
>
> And I think we already have a problem with gssd not picking up kernel requests quickly, as Trond pointed out.
>
> I don't feel like any of this is new, or that any of this is a strong reason to revert. But it could be a reason to move forward from here.
>
> Does gssd distinguish between:
>
> - no local keytab
>
> - no local support for GSS
>
> - server doesn't grok GSS
>
> - some network problem occurred
>
> Can it communicate that distinction to the kernel? What if we fall back only in the "no keytab" and "client's kernel has no GSS support" cases? In the "server doesn't grok" case, do not fall back if "sec=krb5*" is specified on the mount point? "A network problem occurred" is an "always fail" case.

After some thought, here's an algorithm for selecting a flavor to use for state management operations:

Start with krb5i.

- If a GSS flavor is specified on the mount point, and if
there is no local keytab, fall back to AUTH_SYS; otherwise
if any other issue occurs, fail immediately. This
modification should address your security concern.

- If a non-GSS flavor is specified on the mount point, or no
flavor is specified, and there is any problem with krb5i,
fall back to AUTH_SYS. This is the current 3.10 behavior,
and assumes there is a solution to Jeff's 15 second upcall
delay issue.
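
Expressed as a minimal sketch (illustration only, not kernel code; the
type and function names below are hypothetical), the policy above reads
roughly as:

    #include <stdbool.h>

    enum sec_flavor { FLAVOR_UNSPEC, FLAVOR_SYS, FLAVOR_KRB5,
                      FLAVOR_KRB5I, FLAVOR_KRB5P };
    enum gss_status { GSS_OK, GSS_NO_KEYTAB, GSS_OTHER_FAILURE };

    /* Pick the flavor for SETCLIENTID and other state management calls.
     * Returns the chosen flavor, or FLAVOR_UNSPEC to fail outright. */
    static enum sec_flavor choose_state_flavor(enum sec_flavor mount_flavor,
                                               enum gss_status krb5i_status)
    {
            bool mount_uses_gss = mount_flavor == FLAVOR_KRB5 ||
                                  mount_flavor == FLAVOR_KRB5I ||
                                  mount_flavor == FLAVOR_KRB5P;

            /* Start with krb5i whenever a GSS context can be established. */
            if (krb5i_status == GSS_OK)
                    return FLAVOR_KRB5I;

            if (mount_uses_gss)
                    /* No machine credential: fall back; anything else: fail. */
                    return krb5i_status == GSS_NO_KEYTAB ? FLAVOR_SYS
                                                         : FLAVOR_UNSPEC;

            /* sec=sys or no flavor specified: any krb5i problem falls back. */
            return FLAVOR_SYS;
    }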

We need the client to be more deterministic than "use the security flavor on the first mount operation done by this client" when selecting a flavor for state management. If the client chooses arbitrarily among multiple flavors, it runs the risk of getting NFS4ERR_CLID_INUSE from servers, since this flavor is no longer part of the nfs_client_id4.id string.

I'm happy to consider a separate setting on the client for this purpose, but I'm having a hard time imagining what the administrative interface might look like.

The prospect of getting a distinctive error code from rpc.gssd on current systems for the "no keytab" case is not good. I'm looking at process_krb5_upcall() in nfs-utils 1.2.8: from what I can tell, the downcall error is either -EKEYEXPIRED when the user credential has expired, or -EACCES for everything else.

However, I can write a patch to change rpc.gssd to return, say, -ENOKEY, if no keytab is available when the kernel requests a machine credential.

Is that worth pursuing in the long run?
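
A rough user-space sketch of that distinction (not nfs-utils code; the
helper name and keytab path handling are hypothetical, and ENOKEY/EACCES
are ordinary Linux errno values):

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    /* Decide which error to send down to the kernel when a machine
     * credential cannot be obtained.  Sketch only. */
    static int machine_cred_error(const char *keytab_path)
    {
            struct stat st;

            /* No keytab at all: a distinctive error would let the kernel
             * fall back to AUTH_SYS quickly. */
            if (stat(keytab_path, &st) != 0)
                    return -ENOKEY;

            /* Keytab exists but the GSS context could not be established:
             * keep the historical catch-all. */
            return -EACCES;
    }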


>
>
>>
>> --b.
>>
>>>> Instead of using AUTH_GSS for SETCLIENTID by default, would it make
>>>> sense to add a switch (module parm?) that turns it on so that it can be
>>>> an opt-in thing rather than doing this by default?
>>>
>>> Why add another tunable when we really should just fix the delay?
>>>
>>> Besides, if gssd is running and no keytab exists, then the fallback to AUTH_SYS should be fast. Is that not an effective workaround until we address the delay problem?
>>>
>>> --
>>> Chuck Lever
>>> chuck[dot]lever[at]oracle[dot]com
>>>
>>>
>>>
>>>
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>
>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-05-07 21:44:24

by J. Bruce Fields

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts

On Tue, May 07, 2013 at 09:36:49PM +0000, Myklebust, Trond wrote:
> On Tue, 2013-05-07 at 17:02 -0400, J. Bruce Fields wrote:
> > I suppose if gssd is running then it should always hold open the parent
> > (rpc_pipefs/nfs) directory. So if that isn't open it might be safe to
> > assume we can fail immediately.
>
> If rpc.idmapd is running, then it will do the same.

Whoops, you're right.

> One possible solution might simply be to put up a 'gssd' pipe in
> rpc_pipefs/nfs and use that as a metric. The problem is that IIRC,
> rpc.gssd will release all pipes and then reopen them on getting a new
> directory notification...

Yeah, longer term it'd be nice to have some better way to recognize when
gssd is up than that wait_for_pipe_open hack, but I don't see anything
that works with current gssd.

--b.

2013-05-07 21:02:02

by J. Bruce Fields

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts

On Mon, May 06, 2013 at 06:33:48PM -0400, Chuck Lever wrote:
>
> On May 3, 2013, at 5:50 PM, Chuck Lever <[email protected]> wrote:
>
> >
> > On May 3, 2013, at 5:18 PM, "J. Bruce Fields" <[email protected]> wrote:
> >
> >> On Fri, May 03, 2013 at 02:48:59PM -0400, Chuck Lever wrote:
> >>> We should always use krb5i if a GSS context can be established with
> >>> our machine cred. As I said before, SETCLIENTID and
> >>> GETATTR(fs_locations) really should use an integrity-protecting
> >>> security flavor no matter what flavor is in effect on the mount points
> >>> themselves.
> >>
> >> Can you give an example of a threat that could be avoided by this?
> >>
> >> My suspicion is that in most cases an attacker with the ability to
> >> subvert auth_sys could *also* DOS gssd, and hence force the fallback to
> >> auth_sys.
> >>
> >> krb5i plus a fallback to auth_sys on failure to authenticate doesn't
> >> sound to me much more secure than just auth_sys.
> >
> > Our current situation is that the first mount of a server determines the flavor to use for SETCLIENTID. So if that mount happens to be "sec=sys" the SETCLIENTID is done with AUTH_SYS no matter what the subsequent mounts request.
> >
> > That's just about as secure in many cases as falling back.
> >
> >> If we really want much security benefit from krb5i on state operations,
> >> I think we need to really *require* krb5i.
> >>
> >> So I'm inclined towards Jeff's solution: don't do this unless userspace
> >> somehow affirmatively states that it requires krb5i on state operations.
> >
> > This would have to be on a per-server basis. Where would an admin specify such an option? I don't believe either a mount option (too fine) or a module parameter (too coarse) is appropriate.
> >
> >> I agree that we should default to this when we can. But the way to move
> >> towards that default is then to get distributions to turn on the new
> >> module parm (or whatever it is) by default. As they do so, they can
> >> also ensure that e.g. gssd is started.
> >
> > gssd is not all that's required. A keytab must be provisioned on the client and server for it to work, and that's the main issue I'm trying to address: We need "sec=krb5" mounts to work when the client has no GSS machine credential.
> >
> > And I think we already have a problem with gssd not picking up kernel requests quickly, as Trond pointed out.
> >
> > I don't feel like any of this is new, or that any of this is a strong reason to revert. But it could be a reason to move forward from here.
> >
> > Does gssd distinguish between:
> >
> > - no local keytab
> >
> > - no local support for GSS
> >
> > - server doesn't grok GSS
> >
> > - some network problem occurred
> >
> > Can it communicate that distinction to the kernel? What if we fall back only in the "no keytab" and "client's kernel has no GSS support" cases? In the "server doesn't grok" case, do not fall back if "sec=krb5*" is specified on the mount point? "A network problem occurred" is an "always fail" case.
>
> After some thought, here's an algorithm for selecting a flavor to use for state management operations:
>
> Start with krb5i.
>
> - If a GSS flavor is specified on the mount point, and if
> there is no local keytab, fall back to AUTH_SYS; otherwise
> if any other issue occurs, fail immediately. This
> modification should address your security concern.
>
> - If a non-GSS flavor is specified on the mount point, or no
> flavor is specified, and there is any problem with krb5i,
> fall back to AUTH_SYS. This is the current 3.10 behavior,
> and assumes there is a solution to Jeff's 15 second upcall
> delay issue.

I'm having a hard time thinking of one....

I suppose if gssd is running then it should always hold open the parent
(rpc_pipefs/nfs) directory. So if that isn't open it might be safe to
assume we can fail immediately.

Right now for most people the effect of an upgrade to 3.10 is a new 15
second delay on mount? (I'm assuming distributions default to not
running gssd.) Seems painful.

--b.

2013-05-07 21:27:21

by Chuck Lever III

Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts


On May 7, 2013, at 4:53 PM, "J. Bruce Fields" <[email protected]> wrote:

> On Fri, May 03, 2013 at 05:50:49PM -0400, Chuck Lever wrote:
>>
>> On May 3, 2013, at 5:18 PM, "J. Bruce Fields" <[email protected]> wrote:
>>
>>> On Fri, May 03, 2013 at 02:48:59PM -0400, Chuck Lever wrote:
>>>> We should always use krb5i if a GSS context can be established with
>>>> our machine cred. As I said before, SETCLIENTID and
>>>> GETATTR(fs_locations) really should use an integrity-protecting
>>>> security flavor no matter what flavor is in effect on the mount points
>>>> themselves.
>>>
>>> Can you give an example of a threat that could be avoided by this?
>>>
>>> My suspicion is that in most cases an attacker with the ability to
>>> subvert auth_sys could *also* DOS gssd, and hence force the fallback to
>>> auth_sys.
>>>
>>> krb5i plus a fallback to auth_sys on failure to authenticate doesn't
>>> sound to me much more secure than just auth_sys.
>>
>> Our current situation is that the first mount of a server determines the flavor to use for SETCLIENTID. So if that mount happens to be "sec=sys" the SETCLIENTID is done with AUTH_SYS no matter what the subsequent mounts request.
>>
>> That's just about as secure in many cases as falling back.
>>
>>> If we really want much security benefit from krb5i on state operations,
>>> I think we need to really *require* krb5i.
>>>
>>> So I'm inclined towards Jeff's solution: don't do this unless userspace
>>> somehow affirmatively states that it requires krb5i on state operations.
>>
>> This would have to be on a per-server basis. Where would an admin specify such an option? I don't believe either a mount option (too fine) or a module parameter (too coarse) is appropriate.
>
> Why do you think a mount option is too fine? They can use nfsmount.conf
> to specify per-server or global defaults.
>
>>> I agree that we should default to this when we can. But the way to move
>>> towards that default is then to get distributions to turn on the new
>>> module parm (or whatever it is) by default. As they do so, they can
>>> also ensure that e.g. gssd is started.
>>
>> gssd is not all that's required.
>
> Isn't that all that's required to fix the delay problem?

It's all that is required to work around the delay. But it does not address the security concern you brought up last week.

> Or do you
> still get a delay if you run gssd but don't create a keytab?
>
> --b.
>
>> A keytab must be provisioned on the client and server for it to work, and that's the main issue I'm trying to address: We need "sec=krb5" mounts to work when the client has no GSS machine credential.
>>
>> And I think we already have a problem with gssd not picking up kernel requests quickly, as Trond pointed out.
>>
>> I don't feel like any of this is new, or that any of this is a strong reason to revert. But it could be a reason to move forward from here.
>>
>> Does gssd distinguish between:
>>
>> - no local keytab
>>
>> - no local support for GSS
>>
>> - server doesn't grok GSS
>>
>> - some network problem occurred
>>
>> Can it communicate that distinction to the kernel? What if we fall back only in the "no keytab" and "client's kernel has no GSS support" cases? In the "server doesn't grok" case, do not fall back if "sec=krb5*" is specified on the mount point? "A network problem occurred" is an "always fail" case.
>>
>>
>>
>>>
>>> --b.
>>>
>>>>> Instead of using AUTH_GSS for SETCLIENTID by default, would it make
>>>>> sense to add a switch (module parm?) that turns it on so that it can be
>>>>> an opt-in thing rather than doing this by default?
>>>>
>>>> Why add another tunable when we really should just fix the delay?
>>>>
>>>> Besides, if gssd is running and no keytab exists, then the fallback to AUTH_SYS should be fast. Is that not an effective workaround until we address the delay problem?
>>>>
>>>> --
>>>> Chuck Lever
>>>> chuck[dot]lever[at]oracle[dot]com
>>>>
>>>>
>>>>
>>>>
>>
>> --
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>>
>>
>>
>>

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-05-03 19:17:31

by Jeff Layton

[permalink] [raw]
Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts

On Fri, 3 May 2013 14:48:59 -0400
Chuck Lever <[email protected]> wrote:

>
> On May 3, 2013, at 2:44 PM, Jeff Layton <[email protected]> wrote:
>
> > On Fri, 3 May 2013 18:33:54 +0000
> > "Myklebust, Trond" <[email protected]> wrote:
> >
> >> On Fri, 2013-05-03 at 14:24 -0400, Jeff Layton wrote:
> >>> On Fri, 3 May 2013 13:56:13 -0400
> >>> Chuck Lever <[email protected]> wrote:
> >>>
> >>>>
> >>>> On May 3, 2013, at 1:25 PM, Jeff Layton <[email protected]> wrote:
> >>>>
> >>>>> I've noticed that when running a 3.10-pre kernel that if I try to mount
> >>>>> up a NFSv4 filesystem that it now takes ~15s for the mount to complete.
> >>>>>
> >>>>> Here's a little rpcdebug output:
> >>>>>
> >>>>> [ 3056.385078] svc: server ffff8800368fc000 waiting for data (to = 9223372036854775807)
> >>>>> [ 3056.392056] RPC: new task initialized, procpid 2471
> >>>>> [ 3056.392758] RPC: allocated task ffff88010cd90100
> >>>>> [ 3056.393303] RPC: 42 __rpc_execute flags=0x1280
> >>>>> [ 3056.393630] RPC: 42 call_start nfs4 proc SETCLIENTID (sync)
> >>>>> [ 3056.394056] RPC: 42 call_reserve (status 0)
> >>>>> [ 3056.394368] RPC: 42 reserved req ffff8801019f9600 xid 21ad6c40
> >>>>> [ 3056.394783] RPC: wake_up_first(ffff88010a989990 "xprt_sending")
> >>>>> [ 3056.395252] RPC: 42 call_reserveresult (status 0)
> >>>>> [ 3056.395595] RPC: 42 call_refresh (status 0)
> >>>>> [ 3056.395901] RPC: gss_create_cred for uid 0, flavor 390004
> >>>>> [ 3056.396361] RPC: gss_create_upcall for uid 0
> >>>>> [ 3071.396134] RPC: AUTH_GSS upcall timed out.
> >>>>> Please check user daemon is running.
> >>>>> [ 3071.397374] RPC: gss_create_upcall for uid 0 result -13
> >>>>> [ 3071.398192] RPC: 42 call_refreshresult (status -13)
> >>>>> [ 3071.398873] RPC: 42 call_refreshresult: refresh creds failed with error -13
> >>>>> [ 3071.399881] RPC: 42 return 0, status -13
> >>>>>
> >>>>> The problem is that we're now trying to upcall for GSS creds to do the
> >>>>> SETCLIENTID call, but this host isn't running rpc.gssd. Not running
> >>>>> rpc.gssd is pretty common for people not using kerberized NFS. I think
> >>>>> we'll see a lot of complaints about this.
> >>>>>
> >>>>> Is this expected?
> >>>>
> >>>> Yes.
> >>>>
> >>>> There are operations like SETCLIENTID and GETATTR(fs_locations) which should always use an integrity-checking security flavor, even if particular mount points use sec=sys.
> >>>>
> >>>> There are cases where GSS is not available, and we fall back to using AUTH_SYS. That should happen as quickly as possible, I agree.
> >>>>
> >>>>> If so, what's the proposed remedy?
> >>>>> Simply have everyone run rpc.gssd even if they're not using kerberized NFS?
> >>>>
> >>>>
> >>>> That's one possibility. Or we could shorten the upcall timeout. Or, add a mechanism by which rpc.gssd can provide a positive indication to the kernel that it is running.
> >>>>
> >>>> It doesn't seem like an intractable problem.
> >>>>
> >>>
> >>> Nope, it's not intractable at all...
> >>>
> >>> Currently, the gssd upcall uses the RPC_PIPE_WAIT_FOR_OPEN flag to
> >>> allow you to queue upcalls to be processed when the daemon isn't up
> >>> yet. When the daemon starts, it processes that queue. The caller gives
> >>> up after 15s (which is what's happening here), and the upcall
> >>> eventually gets scraped out of the queue after 30s.
> >>>
> >>> We could stop using that flag on this rpc_pipe and simply require that
> >>> the daemon be up and running before attempting any sort of AUTH_GSS
> >>> rpc. That might be a little less friendly in the face of boot-time
> >>> ordering problems, but it should presumably make this problem go away.
> >>
> >> You probably don't want to do that... The main reason for the
> >> RPC_PIPE_WAIT_FOR_OPEN is that even if the gssd daemon is running, it
> >> takes it a moment or two to notice that a new client directory has been
> >> created, and that there is a new 'krb' pipe to attach to.
> >>
> >
> > Ok yeah, good point...
> >
> > Shortening the timeout will also suck -- that'll just reduce the pain
> > somewhat but will still be a performance regression. It looks like even
> > specifying '-o sec=sys' doesn't disable this behavior. Should it?
>
> Nope.
>
> We should always use krb5i if a GSS context can be established with our machine cred. As I said before, SETCLIENTID and GETATTR(fs_locations) really should use an integrity-protecting security flavor no matter what flavor is in effect on the mount points themselves.
>
> > Instead of using AUTH_GSS for SETCLIENTID by default, would it make
> > sense to add a switch (module parm?) that turns it on so that it can be
> > an opt-in thing rather than doing this by default?
>
> Why add another tunable when we really should just fix the delay?
>

Because just shortening the delay will still leave you with a delay.
Fewer people might notice and complain if it's shorter, but it'll still
be there. It'll be particularly annoying with autofs...

You also run the risk of hitting the problem Trond mentioned if you
shorten it too much (timing out the upcall before gssd's duty cycle has
a chance to get to it).

> Besides, if gssd is running and no keytab exists, then the fallback to AUTH_SYS should be fast. Is that not an effective workaround until we address the delay problem?
>

Yep, no problem if gssd is running. I'm concerned about the common case
where it isn't. The expectation in the past has always been that if you
weren't using kerberized NFS, you didn't need to run gssd. That has now
changed: if you don't want to suffer a delay when mounting (however
short it eventually is), you need to run it.

Might it make sense to introduce this change more gradually? Somehow
warn people who aren't running gssd that they ought to start turning it
on before we do this by default?
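
For example (untested, and both the parameter name and the helper are
made up), something that keeps krb5i for SETCLIENTID opt-in for now and
just nudges everyone else in the logs instead of stalling the mount:

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/printk.h>
#include <linux/sunrpc/clnt.h>

/* Hypothetical opt-in switch; no such nfs module parameter exists. */
static bool nfs4_gss_state_mgmt;
module_param(nfs4_gss_state_mgmt, bool, 0644);
MODULE_PARM_DESC(nfs4_gss_state_mgmt,
		 "Use krb5i for NFSv4 state management operations (needs rpc.gssd)");

static rpc_authflavor_t nfs4_choose_state_flavor(void)
{
	if (nfs4_gss_state_mgmt)
		return RPC_AUTH_GSS_KRB5I;

	/* one-time nudge so admins can start running rpc.gssd before
	 * krb5i becomes the default for these operations */
	pr_warn_once("NFS: using AUTH_SYS for SETCLIENTID; start rpc.gssd "
		     "and set nfs4_gss_state_mgmt=1 to use krb5i\n");
	return RPC_AUTH_UNIX;
}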

--
Jeff Layton <[email protected]>

2013-05-03 18:49:05

by Chuck Lever III

[permalink] [raw]
Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts


On May 3, 2013, at 2:44 PM, Jeff Layton <[email protected]> wrote:

> On Fri, 3 May 2013 18:33:54 +0000
> "Myklebust, Trond" <[email protected]> wrote:
>
>> On Fri, 2013-05-03 at 14:24 -0400, Jeff Layton wrote:
>>> On Fri, 3 May 2013 13:56:13 -0400
>>> Chuck Lever <[email protected]> wrote:
>>>
>>>>
>>>> On May 3, 2013, at 1:25 PM, Jeff Layton <[email protected]> wrote:
>>>>
>>>>> I've noticed that when running a 3.10-pre kernel that if I try to mount
>>>>> up a NFSv4 filesystem that it now takes ~15s for the mount to complete.
>>>>>
>>>>> Here's a little rpcdebug output:
>>>>>
>>>>> [ 3056.385078] svc: server ffff8800368fc000 waiting for data (to = 9223372036854775807)
>>>>> [ 3056.392056] RPC: new task initialized, procpid 2471
>>>>> [ 3056.392758] RPC: allocated task ffff88010cd90100
>>>>> [ 3056.393303] RPC: 42 __rpc_execute flags=0x1280
>>>>> [ 3056.393630] RPC: 42 call_start nfs4 proc SETCLIENTID (sync)
>>>>> [ 3056.394056] RPC: 42 call_reserve (status 0)
>>>>> [ 3056.394368] RPC: 42 reserved req ffff8801019f9600 xid 21ad6c40
>>>>> [ 3056.394783] RPC: wake_up_first(ffff88010a989990 "xprt_sending")
>>>>> [ 3056.395252] RPC: 42 call_reserveresult (status 0)
>>>>> [ 3056.395595] RPC: 42 call_refresh (status 0)
>>>>> [ 3056.395901] RPC: gss_create_cred for uid 0, flavor 390004
>>>>> [ 3056.396361] RPC: gss_create_upcall for uid 0
>>>>> [ 3071.396134] RPC: AUTH_GSS upcall timed out.
>>>>> Please check user daemon is running.
>>>>> [ 3071.397374] RPC: gss_create_upcall for uid 0 result -13
>>>>> [ 3071.398192] RPC: 42 call_refreshresult (status -13)
>>>>> [ 3071.398873] RPC: 42 call_refreshresult: refresh creds failed with error -13
>>>>> [ 3071.399881] RPC: 42 return 0, status -13
>>>>>
>>>>> The problem is that we're now trying to upcall for GSS creds to do the
>>>>> SETCLIENTID call, but this host isn't running rpc.gssd. Not running
>>>>> rpc.gssd is pretty common for people not using kerberized NFS. I think
>>>>> we'll see a lot of complaints about this.
>>>>>
>>>>> Is this expected?
>>>>
>>>> Yes.
>>>>
>>>> There are operations like SETCLIENTID and GETATTR(fs_locations) which should always use an integrity-checking security flavor, even if particular mount points use sec=sys.
>>>>
>>>> There are cases where GSS is not available, and we fall back to using AUTH_SYS. That should happen as quickly as possible, I agree.
>>>>
>>>>> If so, what's the proposed remedy?
>>>>> Simply have everyone run rpc.gssd even if they're not using kerberized NFS?
>>>>
>>>>
>>>> That's one possibility. Or we could shorten the upcall timeout. Or, add a mechanism by which rpc.gssd can provide a positive indication to the kernel that it is running.
>>>>
>>>> It doesn't seem like an intractable problem.
>>>>
>>>
>>> Nope, it's not intractable at all...
>>>
>>> Currently, the gssd upcall uses the RPC_PIPE_WAIT_FOR_OPEN flag to
>>> allow you to queue upcalls to be processed when the daemon isn't up
>>> yet. When the daemon starts, it processes that queue. The caller gives
>>> up after 15s (which is what's happening here), and the upcall
>>> eventually gets scraped out of the queue after 30s.
>>>
>>> We could stop using that flag on this rpc_pipe and simply require that
>>> the daemon be up and running before attempting any sort of AUTH_GSS
>>> rpc. That might be a little less friendly in the face of boot-time
>>> ordering problems, but it should presumably make this problem go away.
>>
>> You probably don't want to do that... The main reason for the
>> RPC_PIPE_WAIT_FOR_OPEN is that even if the gssd daemon is running, it
>> takes it a moment or two to notice that a new client directory has been
>> created, and that there is a new 'krb' pipe to attach to.
>>
>
> Ok yeah, good point...
>
> Shortening the timeout will also suck -- that'll just reduce the pain
> somewhat but will still be a performance regression. It looks like even
> specifying '-o sec=sys' doesn't disable this behavior. Should it?

Nope.

We should always use krb5i if a GSS context can be established with our machine cred. As I said before, SETCLIENTID and GETATTR(fs_locations) really should use an integrity-protecting security flavor no matter what flavor is in effect on the mount points themselves.
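
As an illustrative sketch only (nfs4_try_machine_cred() is invented,
and this glosses over what 3.10 actually does internally):

#include <linux/sunrpc/clnt.h>

struct nfs_client;

/* Hypothetical helper, not implemented here: returns 0 if a GSS
 * context can be established with the client's machine credential. */
int nfs4_try_machine_cred(struct nfs_client *clp, rpc_authflavor_t flavor);

/*
 * Sketch of the policy: SETCLIENTID and GETATTR(fs_locations) prefer
 * krb5i no matter what sec= flavor the mount points use, and fall
 * back only when no GSS context is available.
 */
static rpc_authflavor_t nfs4_state_mgmt_flavor(struct nfs_client *clp)
{
	if (nfs4_try_machine_cred(clp, RPC_AUTH_GSS_KRB5I) == 0)
		return RPC_AUTH_GSS_KRB5I;
	return RPC_AUTH_UNIX;
}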

> Instead of using AUTH_GSS for SETCLIENTID by default, would it make
> sense to add a switch (module parm?) that turns it on so that it can be
> an opt-in thing rather than doing this by default?

Why add another tunable when we really should just fix the delay?

Besides, if gssd is running and no keytab exists, then the fallback to AUTH_SYS should be fast. Is that not an effective workaround until we address the delay problem?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-05-07 21:21:23

by Chuck Lever III

[permalink] [raw]
Subject: Re: long delay when mounting due to SETCLIENTID AUTH_GSS attempts


On May 7, 2013, at 5:02 PM, "J. Bruce Fields" <[email protected]> wrote:

> On Mon, May 06, 2013 at 06:33:48PM -0400, Chuck Lever wrote:
>>
>> On May 3, 2013, at 5:50 PM, Chuck Lever <[email protected]> wrote:
>>
>>>
>>> On May 3, 2013, at 5:18 PM, "J. Bruce Fields" <[email protected]> wrote:
>>>
>>>> On Fri, May 03, 2013 at 02:48:59PM -0400, Chuck Lever wrote:
>>>>> We should always use krb5i if a GSS context can be established with
>>>>> our machine cred. As I said before, SETCLIENTID and
>>>>> GETATTR(fs_locations) really should use an integrity-protecting
>>>>> security flavor no matter what flavor is in effect on the mount points
>>>>> themselves.
>>>>
>>>> Can you give an example of a threat that could be avoided by this?
>>>>
>>>> My suspicion is that in most cases an attacker with the ability to
>>>> subvert auth_sys could *also* DOS gssd, and hence force the fallback to
>>>> auth_sys.
>>>>
>>>> krb5i plus a fallback to auth_sys on failure to authenticate doesn't
>>>> sound to me much more secure than just auth_sys.
>>>
>>> Our current situation is that the first mount of a server determines the flavor to use for SETCLIENTID. So if that mount happens to be "sec=sys" the SETCLIENTID is done with AUTH_SYS no matter what the subsequent mounts request.
>>>
>>> That's just about as secure in many cases as falling back.
>>>
>>>> If we really want much security benefit from krb5i on state operations,
>>>> I think we need to really *require* krb5i.
>>>>
>>>> So I'm inclined towards Jeff's solution: don't do this unless userspace
>>>> somehow affirmatively states that it requires krb5i on state operations.
>>>
>>> This would have to be on a per-server basis. Where would an admin specify such an option? I don't believe either a mount option (too fine) or a module parameter (too coarse) is appropriate.
>>>
>>>> I agree that we should default to this when we can. But the way to move
>>>> towards that default is then to get distributions to turn on the new
>>>> module parm (or whatever it is) by default. As they do so, they can
>>>> also ensure that e.g. gssd is started.
>>>
>>> gssd is not all that's required. A keytab must be provisioned on the client and server for it to work, and that's the main issue I'm trying to address: We need "sec=krb5" mounts to work when the client has no GSS machine credential.
>>>
>>> And I think we already have a problem with gssd not picking up kernel requests quickly, as Trond pointed out.
>>>
>>> I don't feel like any of this is new, or that any of this is a strong reason to revert. But it could be a reason to move forward from here.
>>>
>>> Does gssd distinguish between:
>>>
>>> - no local keytab
>>>
>>> - no local support for GSS
>>>
>>> - server doesn't grok GSS
>>>
>>> - some network problem occurred
>>>
>>> Can it communicate that distinction to the kernel? What if we fall back only in the "no keytab" and "client's kernel has no GSS support" cases? In the "server doesn't grok" case, do not fall back if "sec=krb5*" is specified on the mount point? "A network problem occurred" is an "always fail" case.
>>
>> After some thought, here's an algorithm for selecting a flavor to use for state management operations:
>>
>> Start with krb5i.
>>
>> - If a GSS flavor is specified on the mount point, and if
>> there is no local keytab, fall back to AUTH_SYS; otherwise
>> if any other issue occurs, fail immediately. This
>> modification should address your security concern.
>>
>> - If a non-GSS flavor is specified on the mount point, or no
>> flavor is specified, and there is any problem with krb5i,
>> fall back to AUTH_SYS. This is the current 3.10 behavior,
>> and assumes there is a solution to Jeff's 15 second upcall
>> delay issue.
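
Rendered as a sketch (the helpers are invented and the keytab and
mount-flavor plumbing is hand-waved), the selection algorithm quoted
above would read roughly:

#include <linux/errno.h>
#include <linux/sunrpc/clnt.h>

struct nfs_parsed_mount_data;

/* Hypothetical helpers, not implemented here. */
bool have_local_keytab(void);
bool mount_uses_gss_flavor(const struct nfs_parsed_mount_data *data);
int try_krb5i_context(void);

/*
 * Returns the flavor to use for state management operations, or a
 * negative errno when the mount should fail outright.
 */
static int nfs4_pick_state_flavor(const struct nfs_parsed_mount_data *data)
{
	int err = try_krb5i_context();	/* start with krb5i */

	if (err == 0)
		return RPC_AUTH_GSS_KRB5I;

	if (mount_uses_gss_flavor(data)) {
		/* sec=krb5* on the mount point: only "no local keytab"
		 * may degrade to AUTH_SYS; anything else is fatal. */
		if (!have_local_keytab())
			return RPC_AUTH_UNIX;
		return err;
	}

	/* sec=sys or no sec= given: any krb5i problem falls back. */
	return RPC_AUTH_UNIX;
}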
>
> I'm having a hard time thinking of one....

Apparently not... :-)

> I suppose if gssd is running then it should always hold open the parent
> (rpc_pipefs/nfs) directory. So if that isn't open it might be safe to
> assume we can fail immediately.

I can look into that, if no-one beats me to it. I'm pretty slammed this week, though.

Note also: if rpcauth_gss.ko isn't loaded and the kernel cannot load it, the rpcauth_create() call should fail immediately. Why not prevent that module from being loaded when gssd isn't running?
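
Roughly like this (untested; assumes rpcauth_create() still takes the
flavor and rpc_clnt directly, and the wrapper name is invented):

#include <linux/err.h>
#include <linux/sunrpc/clnt.h>

/*
 * Sketch: if the GSS auth module cannot be instantiated at all (for
 * example because it was never loaded and gssd isn't running), treat
 * that as "no GSS here" and fall back right away instead of queueing
 * an upcall that sits for 15 seconds.
 */
static rpc_authflavor_t nfs4_state_flavor_or_fallback(struct rpc_clnt *clnt)
{
	struct rpc_auth *auth;

	auth = rpcauth_create(RPC_AUTH_GSS_KRB5I, clnt);
	if (IS_ERR(auth))
		return RPC_AUTH_UNIX;	/* fall back immediately */

	return RPC_AUTH_GSS_KRB5I;
}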

> Right now for most people the effect of an upgrade to 3.10 is a new 15
> second delay on mount? (I'm assuming distributions default to not
> running gssd.) Seems painful.

The effect is that people who don't run gssd will see a 15 second delay the first time they mount a server with NFSv4. Subsequent mounts of that server should not see that delay because the SETCLIENTID has already been done.

But 3.10 is not final. We have an opportunity to fix this now. So making claims about how painful this is right now is academic. Are you guys just trying to give me a hard time about this? Because it's not helping anybody. Nobody is saying this is something that should not be fixed as quickly as we can.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com