2013-09-06 23:20:51

by Kees Cook

[permalink] [raw]
Subject: race condition in crypto larval handling

Hi,

I've tracked down a race condition and ref counting problem in the
crypto API internals. We've been seeing it under Chrome OS, but it
seems it's not isolated to just us:

https://code.google.com/p/chromium/issues/detail?id=244581
http://marc.info/?l=linux-crypto-vger&m=135429403609373&w=2
https://bugzilla.redhat.com/show_bug.cgi?id=983682
http://www.mail-archive.com/[email protected]/msg07933.html

I think I've found the basic origin of the problem.
crypto_larval_add() has synchronization to make sure only a single
larval can ever be created. That logic seems totally fine. However,
this means that crypto_larval_lookup() from two threads can return the
same larval, which means that crypto_alg_mod_lookup() runs the risk of
calling crypto_larval_kill() on the same larval twice, which does not
appear to be handled sanely.

I can easily crash the kernel by forcing a synchronization point just
before the "return crypt_larval_add(...)" call in
crypto_larval_lookup(). Basically, I added this sloppy sync code (I
feel like there must be a better way to do this):

+static atomic_t global_sync = ATOMIC_INIT(0);
...
struct crypto_alg *crypto_larval_lookup(const char *name, u32 type, u32 mask)
...
+ if (strncmp(name, "asdf", 4) == 0) {
+ int value;
+
+ value = atomic_add_return(1, &global_sync);
+ if (value == 1) {
+ /* I was first to stall here, wait for inc. */
+ while (atomic_read(&global_sync) != 2)
+ cpu_relax();
+ } else if (value == 2) {
+ /* I was second to stall here, wait for dec. */
+ while (atomic_read(&global_sync) != 1)
+ cpu_relax();
+ } else {
+ BUG();
+ }
+ atomic_dec(&global_sync);
+ }

return crypto_larval_add(name, type, mask);
}

And then ran code from userspace that did, effectively:

struct sockaddr_alg sa = {
.salg_family = AF_ALG,
.salg_type = "hash",
};
strcpy(sa.salg_name, argv[1]);

fork();
...
sds[0] = socket(AF_ALG, SOCK_SEQPACKET, 0);
bind(sds[0], (struct sockaddr *) &sa, sizeof(sa));

And ran this as "./tickle asdf1" to generate the two threads trying to
find an invalid alg. The race looks possible even with valid algs, but
this was the fastest path I could see to tickle the issue.

With added printks to the kernel, it was clear that crypto_larval_kill
was being called twice on the same larval, and the list_del() call was
doing bad things. When I fixed that, the refcnt bug became very
obvious. Here's the change I made for crypto_larval_kill:

@@ -161,7 +166,8 @@ void crypto_larval_kill(struct crypto_alg *alg)
struct crypto_larval *larval = (void *)alg;

down_write(&crypto_alg_sem);
- list_del(&alg->cra_list);
+ if (!list_empty(&alg->cra_list))
+ list_del_init(&alg->cra_list);
up_write(&crypto_alg_sem);
complete_all(&larval->completion);
crypto_alg_put(alg);

It seems that there are too many "put" calls (mixed between
crypto_alg_put() and crypto_mod_put(), which also calls
crypto_alg_put()). See this annotated portion of
crypto_alg_mod_lookup:

/* This can (correctly) return the same larval for two threads. */
larval = crypto_larval_lookup(name, type, mask);
if (IS_ERR(larval) || !crypto_is_larval(larval))
return larval;

ok = crypto_probing_notify(CRYPTO_MSG_ALG_REQUEST, larval);

if (ok == NOTIFY_STOP)
/* This calls crypto_mod_put(), but sometimes also get?? */
alg = crypto_larval_wait(larval);
else {
/* This directly calls crypto_mod_put */
crypto_mod_put(larval);
alg = ERR_PTR(-ENOENT);
}
/* This calls crypto_alg_put */
crypto_larval_kill(larval);
return alg;

In the two-thread situation, the first thread gets a larval with
refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the
larval via crypto_larval_add's call to __crypto_alg_lookup() and sees
the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread
decrements the ref count twice.

It seems to me like either each call to crypto_larval_lookup() should
result in a refcount bumped by two, or crypto_alg_mod_lookup() should
decrement only once, and the initial refcount should be 1 not 2 from
crypto_larval_add. But it's not clear to me which is sensible here.

What's the right solution here?

Thanks,

-Kees

--
Kees Cook
Chrome OS Security


2013-09-07 14:39:22

by Neil Horman

[permalink] [raw]
Subject: Re: race condition in crypto larval handling

On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote:
> Hi,
>
> I've tracked down a race condition and ref counting problem in the
> crypto API internals. We've been seeing it under Chrome OS, but it
> seems it's not isolated to just us:
>
> https://code.google.com/p/chromium/issues/detail?id=244581
> http://marc.info/?l=linux-crypto-vger&m=135429403609373&w=2
> https://bugzilla.redhat.com/show_bug.cgi?id=983682
> http://www.mail-archive.com/[email protected]/msg07933.html
>
> I think I've found the basic origin of the problem.
> crypto_larval_add() has synchronization to make sure only a single
> larval can ever be created. That logic seems totally fine. However,
> this means that crypto_larval_lookup() from two threads can return the
> same larval, which means that crypto_alg_mod_lookup() runs the risk of
> calling crypto_larval_kill() on the same larval twice, which does not
> appear to be handled sanely.
>
> I can easily crash the kernel by forcing a synchronization point just
> before the "return crypt_larval_add(...)" call in
> crypto_larval_lookup(). Basically, I added this sloppy sync code (I
> feel like there must be a better way to do this):
>
> +static atomic_t global_sync = ATOMIC_INIT(0);
> ...
> struct crypto_alg *crypto_larval_lookup(const char *name, u32 type, u32 mask)
> ...
> + if (strncmp(name, "asdf", 4) == 0) {
> + int value;
> +
> + value = atomic_add_return(1, &global_sync);
> + if (value == 1) {
> + /* I was first to stall here, wait for inc. */
> + while (atomic_read(&global_sync) != 2)
> + cpu_relax();
> + } else if (value == 2) {
> + /* I was second to stall here, wait for dec. */
> + while (atomic_read(&global_sync) != 1)
> + cpu_relax();
> + } else {
> + BUG();
> + }
> + atomic_dec(&global_sync);
> + }
>
> return crypto_larval_add(name, type, mask);
> }
>
> And then ran code from userspace that did, effectively:
>
> struct sockaddr_alg sa = {
> .salg_family = AF_ALG,
> .salg_type = "hash",
> };
> strcpy(sa.salg_name, argv[1]);
>
> fork();
> ...
> sds[0] = socket(AF_ALG, SOCK_SEQPACKET, 0);
> bind(sds[0], (struct sockaddr *) &sa, sizeof(sa));
>
> And ran this as "./tickle asdf1" to generate the two threads trying to
> find an invalid alg. The race looks possible even with valid algs, but
> this was the fastest path I could see to tickle the issue.
>
> With added printks to the kernel, it was clear that crypto_larval_kill
> was being called twice on the same larval, and the list_del() call was
> doing bad things. When I fixed that, the refcnt bug became very
> obvious. Here's the change I made for crypto_larval_kill:
>
> @@ -161,7 +166,8 @@ void crypto_larval_kill(struct crypto_alg *alg)
> struct crypto_larval *larval = (void *)alg;
>
> down_write(&crypto_alg_sem);
> - list_del(&alg->cra_list);
> + if (!list_empty(&alg->cra_list))
> + list_del_init(&alg->cra_list);
> up_write(&crypto_alg_sem);
> complete_all(&larval->completion);
> crypto_alg_put(alg);
>
> It seems that there are too many "put" calls (mixed between
> crypto_alg_put() and crypto_mod_put(), which also calls
> crypto_alg_put()). See this annotated portion of
> crypto_alg_mod_lookup:
>
> /* This can (correctly) return the same larval for two threads. */
> larval = crypto_larval_lookup(name, type, mask);
> if (IS_ERR(larval) || !crypto_is_larval(larval))
> return larval;
>
> ok = crypto_probing_notify(CRYPTO_MSG_ALG_REQUEST, larval);
>
> if (ok == NOTIFY_STOP)
> /* This calls crypto_mod_put(), but sometimes also get?? */
> alg = crypto_larval_wait(larval);
> else {
> /* This directly calls crypto_mod_put */
> crypto_mod_put(larval);
> alg = ERR_PTR(-ENOENT);
> }
> /* This calls crypto_alg_put */
> crypto_larval_kill(larval);
> return alg;
>
> In the two-thread situation, the first thread gets a larval with
> refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the
> larval via crypto_larval_add's call to __crypto_alg_lookup() and sees
> the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread
> decrements the ref count twice.
>
> It seems to me like either each call to crypto_larval_lookup() should
> result in a refcount bumped by two, or crypto_alg_mod_lookup() should
> decrement only once, and the initial refcount should be 1 not 2 from
> crypto_larval_add. But it's not clear to me which is sensible here.
>
> What's the right solution here?
>
> Thanks,
>
> -Kees
>
> --
> Kees Cook
> Chrome OS Security
> --
> To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
I've been tracking this problem too:
https://bugzilla.redhat.com/show_bug.cgi?id=1002351

Though I'd just started so I've not gotten anywhere near this far. It seems,
given your analysis above that we really need to rework the refcounting here.
At the very least we probably need to:

1) only modify the module reference count when a larval for a given alg is
created, or when its last refcount is decremented.

2) fix up the cra_refcount to start at one and increment for each lookup, and
decrement for each kill.

Neil

2013-09-07 17:10:19

by Kees Cook

[permalink] [raw]
Subject: Re: race condition in crypto larval handling

On Sat, Sep 7, 2013 at 7:39 AM, Neil Horman <[email protected]> wrote:
> On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote:
>> Hi,
>>
>> I've tracked down a race condition and ref counting problem in the
>> crypto API internals. We've been seeing it under Chrome OS, but it
>> seems it's not isolated to just us:
>>
>> https://code.google.com/p/chromium/issues/detail?id=244581
>> http://marc.info/?l=linux-crypto-vger&m=135429403609373&w=2
>> https://bugzilla.redhat.com/show_bug.cgi?id=983682
>> http://www.mail-archive.com/[email protected]/msg07933.html
>>
>> I think I've found the basic origin of the problem.
>> crypto_larval_add() has synchronization to make sure only a single
>> larval can ever be created. That logic seems totally fine. However,
>> this means that crypto_larval_lookup() from two threads can return the
>> same larval, which means that crypto_alg_mod_lookup() runs the risk of
>> calling crypto_larval_kill() on the same larval twice, which does not
>> appear to be handled sanely.
>>
>> I can easily crash the kernel by forcing a synchronization point just
>> before the "return crypt_larval_add(...)" call in
>> crypto_larval_lookup(). Basically, I added this sloppy sync code (I
>> feel like there must be a better way to do this):
>>
>> +static atomic_t global_sync = ATOMIC_INIT(0);
>> ...
>> struct crypto_alg *crypto_larval_lookup(const char *name, u32 type, u32 mask)
>> ...
>> + if (strncmp(name, "asdf", 4) == 0) {
>> + int value;
>> +
>> + value = atomic_add_return(1, &global_sync);
>> + if (value == 1) {
>> + /* I was first to stall here, wait for inc. */
>> + while (atomic_read(&global_sync) != 2)
>> + cpu_relax();
>> + } else if (value == 2) {
>> + /* I was second to stall here, wait for dec. */
>> + while (atomic_read(&global_sync) != 1)
>> + cpu_relax();
>> + } else {
>> + BUG();
>> + }
>> + atomic_dec(&global_sync);
>> + }
>>
>> return crypto_larval_add(name, type, mask);
>> }
>>
>> And then ran code from userspace that did, effectively:
>>
>> struct sockaddr_alg sa = {
>> .salg_family = AF_ALG,
>> .salg_type = "hash",
>> };
>> strcpy(sa.salg_name, argv[1]);
>>
>> fork();
>> ...
>> sds[0] = socket(AF_ALG, SOCK_SEQPACKET, 0);
>> bind(sds[0], (struct sockaddr *) &sa, sizeof(sa));
>>
>> And ran this as "./tickle asdf1" to generate the two threads trying to
>> find an invalid alg. The race looks possible even with valid algs, but
>> this was the fastest path I could see to tickle the issue.
>>
>> With added printks to the kernel, it was clear that crypto_larval_kill
>> was being called twice on the same larval, and the list_del() call was
>> doing bad things. When I fixed that, the refcnt bug became very
>> obvious. Here's the change I made for crypto_larval_kill:
>>
>> @@ -161,7 +166,8 @@ void crypto_larval_kill(struct crypto_alg *alg)
>> struct crypto_larval *larval = (void *)alg;
>>
>> down_write(&crypto_alg_sem);
>> - list_del(&alg->cra_list);
>> + if (!list_empty(&alg->cra_list))
>> + list_del_init(&alg->cra_list);
>> up_write(&crypto_alg_sem);
>> complete_all(&larval->completion);
>> crypto_alg_put(alg);
>>
>> It seems that there are too many "put" calls (mixed between
>> crypto_alg_put() and crypto_mod_put(), which also calls
>> crypto_alg_put()). See this annotated portion of
>> crypto_alg_mod_lookup:
>>
>> /* This can (correctly) return the same larval for two threads. */
>> larval = crypto_larval_lookup(name, type, mask);
>> if (IS_ERR(larval) || !crypto_is_larval(larval))
>> return larval;
>>
>> ok = crypto_probing_notify(CRYPTO_MSG_ALG_REQUEST, larval);
>>
>> if (ok == NOTIFY_STOP)
>> /* This calls crypto_mod_put(), but sometimes also get?? */
>> alg = crypto_larval_wait(larval);
>> else {
>> /* This directly calls crypto_mod_put */
>> crypto_mod_put(larval);
>> alg = ERR_PTR(-ENOENT);
>> }
>> /* This calls crypto_alg_put */
>> crypto_larval_kill(larval);
>> return alg;
>>
>> In the two-thread situation, the first thread gets a larval with
>> refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the
>> larval via crypto_larval_add's call to __crypto_alg_lookup() and sees
>> the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread
>> decrements the ref count twice.
>>
>> It seems to me like either each call to crypto_larval_lookup() should
>> result in a refcount bumped by two, or crypto_alg_mod_lookup() should
>> decrement only once, and the initial refcount should be 1 not 2 from
>> crypto_larval_add. But it's not clear to me which is sensible here.
>>
>> What's the right solution here?
>>
>> Thanks,
>>
>> -Kees
>>
>> --
>> Kees Cook
>> Chrome OS Security
>>
> I've been tracking this problem too:
> https://bugzilla.redhat.com/show_bug.cgi?id=1002351
>
> Though I'd just started so I've not gotten anywhere near this far. It seems,
> given your analysis above that we really need to rework the refcounting here.
> At the very least we probably need to:
>
> 1) only modify the module reference count when a larval for a given alg is
> created, or when its last refcount is decremented.
>
> 2) fix up the cra_refcount to start at one and increment for each lookup, and
> decrement for each kill.

What I haven't been able to figure out is the "expected" behavior of a
larval. crypto_larval_kill() seems to be the only thing that removes
it from the alg list, so that seems like the only place a "put" should
be happening. Though that put should likely be the mod_put not the
alg_put. I don't think larval_wait should be doing a put, but it also
can perform a "get", so I'm baffled by that too. :)

-Kees

--
Kees Cook
Chrome OS Security

2013-09-08 01:32:20

by Herbert Xu

[permalink] [raw]
Subject: Re: race condition in crypto larval handling

On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote:
>
> In the two-thread situation, the first thread gets a larval with
> refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the
> larval via crypto_larval_add's call to __crypto_alg_lookup() and sees
> the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread
> decrements the ref count twice.
>
> It seems to me like either each call to crypto_larval_lookup() should
> result in a refcount bumped by two, or crypto_alg_mod_lookup() should
> decrement only once, and the initial refcount should be 1 not 2 from
> crypto_larval_add. But it's not clear to me which is sensible here.
>
> What's the right solution here?

First of all thanks a lot for tracking this problem down! It's
been bothering me for months but I was unable to find a good
reproducer.

So now that you've identified the problem, the solution is easy.
crypto_larval_lookup should only ever return a larval if it created
one. Any larval created earlier must be waited on first before we
return.

So does this patch fix the crash for you?

diff --git a/crypto/api.c b/crypto/api.c
index 320ea4d..a2b39c5 100644
--- a/crypto/api.c
+++ b/crypto/api.c
@@ -34,6 +34,8 @@ EXPORT_SYMBOL_GPL(crypto_alg_sem);
BLOCKING_NOTIFIER_HEAD(crypto_chain);
EXPORT_SYMBOL_GPL(crypto_chain);

+static struct crypto_alg *crypto_larval_wait(struct crypto_alg *alg);
+
struct crypto_alg *crypto_mod_get(struct crypto_alg *alg)
{
return try_module_get(alg->cra_module) ? crypto_alg_get(alg) : NULL;
@@ -144,8 +146,11 @@ static struct crypto_alg *crypto_larval_add(const char *name, u32 type,
}
up_write(&crypto_alg_sem);

- if (alg != &larval->alg)
+ if (alg != &larval->alg) {
kfree(larval);
+ if (crypto_is_larval(alg))
+ alg = crypto_larval_wait(alg);
+ }

return alg;
}

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2013-09-08 03:34:15

by Kees Cook

[permalink] [raw]
Subject: Re: race condition in crypto larval handling

On Sat, Sep 7, 2013 at 6:32 PM, Herbert Xu <[email protected]> wrote:
> On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote:
>>
>> In the two-thread situation, the first thread gets a larval with
>> refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the
>> larval via crypto_larval_add's call to __crypto_alg_lookup() and sees
>> the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread
>> decrements the ref count twice.
>>
>> It seems to me like either each call to crypto_larval_lookup() should
>> result in a refcount bumped by two, or crypto_alg_mod_lookup() should
>> decrement only once, and the initial refcount should be 1 not 2 from
>> crypto_larval_add. But it's not clear to me which is sensible here.
>>
>> What's the right solution here?
>
> First of all thanks a lot for tracking this problem down! It's
> been bothering me for months but I was unable to find a good
> reproducer.
>
> So now that you've identified the problem, the solution is easy.
> crypto_larval_lookup should only ever return a larval if it created
> one. Any larval created earlier must be waited on first before we
> return.
>
> So does this patch fix the crash for you?
>
> diff --git a/crypto/api.c b/crypto/api.c
> index 320ea4d..a2b39c5 100644
> --- a/crypto/api.c
> +++ b/crypto/api.c
> @@ -34,6 +34,8 @@ EXPORT_SYMBOL_GPL(crypto_alg_sem);
> BLOCKING_NOTIFIER_HEAD(crypto_chain);
> EXPORT_SYMBOL_GPL(crypto_chain);
>
> +static struct crypto_alg *crypto_larval_wait(struct crypto_alg *alg);
> +
> struct crypto_alg *crypto_mod_get(struct crypto_alg *alg)
> {
> return try_module_get(alg->cra_module) ? crypto_alg_get(alg) : NULL;
> @@ -144,8 +146,11 @@ static struct crypto_alg *crypto_larval_add(const char *name, u32 type,
> }
> up_write(&crypto_alg_sem);
>
> - if (alg != &larval->alg)
> + if (alg != &larval->alg) {
> kfree(larval);
> + if (crypto_is_larval(alg))
> + alg = crypto_larval_wait(alg);
> + }
>
> return alg;
> }

Thanks! This does stop the crash I was seeing on the "race with no
valid alg" path, yay!

However, I noticed on the "good" path (even without the above patch),
I sometimes see a double-kfree triggered by the modprobe process. I
can't, however, see how that's happening, since larval_destroy should
only be called when refcnt == 0.

Here's the debugging output I added that noticed this. pid 15797 and
15798 are my "hash" tool that forks before doing identical AF_ALG
calls for "sha512", which has to be modprobed. The trailing "31" and
"32" was an atomic counter I added just to make sure I wasn't going
crazy:

[ 112.495789] crypto_alg_mod_lookup 15797: looking up [sha512]
[ 112.495815] crypto_larval_lookup 15797: looking up [sha512]
[ 112.495839] crypto_larval_lookup 15797: requesting module [sha512]
[ 112.495915] crypto_alg_mod_lookup 15798: looking up [sha512]
[ 112.495937] crypto_larval_lookup 15798: looking up [sha512]
[ 112.495960] crypto_larval_lookup 15798: requesting module [sha512]
[ 112.498691] crypto_larval_destroy 15808(modprobe): larval
kfree(ffff880123e9da00) refcnt:0 (serial:31)
[ 112.498771] crypto_larval_destroy 15808(modprobe): larval
kfree(ffff880123e9da00) refcnt:0 (serial:32)
[ 112.500888] crypto_larval_lookup 15797: after requesting module
[sha512], got alg ffffffffc0299050
[ 112.500904] crypto_larval_lookup 15797: for [sha512], have alg
ffffffffc0299050
[ 112.500934] crypto_larval_lookup 15797: for [sha512], alg
ffffffffc0299050 is NOT larval
[ 112.500953] crypto_alg_mod_lookup 15797: [sha512] is not larval
ffffffffc0299050
[ 112.501384] crypto_larval_lookup 15798: after requesting module
[sha512], got alg ffffffffc0299050
[ 112.501404] crypto_larval_lookup 15798: for [sha512], have alg
ffffffffc0299050
[ 112.501417] crypto_larval_lookup 15798: for [sha512], alg
ffffffffc0299050 is NOT larval
[ 112.501432] crypto_alg_mod_lookup 15798: [sha512] is not larval
ffffffffc0299050

Regardless, this seems to be a separate bug.

-Kees

--
Kees Cook
Chrome OS Security

2013-09-08 04:37:18

by Herbert Xu

[permalink] [raw]
Subject: Re: race condition in crypto larval handling

On Sat, Sep 07, 2013 at 08:34:15PM -0700, Kees Cook wrote:
>
> However, I noticed on the "good" path (even without the above patch),
> I sometimes see a double-kfree triggered by the modprobe process. I
> can't, however, see how that's happening, since larval_destroy should
> only be called when refcnt == 0.

Do you still see this double free with this patch? Without the
patch it is completely expected as killing the same lavral twice
will cause memory corruption leading to all sorts of weirdness,
even if you stop it from deleting the list entry twice.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2013-09-08 04:54:22

by Herbert Xu

[permalink] [raw]
Subject: Re: race condition in crypto larval handling

On Sun, Sep 08, 2013 at 02:37:03PM +1000, Herbert Xu wrote:
> On Sat, Sep 07, 2013 at 08:34:15PM -0700, Kees Cook wrote:
> >
> > However, I noticed on the "good" path (even without the above patch),
> > I sometimes see a double-kfree triggered by the modprobe process. I
> > can't, however, see how that's happening, since larval_destroy should
> > only be called when refcnt == 0.
>
> Do you still see this double free with this patch? Without the
> patch it is completely expected as killing the same lavral twice
> will cause memory corruption leading to all sorts of weirdness,
> even if you stop it from deleting the list entry twice.

Actually I know what it is. sha512 registers two algorithms.
Therefore, it will create two larvals in sequence and then destroy
them in turn. So it's not a double free at all. If you put a
printk in crypto_larval_alloc that should confirm this.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2013-09-08 06:01:20

by Kees Cook

[permalink] [raw]
Subject: Re: race condition in crypto larval handling

On Sat, Sep 7, 2013 at 9:54 PM, Herbert Xu <[email protected]> wrote:
> On Sun, Sep 08, 2013 at 02:37:03PM +1000, Herbert Xu wrote:
>> On Sat, Sep 07, 2013 at 08:34:15PM -0700, Kees Cook wrote:
>> >
>> > However, I noticed on the "good" path (even without the above patch),
>> > I sometimes see a double-kfree triggered by the modprobe process. I
>> > can't, however, see how that's happening, since larval_destroy should
>> > only be called when refcnt == 0.
>>
>> Do you still see this double free with this patch? Without the
>> patch it is completely expected as killing the same lavral twice
>> will cause memory corruption leading to all sorts of weirdness,
>> even if you stop it from deleting the list entry twice.

I noticed while testing the larval_kill fix, and then tried it again
after reverting the fix -- both showed the behavior.

> Actually I know what it is. sha512 registers two algorithms.
> Therefore, it will create two larvals in sequence and then destroy
> them in turn. So it's not a double free at all. If you put a
> printk in crypto_larval_alloc that should confirm this.

Ah! That would make sense; it just happens to re-allocate to the exact
same location, yes. Whew, that's certainly what's happening. I can
retest to confirm in my morning.

Thanks again for the larval_kill fix! I'll get it rolled out for wider
testing to confirm that it make our crash numbers go down.

-Kees

--
Kees Cook
Chrome OS Security

2013-09-08 15:52:12

by Kees Cook

[permalink] [raw]
Subject: Re: race condition in crypto larval handling

On Sat, Sep 7, 2013 at 11:01 PM, Kees Cook <[email protected]> wrote:
> On Sat, Sep 7, 2013 at 9:54 PM, Herbert Xu <[email protected]> wrote:
>> On Sun, Sep 08, 2013 at 02:37:03PM +1000, Herbert Xu wrote:
>>> On Sat, Sep 07, 2013 at 08:34:15PM -0700, Kees Cook wrote:
>>> >
>>> > However, I noticed on the "good" path (even without the above patch),
>>> > I sometimes see a double-kfree triggered by the modprobe process. I
>>> > can't, however, see how that's happening, since larval_destroy should
>>> > only be called when refcnt == 0.
>>>
>>> Do you still see this double free with this patch? Without the
>>> patch it is completely expected as killing the same lavral twice
>>> will cause memory corruption leading to all sorts of weirdness,
>>> even if you stop it from deleting the list entry twice.
>>
>> Actually I know what it is. sha512 registers two algorithms.
>> Therefore, it will create two larvals in sequence and then destroy
>> them in turn. So it's not a double free at all. If you put a
>> printk in crypto_larval_alloc that should confirm this.
>
> Ah! That would make sense; it just happens to re-allocate to the exact
> same location, yes. Whew, that's certainly what's happening. I can
> retest to confirm in my morning.

Confirmed: 2 allocs happen, and then 2 kfrees. :)

-Kees

--
Kees Cook
Chrome OS Security

2013-09-09 01:41:25

by James Yonan

[permalink] [raw]
Subject: Re: race condition in crypto larval handling

On 07/09/2013 19:32, Herbert Xu wrote:
> On Fri, Sep 06, 2013 at 04:20:50PM -0700, Kees Cook wrote:
>>
>> In the two-thread situation, the first thread gets a larval with
>> refcnt 2 via crypto_larval_add. (Why 2?) The next thread finds the
>> larval via crypto_larval_add's call to __crypto_alg_lookup() and sees
>> the ref bump to 3. While exiting crypto_alg_mod_lookup, each thread
>> decrements the ref count twice.
>>
>> It seems to me like either each call to crypto_larval_lookup() should
>> result in a refcount bumped by two, or crypto_alg_mod_lookup() should
>> decrement only once, and the initial refcount should be 1 not 2 from
>> crypto_larval_add. But it's not clear to me which is sensible here.
>>
>> What's the right solution here?
>
> First of all thanks a lot for tracking this problem down! It's
> been bothering me for months but I was unable to find a good
> reproducer.
>
> So now that you've identified the problem, the solution is easy.
> crypto_larval_lookup should only ever return a larval if it created
> one. Any larval created earlier must be waited on first before we
> return.
>
> So does this patch fix the crash for you?
>
> diff --git a/crypto/api.c b/crypto/api.c
> index 320ea4d..a2b39c5 100644
> --- a/crypto/api.c
> +++ b/crypto/api.c
> @@ -34,6 +34,8 @@ EXPORT_SYMBOL_GPL(crypto_alg_sem);
> BLOCKING_NOTIFIER_HEAD(crypto_chain);
> EXPORT_SYMBOL_GPL(crypto_chain);
>
> +static struct crypto_alg *crypto_larval_wait(struct crypto_alg *alg);
> +
> struct crypto_alg *crypto_mod_get(struct crypto_alg *alg)
> {
> return try_module_get(alg->cra_module) ? crypto_alg_get(alg) : NULL;
> @@ -144,8 +146,11 @@ static struct crypto_alg *crypto_larval_add(const char *name, u32 type,
> }
> up_write(&crypto_alg_sem);
>
> - if (alg != &larval->alg)
> + if (alg != &larval->alg) {
> kfree(larval);
> + if (crypto_is_larval(alg))
> + alg = crypto_larval_wait(alg);
> + }
>
> return alg;
> }

I tried this patch, but I still see an apparent module lookup/load race
if code on several CPUs calls crypto_alloc_aead at the same time, and an
external module such as aes needs to be loaded.

Seeing this in the log: "request_module: runaway loop modprobe gcm(aes)"

Shouldn't module lookup/load be bracketed by some sort of lock to
prevent this?

James

2013-09-10 23:13:05

by Herbert Xu

[permalink] [raw]
Subject: Re: race condition in crypto larval handling

Please don't trim cc lists!

James Yonan <[email protected]> wrote:
>
> I tried this patch, but I still see an apparent module lookup/load race
> if code on several CPUs calls crypto_alloc_aead at the same time, and an
> external module such as aes needs to be loaded.
>
> Seeing this in the log: "request_module: runaway loop modprobe gcm(aes)"
>
> Shouldn't module lookup/load be bracketed by some sort of lock to
> prevent this?

This sounds like a different problem. Larvals are only used to
control the instantiation of templates. The loading of modules
occur prior to that so you can certainly have multiple loads occur
if multiple threads are trying to allocate the same algorithm.

This should be harmless since gcm(aes) doesn't exist anyway so
regardless of whether we load the module the lookup will fail and
we will go through the larval code path.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt