2014-06-02 04:21:18

by Joonsoo Kim

Subject: Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately

On Sat, May 31, 2014 at 03:04:58PM +0400, Vladimir Davydov wrote:
> On Fri, May 30, 2014 at 09:57:10AM -0500, Christoph Lameter wrote:
> > On Fri, 30 May 2014, Vladimir Davydov wrote:
> >
> > > (3) is a bit more difficult, because slabs are added to per-cpu partial
> > > lists lock-less. Fortunately, we only have to handle the __slab_free
> > > case, because, as there shouldn't be any allocation requests dispatched
> > > to a dead memcg cache, get_partial_node() should never be called. In
> > > __slab_free we use cmpxchg to modify kmem_cache_cpu->partial (see
> > > put_cpu_partial) so that setting ->partial to a special value, which
> > > will make put_cpu_partial bail out, will do the trick.
> > >
> > > Note, this shouldn't affect performance, because keeping empty slabs on
> > > per node lists as well as using per cpu partials are only worthwhile if
> > > the cache is used for allocations, which isn't the case for dead caches.
> >
> > This all sounds pretty good to me but we still have some pretty extensive
> > modifications that I would rather avoid.
> >
> > In put_cpu_partial you can simply check that the memcg is dead right? This
> > would avoid all the other modifications I would think and will not require
> > a special value for the per cpu partial pointer.
>
> That would be racy. The check if memcg is dead and the write to per cpu
> partial ptr wouldn't proceed as one atomic operation. If we set the dead
> flag from another thread between these two operations, put_cpu_partial
> will add a slab to a per cpu partial list *after* the cache was zapped.
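The race being described, sketched as a timeline (an illustration only; memcg_cache_dead() is the dead-flag test introduced by this patch set, and the call sites are simplified):

/*
 * CPU 0: __slab_free() -> put_cpu_partial()      CPU 1: memcg offline
 * -----------------------------------------      --------------------
 * if (memcg_cache_dead(s))        // false
 *         return;
 *                                                set dead flag on s
 *                                                zap s's partial lists
 * this_cpu_cmpxchg(s->cpu_slab->partial,
 *                  oldpage, page);               // slab queued *after*
 *                                                // the zap, so it is
 *                                                // never reclaimed
 */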

Hello, Vladimir.

I think that we can do (3) easily.
If we check memcg_cache_dead() at the end of put_cpu_partial() rather
than at the beginning, we can avoid the race you mentioned. If someone
calls put_cpu_partial() before the dead flag is set, the slab can be
zapped by whoever sets the dead flag. And if someone calls
put_cpu_partial() after the dead flag is set, it can be zapped by
whoever called put_cpu_partial().

Thanks.


2014-06-02 11:48:00

by Vladimir Davydov

Subject: Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately

Hi Joonsoo,

On Mon, Jun 02, 2014 at 01:24:36PM +0900, Joonsoo Kim wrote:
> On Sat, May 31, 2014 at 03:04:58PM +0400, Vladimir Davydov wrote:
> > On Fri, May 30, 2014 at 09:57:10AM -0500, Christoph Lameter wrote:
> > > On Fri, 30 May 2014, Vladimir Davydov wrote:
> > >
> > > > (3) is a bit more difficult, because slabs are added to per-cpu partial
> > > > lists lock-less. Fortunately, we only have to handle the __slab_free
> > > > case, because, as there shouldn't be any allocation requests dispatched
> > > > to a dead memcg cache, get_partial_node() should never be called. In
> > > > __slab_free we use cmpxchg to modify kmem_cache_cpu->partial (see
> > > > put_cpu_partial) so that setting ->partial to a special value, which
> > > > will make put_cpu_partial bail out, will do the trick.
[...]
> I think that we can do (3) easily.
> If we check memcg_cache_dead() at the end of put_cpu_partial() rather
> than at the beginning, we can avoid the race you mentioned. If someone
> calls put_cpu_partial() before the dead flag is set, the slab can be
> zapped by whoever sets the dead flag. And if someone calls
> put_cpu_partial() after the dead flag is set, it can be zapped by
> whoever called put_cpu_partial().

After put_cpu_partial() adds a frozen slab to a per cpu partial list,
the slab becomes visible to other threads, which means it can be
unfrozen and freed. The latter can trigger cache destruction. Hence we
shouldn't touch the cache, in particular call memcg_cache_dead() on it,
after calling put_cpu_partial(), otherwise we can get use-after-free.

However, what you propose makes sense if we disable irqs before adding a
slab to a partial list and enable them only after checking if the cache
is dead and unfreezing all partials if so, i.e.

diff --git a/mm/slub.c b/mm/slub.c
index d96faa2464c3..14b9e9a8677c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2030,8 +2030,15 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 	struct page *oldpage;
 	int pages;
 	int pobjects;
+	unsigned long flags;
+	int irq_saved = 0;
 
 	do {
+		if (irq_saved) {
+			local_irq_restore(flags);
+			irq_saved = 0;
+		}
+
 		pages = 0;
 		pobjects = 0;
 		oldpage = this_cpu_read(s->cpu_slab->partial);
@@ -2062,8 +2069,16 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 		page->pobjects = pobjects;
 		page->next = oldpage;
 
+		local_irq_save(flags);
+		irq_saved = 1;
+
 	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
 								!= oldpage);
+
+	if (memcg_cache_dead(s))
+		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+
+	local_irq_restore(flags);
 #endif
 }


That would be safe against possible cache destruction, because to remove
a slab from a per cpu partial list we have to run on the cpu it was
frozen on. Disabling irqs makes it impossible.

Christoph,

Does it look better to you? BTW, why can't we *always* disable irqs for
the whole put_cpu_partial()? That way handling dead caches there would
be trivial, and we wouldn't have to use this_cpu_cmpxchg().
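For concreteness, a minimal sketch of what an always-irqs-off variant could look like (an illustration of the question above, not a proposed patch; it assumes the memcg_cache_dead() helper from this series and keeps the existing drain accounting):

static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
	struct page *oldpage;
	int pages = 0;
	int pobjects = 0;
	unsigned long flags;

	/*
	 * With interrupts off for the whole function nothing else can touch
	 * this cpu's partial list, so plain this_cpu_read()/this_cpu_write()
	 * replace the this_cpu_cmpxchg() retry loop.
	 */
	local_irq_save(flags);
	oldpage = this_cpu_read(s->cpu_slab->partial);
	if (oldpage) {
		pobjects = oldpage->pobjects;
		pages = oldpage->pages;
		if (drain && pobjects > s->cpu_partial) {
			/* partial array is full: move it to the node list */
			unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
			oldpage = NULL;
			pobjects = 0;
			pages = 0;
			stat(s, CPU_PARTIAL_DRAIN);
		}
	}

	page->pages = pages + 1;
	page->pobjects = pobjects + page->objects - page->inuse;
	page->next = oldpage;
	this_cpu_write(s->cpu_slab->partial, page);

	/* A dead cache must not keep empty slabs around: drain right away. */
	if (memcg_cache_dead(s))
		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));

	local_irq_restore(flags);
#endif
}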

Thanks.

2014-06-02 14:03:54

by Joonsoo Kim

Subject: Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately

2014-06-02 20:47 GMT+09:00 Vladimir Davydov <[email protected]>:
> Hi Joonsoo,
>
> On Mon, Jun 02, 2014 at 01:24:36PM +0900, Joonsoo Kim wrote:
>> On Sat, May 31, 2014 at 03:04:58PM +0400, Vladimir Davydov wrote:
>> > On Fri, May 30, 2014 at 09:57:10AM -0500, Christoph Lameter wrote:
>> > > On Fri, 30 May 2014, Vladimir Davydov wrote:
>> > >
>> > > > (3) is a bit more difficult, because slabs are added to per-cpu partial
>> > > > lists lock-less. Fortunately, we only have to handle the __slab_free
>> > > > case, because, as there shouldn't be any allocation requests dispatched
>> > > > to a dead memcg cache, get_partial_node() should never be called. In
>> > > > __slab_free we use cmpxchg to modify kmem_cache_cpu->partial (see
>> > > > put_cpu_partial) so that setting ->partial to a special value, which
>> > > > will make put_cpu_partial bail out, will do the trick.
> [...]
>> I think that we can do (3) easily.
>> If we check memcg_cache_dead() at the end of put_cpu_partial() rather
>> than at the beginning, we can avoid the race you mentioned. If someone
>> calls put_cpu_partial() before the dead flag is set, the slab can be
>> zapped by whoever sets the dead flag. And if someone calls
>> put_cpu_partial() after the dead flag is set, it can be zapped by
>> whoever called put_cpu_partial().
>
> After put_cpu_partial() adds a frozen slab to a per cpu partial list,
> the slab becomes visible to other threads, which means it can be
> unfrozen and freed. The latter can trigger cache destruction. Hence we
> shouldn't touch the cache, in particular call memcg_cache_dead() on it,
> after calling put_cpu_partial(), otherwise we can get use-after-free.
>
> However, what you propose makes sense if we disable irqs before adding a
> slab to a partial list and enable them only after checking if the cache
> is dead and unfreezing all partials if so, i.e.
>
> diff --git a/mm/slub.c b/mm/slub.c
> index d96faa2464c3..14b9e9a8677c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2030,8 +2030,15 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> struct page *oldpage;
> int pages;
> int pobjects;
> + unsigned long flags;
> + int irq_saved = 0;
>
> do {
> + if (irq_saved) {
> + local_irq_restore(flags);
> + irq_saved = 0;
> + }
> +
> pages = 0;
> pobjects = 0;
> oldpage = this_cpu_read(s->cpu_slab->partial);
> @@ -2062,8 +2069,16 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> page->pobjects = pobjects;
> page->next = oldpage;
>
> + local_irq_save(flags);
> + irq_saved = 1;
> +
> } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
> != oldpage);
> +
> + if (memcg_cache_dead(s))
> + unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
> +
> + local_irq_restore(flags);
> #endif
> }
>
>
> That would be safe against possible cache destruction, because to remove
> a slab from a per cpu partial list we have to run on the cpu it was
> frozen on. Disabling irqs makes it impossible.

Hmm... this is also a bit ugly.
How about the following change?

Thanks.

diff --git a/mm/slub.c b/mm/slub.c
index 2b1ce69..6adab87 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2058,6 +2058,21 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 
 	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
 								!= oldpage);
+
+	if (memcg_cache_dead(s)) {
+		bool done = false;
+		unsigned long flags;
+
+		local_irq_save(flags);
+		if (this_cpu_read(s->cpu_slab->partial) == page) {
+			done = true;
+			unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+		}
+		local_irq_restore(flags);
+
+		if (!done)
+			flush_all(s);
+	}
 #endif
 }

by Christoph Lameter

Subject: Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately

On Mon, 2 Jun 2014, Joonsoo Kim wrote:

> Hmm... this is also a bit ugly.
> How about the following change?

That looks much cleaner.

2014-06-03 08:17:18

by Vladimir Davydov

Subject: Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately

On Mon, Jun 02, 2014 at 11:03:51PM +0900, Joonsoo Kim wrote:
> 2014-06-02 20:47 GMT+09:00 Vladimir Davydov <[email protected]>:
> > Hi Joonsoo,
> >
> > On Mon, Jun 02, 2014 at 01:24:36PM +0900, Joonsoo Kim wrote:
> >> On Sat, May 31, 2014 at 03:04:58PM +0400, Vladimir Davydov wrote:
> >> > On Fri, May 30, 2014 at 09:57:10AM -0500, Christoph Lameter wrote:
> >> > > On Fri, 30 May 2014, Vladimir Davydov wrote:
> >> > >
> >> > > > (3) is a bit more difficult, because slabs are added to per-cpu partial
> >> > > > lists lock-less. Fortunately, we only have to handle the __slab_free
> >> > > > case, because, as there shouldn't be any allocation requests dispatched
> >> > > > to a dead memcg cache, get_partial_node() should never be called. In
> >> > > > __slab_free we use cmpxchg to modify kmem_cache_cpu->partial (see
> >> > > > put_cpu_partial) so that setting ->partial to a special value, which
> >> > > > will make put_cpu_partial bail out, will do the trick.
> > [...]
> >> I think that we can do (3) easily.
> >> If we check memcg_cache_dead() at the end of put_cpu_partial() rather
> >> than at the beginning, we can avoid the race you mentioned. If someone
> >> calls put_cpu_partial() before the dead flag is set, the slab can be
> >> zapped by whoever sets the dead flag. And if someone calls
> >> put_cpu_partial() after the dead flag is set, it can be zapped by
> >> whoever called put_cpu_partial().
> >
> > After put_cpu_partial() adds a frozen slab to a per cpu partial list,
> > the slab becomes visible to other threads, which means it can be
> > unfrozen and freed. The latter can trigger cache destruction. Hence we
> > shouldn't touch the cache, in particular call memcg_cache_dead() on it,
> > after calling put_cpu_partial(), otherwise we can get use-after-free.
> >
> > However, what you propose makes sense if we disable irqs before adding a
> > slab to a partial list and enable them only after checking if the cache
> > is dead and unfreezing all partials if so, i.e.
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index d96faa2464c3..14b9e9a8677c 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2030,8 +2030,15 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> > struct page *oldpage;
> > int pages;
> > int pobjects;
> > + unsigned long flags;
> > + int irq_saved = 0;
> >
> > do {
> > + if (irq_saved) {
> > + local_irq_restore(flags);
> > + irq_saved = 0;
> > + }
> > +
> > pages = 0;
> > pobjects = 0;
> > oldpage = this_cpu_read(s->cpu_slab->partial);
> > @@ -2062,8 +2069,16 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> > page->pobjects = pobjects;
> > page->next = oldpage;
> >
> > + local_irq_save(flags);
> > + irq_saved = 1;
> > +
> > } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
> > != oldpage);
> > +
> > + if (memcg_cache_dead(s))
> > + unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
> > +
> > + local_irq_restore(flags);
> > #endif
> > }
> >
> >
> > That would be safe against possible cache destruction, because to remove
> > a slab from a per cpu partial list we have to run on the cpu it was
> > frozen on. Disabling irqs makes it impossible.
>
> Hmm... this is also a bit ugly.
> How about the following change?
>
> Thanks.
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 2b1ce69..6adab87 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2058,6 +2058,21 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>
> } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
> != oldpage);
> +
> + if (memcg_cache_dead(s)) {
> + bool done = false;
> + unsigned long flags;

Suppose we are preempted here. In the meanwhile all objects are freed to
the cache, all frozen pages are unfrozen and also freed. The cache
destruction is then scheduled (patch 2 of this set). Then when this
thread continues execution it will operate on the cache that was
destroyed - use-after-free.

I admit, this is very unlikely, but can we ignore this possibility?
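Spelled out step by step (an illustration of the window in the diff above, assuming the destruction scheduling from patch 2):

/*
 * 1. this_cpu_cmpxchg() succeeds: the frozen slab is on the per-cpu
 *    partial list and visible to everyone.
 * 2. This thread is preempted before evaluating memcg_cache_dead(s).
 * 3. Meanwhile all objects are freed, all frozen pages are unfrozen and
 *    freed, and the empty dead cache is destroyed by the scheduled work.
 * 4. This thread resumes and dereferences s in memcg_cache_dead(s) and
 *    this_cpu_read(s->cpu_slab->partial): use-after-free.
 */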

Thanks.

> +
> + local_irq_save(flags);
> + if (this_cpu_read(s->cpu_slab->partial) == page) {
> + done = true;
> + unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
> + }
> + local_irq_restore(flags);
> +
> + if (!done)
> + flush_all(s);
> + }
> #endif
> }

2014-06-04 08:53:35

by Joonsoo Kim

Subject: Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately

2014-06-03 17:16 GMT+09:00 Vladimir Davydov <[email protected]>:
> On Mon, Jun 02, 2014 at 11:03:51PM +0900, Joonsoo Kim wrote:
>> 2014-06-02 20:47 GMT+09:00 Vladimir Davydov <[email protected]>:
>> > Hi Joonsoo,
>> >
>> > On Mon, Jun 02, 2014 at 01:24:36PM +0900, Joonsoo Kim wrote:
>> >> On Sat, May 31, 2014 at 03:04:58PM +0400, Vladimir Davydov wrote:
>> >> > On Fri, May 30, 2014 at 09:57:10AM -0500, Christoph Lameter wrote:
>> >> > > On Fri, 30 May 2014, Vladimir Davydov wrote:
>> >> > >
>> >> > > > (3) is a bit more difficult, because slabs are added to per-cpu partial
>> >> > > > lists lock-less. Fortunately, we only have to handle the __slab_free
>> >> > > > case, because, as there shouldn't be any allocation requests dispatched
>> >> > > > to a dead memcg cache, get_partial_node() should never be called. In
>> >> > > > __slab_free we use cmpxchg to modify kmem_cache_cpu->partial (see
>> >> > > > put_cpu_partial) so that setting ->partial to a special value, which
>> >> > > > will make put_cpu_partial bail out, will do the trick.
>> > [...]
>> >> I think that we can do (3) easily.
>> >> If we check memcg_cache_dead() at the end of put_cpu_partial() rather
>> >> than at the beginning, we can avoid the race you mentioned. If someone
>> >> calls put_cpu_partial() before the dead flag is set, the slab can be
>> >> zapped by whoever sets the dead flag. And if someone calls
>> >> put_cpu_partial() after the dead flag is set, it can be zapped by
>> >> whoever called put_cpu_partial().
>> >
>> > After put_cpu_partial() adds a frozen slab to a per cpu partial list,
>> > the slab becomes visible to other threads, which means it can be
>> > unfrozen and freed. The latter can trigger cache destruction. Hence we
>> > shouldn't touch the cache, in particular call memcg_cache_dead() on it,
>> > after calling put_cpu_partial(), otherwise we can get use-after-free.
>> >
>> > However, what you propose makes sense if we disable irqs before adding a
>> > slab to a partial list and enable them only after checking if the cache
>> > is dead and unfreezing all partials if so, i.e.
>> >
>> > diff --git a/mm/slub.c b/mm/slub.c
>> > index d96faa2464c3..14b9e9a8677c 100644
>> > --- a/mm/slub.c
>> > +++ b/mm/slub.c
>> > @@ -2030,8 +2030,15 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>> > struct page *oldpage;
>> > int pages;
>> > int pobjects;
>> > + unsigned long flags;
>> > + int irq_saved = 0;
>> >
>> > do {
>> > + if (irq_saved) {
>> > + local_irq_restore(flags);
>> > + irq_saved = 0;
>> > + }
>> > +
>> > pages = 0;
>> > pobjects = 0;
>> > oldpage = this_cpu_read(s->cpu_slab->partial);
>> > @@ -2062,8 +2069,16 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>> > page->pobjects = pobjects;
>> > page->next = oldpage;
>> >
>> > + local_irq_save(flags);
>> > + irq_saved = 1;
>> > +
>> > } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
>> > != oldpage);
>> > +
>> > + if (memcg_cache_dead(s))
>> > + unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
>> > +
>> > + local_irq_restore(flags);
>> > #endif
>> > }
>> >
>> >
>> > That would be safe against possible cache destruction, because to remove
>> > a slab from a per cpu partial list we have to run on the cpu it was
>> > frozen on. Disabling irqs makes it impossible.
>>
>> Hmm... this is also a bit ugly.
>> How about the following change?
>>
>> Thanks.
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 2b1ce69..6adab87 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2058,6 +2058,21 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>>
>> } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
>> != oldpage);
>> +
>> + if (memcg_cache_dead(s)) {
>> + bool done = false;
>> + unsigned long flags;
>
> Suppose we are preempted here. In the meanwhile all objects are freed to
> the cache, all frozen pages are unfrozen and also freed. The cache
> destruction is then scheduled (patch 2 of this set). Then when this
> thread continues execution it will operate on the cache that was
> destroyed - use-after-free.
>
> I admit, this is very unlikely, but can we ignore this possibility?
>

Hello,

From your comment, I now realize that your cache destruction solution
has a severe problem.

With your solution, a kmem_cache can be destroyed before the last kfree()
caller has returned. That means we can't safely do anything involving the
kmem_cache in the free path once we have lost control of the slab we are
freeing the object to.

Consider __slab_free(). After put_cpu_partial() is called there, we attempt
to update a statistic. With your solution that update could be a
use-after-free. For now it is only a stat update, but there could be more in
the future. I don't want to impose this constraint on the slab free path.
So IMHO it is better to defer destroying the kmem_cache until the last
kfree() caller returns. Is that fair enough? :)
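
For reference, the __slab_free() tail being referred to looks roughly like this (a simplified excerpt; the exact code in the tree may differ):

	if (new.frozen && !was_frozen) {
		put_cpu_partial(s, page, 1);
		/*
		 * The slab is now reachable from a per-cpu partial list; if
		 * the cache can be destroyed from this point on, then...
		 */
		stat(s, CPU_PARTIAL_FREE);
		/* ...this dereference of s may already be a use-after-free. */
	}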

Thanks.

2014-06-04 09:47:54

by Vladimir Davydov

Subject: Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately

On Wed, Jun 04, 2014 at 05:53:29PM +0900, Joonsoo Kim wrote:
> Consider __slab_free(). After put_cpu_partial() is called there, we attempt
> to update a statistic. With your solution that update could be a
> use-after-free. For now it is only a stat update, but there could be more in
> the future. I don't want to impose this constraint on the slab free path.

We could move the stats update before the object free, I guess, but I admit
that is not a flexible solution, because every future modification to
slab_free would then have to be made with great care, otherwise it may break
things.

> So IMHO it is better to defer destroying the kmem_cache until the last
> kfree() caller returns. Is that fair enough? :)

Actually, I was thinking about it (even discussed with Christoph), but
the problem is that there is currently no way to wait for all currently
executing kfree's to complete, because SLUB's version can be preempted
at any time.

One way to solve this is to make slab_free non-preemptible and call
synchronize_sched before kmem_cache_destroy (or use call_rcu_sched).
When I started to implement this approach I found the resulting code a
bit ugly. Also, Christoph had some concerns about it (see
https://lkml.org/lkml/2014/5/23/524). That's why I tried to go with this
patch set first, but that doesn't mean I'm 100% sure about it :-) I'll
send an implementation of the other approach (with preempt_disable)
soon.
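
A rough sketch of that alternative, for concreteness (the names do_slab_free() and destroy_dead_memcg_cache() are illustrative, not from any actual patch; __kmem_cache_shutdown() stands in for the real teardown):

static __always_inline void slab_free(struct kmem_cache *s,
				      struct page *page, void *x,
				      unsigned long addr)
{
	/*
	 * Turn the whole free path into a sched-RCU read-side section so
	 * that cache destruction can wait for it to finish.
	 */
	preempt_disable();
	do_slab_free(s, page, x, addr);		/* existing fast/slow path */
	preempt_enable();
}

static void destroy_dead_memcg_cache(struct kmem_cache *s)
{
	/*
	 * Any slab_free() that might still dereference s runs with
	 * preemption disabled, so after synchronize_sched() no such call
	 * is in flight and the cache can be torn down safely.
	 */
	synchronize_sched();
	__kmem_cache_shutdown(s);
}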

Thanks.