2018-11-17 01:34:57

by Wengang Wang

Subject: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

The this_cpu_cmpxchg makes the do-while loop pass as long as
s->cpu_slab->partial has the same value. It doesn't care what happened to
that slab. Interrupts are not disabled, and new alloc/free can happen in
the interrupt handlers. Theoretically, after we have a reference to it,
stored in _oldpage_, the first slab on the partial list on this CPU can be
moved to kmem_cache_node, then moved to a different kmem_cache_cpu, and
then somehow added back as the head of the partial list of the current
kmem_cache_cpu, though that is a very rare case. If that rare case really
happens, the read of oldpage->pobjects may unexpectedly get 0xdead0000,
stored in _pobjects_, if the read happens just after another CPU removed
the slab from kmem_cache_node, setting lru.prev to LIST_POISON2
(0xdead000000000200). The wrong (negative) _pobjects_ then prevents slabs
from being moved to kmem_cache_node and finally freed.
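
To make the overlay concrete: in the struct page of this era, the slub
fields _pages_ and _pobjects_ share storage with the lru list_head, so a
list_del() that poisons lru.prev rewrites both of them. A minimal userspace
sketch of that layout (a mock for illustration, not the real struct page
definition; little-endian 64-bit assumed):

#include <stdio.h>

#define LIST_POISON2 ((void *)0xdead000000000200)

struct mock_page {
	union {
		struct {			/* list view */
			void *lru_next;
			void *lru_prev;		/* poisoned by list_del() */
		};
		struct {			/* slub view */
			struct mock_page *next;
			int pages;		/* low 4 bytes of lru_prev */
			int pobjects;		/* high 4 bytes of lru_prev */
		};
	};
};

int main(void)
{
	struct mock_page page = { .pages = 1, .pobjects = 1 };

	page.lru_prev = LIST_POISON2;	/* another CPU's list_del() */
	printf("pobjects = 0x%x\n", page.pobjects);	/* 0xdead0000 */
	return 0;
}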

In a vmcore, we see 375210 slabs kept on the partial list of one
kmem_cache_cpu, but only 305 in-use objects on the same list, for the
kmalloc-2048 cache. We see negative values for page.pobjects; the last page
with a negative _pobjects_ has the value 0xdead0004, and the next page
looks good (_pobjects_ is 1).

For the fix, I wanted to call this_cpu_cmpxchg_double with
oldpage->pobjects, but failed due to the size difference between
oldpage->pobjects and cpu_slab->partial. So I changed to calling
this_cpu_cmpxchg_double with _tid_. I don't really require that no
alloc/free happens in between; I just want to make sure the first slab
didn't experience a remove and re-add. This patch is more a call for ideas.
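
For context, this_cpu_cmpxchg_double() compares two per-cpu words against
two expected values and, only when both still match, installs two new
values as a single atomic operation; it returns 1 on success and 0 on
failure, which is why the loop condition in the diff below is "== 0". A
rough, non-atomic model of those semantics (for illustration only):

static int cmpxchg_double_model(unsigned long *p1, unsigned long *p2,
				unsigned long old1, unsigned long old2,
				unsigned long new1, unsigned long new2)
{
	if (*p1 == old1 && *p2 == old2) {
		*p1 = new1;	/* both matched: update both words */
		*p2 = new2;
		return 1;
	}
	return 0;		/* mismatch: the caller retries the loop */
}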

Signed-off-by: Wengang Wang <[email protected]>
---
mm/slub.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index e3629cd..26539e6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2248,6 +2248,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
{
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct page *oldpage;
+ unsigned long tid;
int pages;
int pobjects;

@@ -2255,8 +2256,12 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
do {
pages = 0;
pobjects = 0;
- oldpage = this_cpu_read(s->cpu_slab->partial);

+ tid = this_cpu_read(s->cpu_slab->tid);
+ /* read tid before reading oldpage */
+ barrier();
+
+ oldpage = this_cpu_read(s->cpu_slab->partial);
if (oldpage) {
pobjects = oldpage->pobjects;
pages = oldpage->pages;
@@ -2283,8 +2288,17 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
page->pobjects = pobjects;
page->next = oldpage;

- } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
- != oldpage);
+ /* we don't change tid, but want to make sure it didn't change
+ * in between. We don't really require that no alloc/free happens on
+ * this CPU, but don't want the first slab to be removed from and
+ * then re-added as head to this partial list. If that case
+ * happened, pobjects may read 0xdead0000 when this slab was just
+ * removed from kmem_cache_node by another CPU setting lru.prev
+ * to LIST_POISON2.
+ */
+ } while (this_cpu_cmpxchg_double(s->cpu_slab->partial, s->cpu_slab->tid,
+ oldpage, tid, page, tid) == 0);
+
if (unlikely(!s->cpu_partial)) {
unsigned long flags;

--
2.9.5



2018-11-17 02:52:32

by Wei Yang

Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

On Fri, Nov 16, 2018 at 05:33:35PM -0800, Wengang Wang wrote:
>The this_cpu_cmpxchg makes the do-while loop pass as long as the
>s->cpu_slab->partial as the same value. It doesn't care what happened to
>that slab. Interrupt is not disabled, and new alloc/free can happen in the
>interrupt handlers. Theoretically, after we have a reference to the it,
>stored in _oldpage_, the first slab on the partial list on this CPU can be
>moved to kmem_cache_node and then moved to different kmem_cache_cpu and
>then somehow can be added back as head to partial list of current
>kmem_cache_cpu, though that is a very rare case. If that rare case really

I didn't fully catch this case.

When put_cpu_partial() is called, it means we are trying to freeze a
frozen page, and this page is fully occupied, since page->freelist is
NULL.

A full page is supposed to be on no list when has_cpu_partial() is true.

So I don't understand when it would be moved to a different kmem_cache_cpu.

>happened, the reading of oldpage->pobjects may get a 0xdead0000
>unexpectedly, stored in _pobjects_, if the reading happens just after
>another CPU removed the slab from kmem_cache_node, setting lru.prev to
>LIST_POISON2 (0xdead000000000200). The wrong _pobjects_(negative) then
>prevents slabs from being moved to kmem_cache_node and being finally freed.

It looks like this page is removed from some list. In which case does this
happen? I mean, which list was the page previously on?


--
Wei Yang
Help you, Help me

2018-11-18 01:07:58

by Wei Yang

Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

On Fri, Nov 16, 2018 at 05:33:35PM -0800, Wengang Wang wrote:
>The this_cpu_cmpxchg makes the do-while loop pass as long as the
>s->cpu_slab->partial as the same value. It doesn't care what happened to
>that slab. Interrupt is not disabled, and new alloc/free can happen in the

Well, I seem to understand your description.

There are two slabs

* one which put_cpu_partial() trying to free an object
* one which is the first slab in cpu_partial list

There is a tricky case: the first slab in the cpu_partial list we have a
reference to can change, since interrupts are not disabled.

>interrupt handlers. Theoretically, after we have a reference to the it,

^^^
one more word?

>stored in _oldpage_, the first slab on the partial list on this CPU can be

^^^
One little suggestion here: maybe using cpu_partial would be easier to
understand. I confused this with the partial list in kmem_cache_node the
first time. :-)

>moved to kmem_cache_node and then moved to different kmem_cache_cpu and
>then somehow can be added back as head to partial list of current
>kmem_cache_cpu, though that is a very rare case. If that rare case really

Actually, no matter what happens after the removal of the first slab in
cpu_partial, it would lead to problems.
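
In other words, this is the classic ABA problem: a bare cmpxchg on the
pointer cannot tell "unchanged" apart from "changed and changed back". A
schematic interleaving (illustrative only, not tied to a particular tree):

/*
 * CPU0: oldpage = this_cpu_read(s->cpu_slab->partial);   // head page P
 * IRQ / other CPUs: P is drained to kmem_cache_node, taken over by
 *                   another kmem_cache_cpu, then re-added as the head
 *                   of CPU0's cpu_partial list again
 * CPU0: reads P->pobjects while P is off-list (may see 0xdead0000)
 * CPU0: this_cpu_cmpxchg(s->cpu_slab->partial, P, page) succeeds anyway,
 *       because ->partial equals P again, and the bogus pobjects value
 *       is published in the new head page
 */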

>happened, the reading of oldpage->pobjects may get a 0xdead0000
>unexpectedly, stored in _pobjects_, if the reading happens just after
>another CPU removed the slab from kmem_cache_node, setting lru.prev to
>LIST_POISON2 (0xdead000000000200). The wrong _pobjects_(negative) then
>prevents slabs from being moved to kmem_cache_node and being finally freed.
>
>We see in a vmcore, there are 375210 slabs kept in the partial list of one
>kmem_cache_cpu, but only 305 in-use objects in the same list for
>kmalloc-2048 cache. We see negative values for page.pobjects, the last page
>with negative _pobjects_ has the value of 0xdead0004, the next page looks
>good (_pobjects is 1).
>
>For the fix, I wanted to call this_cpu_cmpxchg_double with
>oldpage->pobjects, but failed due to size difference between
>oldpage->pobjects and cpu_slab->partial. So I changed to call
>this_cpu_cmpxchg_double with _tid_. I don't really want no alloc/free
>happen in between, but just want to make sure the first slab did expereince
>a remove and re-add. This patch is more to call for ideas.

Maybe not an exact solution.

I took a look into the code and change log.

_tid_ was introduced by commit 8a5ec0ba42c4 ('Lockless (and preemptless)
fastpaths for slub'), where it is used to guard cpu_freelist, while we
don't modify _tid_ when cpu_partial changes.

May need another _tid_ for cpu_partial?
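
A sketch of what that could look like (the field name partial_tid is
invented here for illustration; it does not exist in any tree): bump it on
every update of cpu_slab->partial and pair it with _partial_ in the
cmpxchg_double, so a remove plus re-add of the head page can no longer
look like "nothing happened":

struct kmem_cache_cpu {
	void **freelist;	/* pointer to next available object */
	unsigned long tid;	/* guards the freelist fast path */
	struct page *page;	/* the slab from which we are allocating */
#ifdef CONFIG_SLUB_CPU_PARTIAL
	struct page *partial;		/* partially allocated frozen slabs */
	unsigned long partial_tid;	/* hypothetical: incremented on
					 * every change of ->partial */
#endif
};

Placing partial_tid directly after partial would also satisfy the adjacency
requirement of this_cpu_cmpxchg_double(), which the (partial, tid) pair in
the posted patch does not meet.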

>
>Signed-off-by: Wengang Wang <[email protected]>
>---
> mm/slub.c | 20 +++++++++++++++++---
> 1 file changed, 17 insertions(+), 3 deletions(-)
>
>diff --git a/mm/slub.c b/mm/slub.c
>index e3629cd..26539e6 100644
>--- a/mm/slub.c
>+++ b/mm/slub.c
>@@ -2248,6 +2248,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> {
> #ifdef CONFIG_SLUB_CPU_PARTIAL
> struct page *oldpage;
>+ unsigned long tid;
> int pages;
> int pobjects;
>
>@@ -2255,8 +2256,12 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> do {
> pages = 0;
> pobjects = 0;
>- oldpage = this_cpu_read(s->cpu_slab->partial);
>
>+ tid = this_cpu_read(s->cpu_slab->tid);
>+ /* read tid before reading oldpage */
>+ barrier();
>+
>+ oldpage = this_cpu_read(s->cpu_slab->partial);
> if (oldpage) {
> pobjects = oldpage->pobjects;
> pages = oldpage->pages;
>@@ -2283,8 +2288,17 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> page->pobjects = pobjects;
> page->next = oldpage;
>
>- } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
>- != oldpage);
>+ /* we dont' change tid, but want to make sure it didn't change
>+ * in between. We don't really hope alloc/free not happen on
>+ * this CPU, but don't want the first slab be removed from and
>+ * then re-added as head to this partial list. If that case
>+ * happened, pobjects may read 0xdead0000 when this slab is just
>+ * removed from kmem_cache_node by other CPU setting lru.prev
>+ * to LIST_POISON2.
>+ */
>+ } while (this_cpu_cmpxchg_double(s->cpu_slab->partial, s->cpu_slab->tid,
>+ oldpage, tid, page, tid) == 0);
>+
> if (unlikely(!s->cpu_partial)) {
> unsigned long flags;
>
>--
>2.9.5

--
Wei Yang
Help you, Help me

2018-11-20 02:57:38

by zhong jiang

Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

On 2018/11/17 9:33, Wengang Wang wrote:
> The this_cpu_cmpxchg makes the do-while loop pass as long as the
> s->cpu_slab->partial as the same value. It doesn't care what happened to
> that slab. Interrupt is not disabled, and new alloc/free can happen in the
> interrupt handlers. Theoretically, after we have a reference to the it,
> stored in _oldpage_, the first slab on the partial list on this CPU can be
> moved to kmem_cache_node and then moved to different kmem_cache_cpu and
> then somehow can be added back as head to partial list of current
> kmem_cache_cpu, though that is a very rare case. If that rare case really
> happened, the reading of oldpage->pobjects may get a 0xdead0000
> unexpectedly, stored in _pobjects_, if the reading happens just after
> another CPU removed the slab from kmem_cache_node, setting lru.prev to
> LIST_POISON2 (0xdead000000000200). The wrong _pobjects_(negative) then
> prevents slabs from being moved to kmem_cache_node and being finally freed.
>
> We see in a vmcore, there are 375210 slabs kept in the partial list of one
> kmem_cache_cpu, but only 305 in-use objects in the same list for
> kmalloc-2048 cache. We see negative values for page.pobjects, the last page
> with negative _pobjects_ has the value of 0xdead0004, the next page looks
> good (_pobjects is 1).
>
> For the fix, I wanted to call this_cpu_cmpxchg_double with
> oldpage->pobjects, but failed due to size difference between
> oldpage->pobjects and cpu_slab->partial. So I changed to call
> this_cpu_cmpxchg_double with _tid_. I don't really want no alloc/free
> happen in between, but just want to make sure the first slab did expereince
> a remove and re-add. This patch is more to call for ideas.
Have you hit the really issue or just review the code ?

I did hit the issue and fixed in the upstream patch unpredictably by the following patch.
e5d9998f3e09 ("slub: make ->cpu_partial unsigned int")

Thanks,
zhong jiang


2018-11-20 03:06:01

by kernel test robot

Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

Hi Wengang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.20-rc3 next-20181119]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Wengang-Wang/mm-use-this_cpu_cmpxchg_double-in-put_cpu_partial/20181119-215159
config: x86_64-kexec (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

All warnings (new ones prefixed by >>):

>> mm/slub.o: warning: objtool: get_partial_node.isra.19()+0x27e: unreachable instruction

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (882.00 B)
.config.gz (25.56 kB)

2018-11-20 18:17:07

by Wengang Wang

Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

Hi Wei,


On 2018/11/17 17:02, Wei Yang wrote:
> On Fri, Nov 16, 2018 at 05:33:35PM -0800, Wengang Wang wrote:
>> The this_cpu_cmpxchg makes the do-while loop pass as long as the
>> s->cpu_slab->partial as the same value. It doesn't care what happened to
>> that slab. Interrupt is not disabled, and new alloc/free can happen in the
> Well, I seems to understand your description.
>
> There are two slabs
>
> * one which put_cpu_partial() trying to free an object
> * one which is the first slab in cpu_partial list
>
> There is some tricky case, the first slab in cpu_partial list we
> reference to will change since interrupt is not disabled.
Yes, two slabs are involved here, just as you said above.
And yes, the case is really tricky, but it's there.

>> interrupt handlers. Theoretically, after we have a reference to the it,
> ^^^
> one more word?
sorry, "the" should not be there.

>> stored in _oldpage_, the first slab on the partial list on this CPU can be
> ^^^
> One little suggestion here, mayby use cpu_partial would be more easy to
> understand. I confused this with the partial list in kmem_cache_node at
> the first time. :-)
Right, making it easy for others to understand is very important. I just
meant cpu_partial.

>> moved to kmem_cache_node and then moved to different kmem_cache_cpu and
>> then somehow can be added back as head to partial list of current
>> kmem_cache_cpu, though that is a very rare case. If that rare case really
> Actually, no matter what happens after the removal of the first slab in
> cpu_partial, it would leads to problem.
Maybe you are right; what I see is the problem with page->pobjects.

>
>> happened, the reading of oldpage->pobjects may get a 0xdead0000
>> unexpectedly, stored in _pobjects_, if the reading happens just after
>> another CPU removed the slab from kmem_cache_node, setting lru.prev to
>> LIST_POISON2 (0xdead000000000200). The wrong _pobjects_(negative) then
>> prevents slabs from being moved to kmem_cache_node and being finally freed.
>>
>> We see in a vmcore, there are 375210 slabs kept in the partial list of one
>> kmem_cache_cpu, but only 305 in-use objects in the same list for
>> kmalloc-2048 cache. We see negative values for page.pobjects, the last page
>> with negative _pobjects_ has the value of 0xdead0004, the next page looks
>> good (_pobjects is 1).
>>
>> For the fix, I wanted to call this_cpu_cmpxchg_double with
>> oldpage->pobjects, but failed due to size difference between
>> oldpage->pobjects and cpu_slab->partial. So I changed to call
>> this_cpu_cmpxchg_double with _tid_. I don't really want no alloc/free
>> happen in between, but just want to make sure the first slab did expereince
>> a remove and re-add. This patch is more to call for ideas.
> Maybe not an exact solution.
>
> I took a look into the code and change log.
>
> _tid_ is introduced by commit 8a5ec0ba42c4 ('Lockless (and preemptless)
> fastpaths for slub'), which is used to guard cpu_freelist. While we don't
> modify _tid_ when cpu_partial changes.
>
> May need another _tid_ for cpu_partial?
Right, _tid_ changes later than cpu_partial changes.

As pointed out by Zhong Jiang, the pobjects issue is fixed by commit
e5d9998f3e09 (not sure if by side effect, see my reply there), so
I'd skip this patch. If we find other problems regarding the change of
cpu_partial, let's fix them. What do you think?

thanks,
wengang



2018-11-20 20:13:23

by Wengang Wang

Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

Hi Zhong,


On 2018/11/19 18:18, zhong jiang wrote:
> On 2018/11/17 9:33, Wengang Wang wrote:
>> The this_cpu_cmpxchg makes the do-while loop pass as long as the
>> s->cpu_slab->partial as the same value. It doesn't care what happened to
>> that slab. Interrupt is not disabled, and new alloc/free can happen in the
>> interrupt handlers. Theoretically, after we have a reference to the it,
>> stored in _oldpage_, the first slab on the partial list on this CPU can be
>> moved to kmem_cache_node and then moved to different kmem_cache_cpu and
>> then somehow can be added back as head to partial list of current
>> kmem_cache_cpu, though that is a very rare case. If that rare case really
>> happened, the reading of oldpage->pobjects may get a 0xdead0000
>> unexpectedly, stored in _pobjects_, if the reading happens just after
>> another CPU removed the slab from kmem_cache_node, setting lru.prev to
>> LIST_POISON2 (0xdead000000000200). The wrong _pobjects_(negative) then
>> prevents slabs from being moved to kmem_cache_node and being finally freed.
>>
>> We see in a vmcore, there are 375210 slabs kept in the partial list of one
>> kmem_cache_cpu, but only 305 in-use objects in the same list for
>> kmalloc-2048 cache. We see negative values for page.pobjects, the last page
>> with negative _pobjects_ has the value of 0xdead0004, the next page looks
>> good (_pobjects is 1).
>>
>> For the fix, I wanted to call this_cpu_cmpxchg_double with
>> oldpage->pobjects, but failed due to size difference between
>> oldpage->pobjects and cpu_slab->partial. So I changed to call
>> this_cpu_cmpxchg_double with _tid_. I don't really want no alloc/free
>> happen in between, but just want to make sure the first slab did expereince
>> a remove and re-add. This patch is more to call for ideas.
> Have you hit the really issue or just review the code ?
Yup, I hit the real issue. The root cause came out by reviewing the code.

> I did hit the issue and fixed in the upstream patch unpredictably by the following patch.
> e5d9998f3e09 ("slub: make ->cpu_partial unsigned int")
I am not sure if the patch you mentioned was intended to fix the problem
here. With that patch the negative page->pobjects becomes a large positive
value, so it wins the comparison with s->cpu_partial and goes ahead to
unfreeze the partial slabs.
Though it may not be a perfect fix for this issue, it really fixes (or
works around) the issue here.
I'd like to skip my patch.

thanks,
wengang



2018-11-21 03:03:55

by Wei Yang

Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

On Tue, Nov 20, 2018 at 09:58:58AM -0800, Wengang Wang wrote:
>Hi Wei,
>
>
>On 2018/11/17 17:02, Wei Yang wrote:
>> On Fri, Nov 16, 2018 at 05:33:35PM -0800, Wengang Wang wrote:
>> > The this_cpu_cmpxchg makes the do-while loop pass as long as the
>> > s->cpu_slab->partial as the same value. It doesn't care what happened to
>> > that slab. Interrupt is not disabled, and new alloc/free can happen in the
>> Well, I seems to understand your description.
>>
>> There are two slabs
>>
>> * one which put_cpu_partial() trying to free an object
>> * one which is the first slab in cpu_partial list
>>
>> There is some tricky case, the first slab in cpu_partial list we
>> reference to will change since interrupt is not disabled.
>Yes, two slabs involved here just as you said above.
>And yes, the case is really tricky, but it's there.
>
>> > interrupt handlers. Theoretically, after we have a reference to the it,
>> ^^^
>> one more word?
>sorry, "the" should not be there.
>
>> > stored in _oldpage_, the first slab on the partial list on this CPU can be
>> ^^^
>> One little suggestion here, mayby use cpu_partial would be more easy to
>> understand. I confused this with the partial list in kmem_cache_node at
>> the first time. :-)
>Right, making others understanding easily is very important. I just meant
>cpu_partial.
>
>> > moved to kmem_cache_node and then moved to different kmem_cache_cpu and
>> > then somehow can be added back as head to partial list of current
>> > kmem_cache_cpu, though that is a very rare case. If that rare case really
>> Actually, no matter what happens after the removal of the first slab in
>> cpu_partial, it would leads to problem.
>Maybe you are right, what I see is the problem on the page->pobjects.
>
>>
>> > happened, the reading of oldpage->pobjects may get a 0xdead0000
>> > unexpectedly, stored in _pobjects_, if the reading happens just after
>> > another CPU removed the slab from kmem_cache_node, setting lru.prev to
>> > LIST_POISON2 (0xdead000000000200). The wrong _pobjects_(negative) then
>> > prevents slabs from being moved to kmem_cache_node and being finally freed.
>> >
>> > We see in a vmcore, there are 375210 slabs kept in the partial list of one
>> > kmem_cache_cpu, but only 305 in-use objects in the same list for
>> > kmalloc-2048 cache. We see negative values for page.pobjects, the last page
>> > with negative _pobjects_ has the value of 0xdead0004, the next page looks
>> > good (_pobjects is 1).
>> >
>> > For the fix, I wanted to call this_cpu_cmpxchg_double with
>> > oldpage->pobjects, but failed due to size difference between
>> > oldpage->pobjects and cpu_slab->partial. So I changed to call
>> > this_cpu_cmpxchg_double with _tid_. I don't really want no alloc/free
>> > happen in between, but just want to make sure the first slab did expereince
>> > a remove and re-add. This patch is more to call for ideas.
>> Maybe not an exact solution.
>>
>> I took a look into the code and change log.
>>
>> _tid_ is introduced by commit 8a5ec0ba42c4 ('Lockless (and preemptless)
>> fastpaths for slub'), which is used to guard cpu_freelist. While we don't
>> modify _tid_ when cpu_partial changes.
>>
>> May need another _tid_ for cpu_partial?
>Right, _tid_ changes later than cpu_partial changes.
>
>As pointed out by Zhong Jiang, the pobjects issue is fixed by commit

Where did you discuss this issue? Is there any reference I could take a look at?

>e5d9998f3e09 (not sure if by side effect, see my replay there),

I took a look at this commit e5d9998f3e09 ('slub: make ->cpu_partial
unsigned int'), but did not see the relationship between them.

Would you mind showing me a link, or cc'ing me in case there is further
discussion?

Thanks.


--
Wei Yang
Help you, Help me

2018-11-21 03:21:03

by Wengang Wang

Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

Hi Wei,

I think you will receive my reply to Zhong, but I am copying my comments
on that patch here (again):

Copy starts ==>

I am not sure if the patch you mentioned was intended to fix the problem
here. With that patch the negative page->pobjects becomes a large positive
value, so it wins the comparison with s->cpu_partial and goes ahead to
unfreeze the partial slabs.
Though it may not be a perfect fix for this issue, it really fixes (or
works around) the issue here.
I'd like to skip my patch.

<=== Copy ends

thanks,

wengang



2018-11-22 12:35:10

by Wei Yang

Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

On Tue, Nov 20, 2018 at 07:18:13PM -0800, Wengang Wang wrote:
>Hi Wei,
>
>I think you will receive my reply to Zhong, But I am copying my comments for
>that patch here (again):
>
>Copy starts ==>
>
>I am not sure if the patch you mentioned intended to fix the problem here.
>With that patch the negative page->pobjects would become a large positive
>value,
>it will win the compare with s->cpu_partial and go ahead to unfreeze the
>partial slabs.
>Though it may be not a perfect fix for this issue, it really fixes (or
>workarounds) the issue here.
>I'd like to skip my patch..
>
><=== Copy ends

Thanks.

I still didn't get the point. Let's see whether I received your reply in
that thread.


--
Wei Yang
Help you, Help me

2018-11-22 19:50:18

by kernel test robot

Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

Hi Wengang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.20-rc3 next-20181121]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Wengang-Wang/mm-use-this_cpu_cmpxchg_double-in-put_cpu_partial/20181119-215159
config: x86_64-randconfig-e2-11191630 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

All warnings (new ones prefixed by >>):

>> mm/slub.o: warning: objtool: ___slab_alloc.constprop.72()+0x5be: unreachable instruction

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (901.00 B)
.config.gz (32.28 kB)

2018-11-23 06:46:01

by Chen, Rong A

Subject: [LKP] [mm] fb420465c9: kernel_BUG_at_mm/slub.c

FYI, we noticed the following commit (built with gcc-7):

commit: fb420465c9bcaf57aa6bff76ffe31add559ae1f9 ("[PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial")
url: https://github.com/0day-ci/linux/commits/Wengang-Wang/mm-use-this_cpu_cmpxchg_double-in-put_cpu_partial/20181119-215159


in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 8G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+-------------------------------------------------------+-----------+------------+
|                                                       | v4.20-rc3 | fb420465c9 |
+-------------------------------------------------------+-----------+------------+
| boot_successes                                        | 185       | 0          |
| boot_failures                                         | 19        | 29         |
| End_of_test:RCU_HOTPLUG                               | 15        |            |
| BUG:kernel_reboot-without-warning_in_test_stage       | 4         |            |
| kernel_BUG_at_mm/slub.c                               | 0         | 29         |
| invalid_opcode:#[##]                                  | 0         | 29         |
| RIP:put_cpu_partial                                   | 0         | 29         |
| RIP:default_idle                                      | 0         | 15         |
| Kernel_panic-not_syncing:Fatal_exception_in_interrupt | 0         | 19         |
| Kernel_panic-not_syncing:Fatal_exception              | 0         | 10         |
| RIP:cma_init_reserved_areas                           | 0         | 1          |
| RIP:_raw_spin_unlock_irqrestore                       | 0         | 1          |
| RIP:console_unlock                                    | 0         | 3          |
+-------------------------------------------------------+-----------+------------+



[ 17.997088] kernel BUG at mm/slub.c:2300!
[ 17.997617] invalid opcode: 0000 [#1] SMP PTI
[ 18.001088] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.20.0-rc3-00001-gfb42046 #3
[ 18.001088] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 18.001088] RIP: 0010:put_cpu_partial+0x8f/0x93
[ 18.001088] Code: 31 c9 66 8b 45 2a 0f b7 55 28 ff c1 89 4d 10 48 89 7d 08 25 ff 7f 00 00 29 d0 01 f0 89 45 14 48 8b 03 48 83 c0 18 a8 0f 74 02 <0f> 0b 0f 0b 0f 1f 44 00 00 4c 8d 54 24 08 48 83 e4 f0 41 ff 72 f8
[ 18.001088] RSP: 0000:ffff88823fc03ed8 EFLAGS: 00010202
[ 18.001088] RAX: 0000000000025a98 RBX: ffff88822b053200 RCX: 0000000000000001
[ 18.001088] RDX: 0000000000000014 RSI: 0000000000000000 RDI: 0000000000000000
[ 18.001088] RBP: ffffea0008a8a680 R08: 0000000000000001 R09: ffffffff810b8700
[ 18.001088] R10: ffffea0008a8a680 R11: 0000000000000020 R12: 000000000000000a
[ 18.001088] R13: 0000000000000202 R14: 0000000000000001 R15: 0000000000000000
[ 18.001088] FS: 0000000000000000(0000) GS:ffff88823fc00000(0000) knlGS:0000000000000000
[ 18.001088] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 18.001088] CR2: 00000000ffffffff CR3: 0000000002212001 CR4: 00000000001606f0
[ 18.001088] Call Trace:
[ 18.001088] <IRQ>
[ 18.001088] kmem_cache_free+0x132/0x191
[ 18.001088] rcu_process_callbacks+0x223/0x383
[ 18.001088] __do_softirq+0x120/0x297
[ 18.001088] ? clockevents_program_event+0xbc/0xde
[ 18.001088] irq_exit+0x5d/0x9c
[ 18.001088] smp_apic_timer_interrupt+0x127/0x13a
[ 18.001088] apic_timer_interrupt+0xf/0x20
[ 18.001088] </IRQ>
[ 18.001088] RIP: 0010:default_idle+0xa3/0x142
[ 18.001088] Code: 00 00 00 e8 b7 01 25 00 48 83 3b 00 eb e0 e8 f5 8d 70 ff 89 ee 48 c7 c7 00 02 26 82 e8 d4 4a 70 ff 65 ff 0d f7 39 66 7e fb f4 <65> 44 8b 25 ed de 65 7e 8b 05 7f a3 98 00 85 c0 0f 8e 82 00 00 00
[ 18.001088] RSP: 0000:ffffffff82203ea8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[ 18.001088] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[ 18.001088] RDX: 0000000000000066 RSI: ffff88823fc1c280 RDI: 0000000000000000
[ 18.001088] RBP: 0000000000000000 R08: 0000000080000863 R09: 0000000000000006
[ 18.001088] R10: fffffffffffffff0 R11: 0000000000025900 R12: 0000000000000000
[ 18.001088] R13: ffffffff8270d2e0 R14: 0000000000000000 R15: 0000000000000000
[ 18.001088] do_idle+0xdd/0x1ea
[ 18.001088] cpu_startup_entry+0x1d/0x1f
[ 18.001088] start_kernel+0x48b/0x4ab
[ 18.001088] secondary_startup_64+0xa4/0xb0
[ 18.001088] Modules linked in:
[ 18.001488] ---[ end trace 71ecba4b74b83907 ]---
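
A plausible cause for this BUG, assuming the v4.20 layout sketched below:
this_cpu_cmpxchg_double() requires its two per-cpu words to be adjacent
and suitably aligned, since it compiles down to a double-width
compare-and-swap such as cmpxchg16b, and the percpu wrappers sanity-check
the addresses. That holds for the freelist/tid pair used by the fast
paths, but not for the partial/tid pair the patch passes:

/* struct kmem_cache_cpu, abridged (layout as in v4.20) */
struct kmem_cache_cpu {
	void **freelist;	/* adjacent to tid: (freelist, tid) is a
				 * valid this_cpu_cmpxchg_double pair */
	unsigned long tid;
	struct page *page;	/* sits between tid and partial, ... */
#ifdef CONFIG_SLUB_CPU_PARTIAL
	struct page *partial;	/* ... so (partial, tid) trips the
				 * adjacency check at runtime */
#endif
};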


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
lkp


Attachments:
(No filename) (4.70 kB)
config-4.20.0-rc3-00001-gfb42046 (109.41 kB)
job-script (4.46 kB)
dmesg.xz (7.36 kB)

2018-11-26 02:03:02

by Wei Yang

Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

On Tue, Nov 20, 2018 at 10:58 AM zhong jiang <[email protected]> wrote:
>
> On 2018/11/17 9:33, Wengang Wang wrote:
> > The this_cpu_cmpxchg makes the do-while loop pass as long as the
> > s->cpu_slab->partial as the same value. It doesn't care what happened to
> > that slab. Interrupt is not disabled, and new alloc/free can happen in the
> > interrupt handlers. Theoretically, after we have a reference to the it,
> > stored in _oldpage_, the first slab on the partial list on this CPU can be
> > moved to kmem_cache_node and then moved to different kmem_cache_cpu and
> > then somehow can be added back as head to partial list of current
> > kmem_cache_cpu, though that is a very rare case. If that rare case really
> > happened, the reading of oldpage->pobjects may get a 0xdead0000
> > unexpectedly, stored in _pobjects_, if the reading happens just after
> > another CPU removed the slab from kmem_cache_node, setting lru.prev to
> > LIST_POISON2 (0xdead000000000200). The wrong _pobjects_(negative) then
> > prevents slabs from being moved to kmem_cache_node and being finally freed.
> >
> > We see in a vmcore, there are 375210 slabs kept in the partial list of one
> > kmem_cache_cpu, but only 305 in-use objects in the same list for
> > kmalloc-2048 cache. We see negative values for page.pobjects, the last page
> > with negative _pobjects_ has the value of 0xdead0004, the next page looks
> > good (_pobjects is 1).
> >
> > For the fix, I wanted to call this_cpu_cmpxchg_double with
> > oldpage->pobjects, but failed due to size difference between
> > oldpage->pobjects and cpu_slab->partial. So I changed to call
> > this_cpu_cmpxchg_double with _tid_. I don't really want no alloc/free
> > happen in between, but just want to make sure the first slab did expereince
> > a remove and re-add. This patch is more to call for ideas.
> Have you hit the really issue or just review the code ?
>
> I did hit the issue and fixed in the upstream patch unpredictably by the following patch.
> e5d9998f3e09 ("slub: make ->cpu_partial unsigned int")
>

Zhong,

I took a look at your upstream patch, but I am confused how it fixes this
issue.

In put_cpu_partial(), the cmpxchg compares cpu_slab->partial (a page
struct pointer), not cpu_partial (an unsigned integer). I didn't get the
point of this fix.


2018-11-26 17:02:53

by Wengang Wang

Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial



On 2018/11/25 17:59, Wei Yang wrote:
> On Tue, Nov 20, 2018 at 10:58 AM zhong jiang <[email protected]> wrote:
>> On 2018/11/17 9:33, Wengang Wang wrote:
>>> The this_cpu_cmpxchg makes the do-while loop pass as long as the
>>> s->cpu_slab->partial as the same value. It doesn't care what happened to
>>> that slab. Interrupt is not disabled, and new alloc/free can happen in the
>>> interrupt handlers. Theoretically, after we have a reference to the it,
>>> stored in _oldpage_, the first slab on the partial list on this CPU can be
>>> moved to kmem_cache_node and then moved to different kmem_cache_cpu and
>>> then somehow can be added back as head to partial list of current
>>> kmem_cache_cpu, though that is a very rare case. If that rare case really
>>> happened, the reading of oldpage->pobjects may get a 0xdead0000
>>> unexpectedly, stored in _pobjects_, if the reading happens just after
>>> another CPU removed the slab from kmem_cache_node, setting lru.prev to
>>> LIST_POISON2 (0xdead000000000200). The wrong _pobjects_(negative) then
>>> prevents slabs from being moved to kmem_cache_node and being finally freed.
>>>
>>> We see in a vmcore, there are 375210 slabs kept in the partial list of one
>>> kmem_cache_cpu, but only 305 in-use objects in the same list for
>>> kmalloc-2048 cache. We see negative values for page.pobjects, the last page
>>> with negative _pobjects_ has the value of 0xdead0004, the next page looks
>>> good (_pobjects is 1).
>>>
>>> For the fix, I wanted to call this_cpu_cmpxchg_double with
>>> oldpage->pobjects, but failed due to size difference between
>>> oldpage->pobjects and cpu_slab->partial. So I changed to call
>>> this_cpu_cmpxchg_double with _tid_. I don't really want no alloc/free
>>> happen in between, but just want to make sure the first slab did expereince
>>> a remove and re-add. This patch is more to call for ideas.
>> Have you hit the really issue or just review the code ?
>>
>> I did hit the issue and fixed in the upstream patch unpredictably by the following patch.
>> e5d9998f3e09 ("slub: make ->cpu_partial unsigned int")
>>
> Zhong,
>
> I took a look into your upstream patch, while I am confused how your patch
> fix this issue?
>
> In put_cpu_partial(), the cmpxchg compare cpu_slab->partial (a page struct)
> instead of the cpu_partial (an unsigned integer). I didn't get the
> point of this fix.

I think the patch can't prevent pobjects from being set to 0xdead0000
(the upper 4 bytes of LIST_POISON2).
But if pobjects is treated as an unsigned integer,

2266                         pobjects = oldpage->pobjects;
2267                         pages = oldpage->pages;
2268                         if (drain && pobjects > s->cpu_partial) {
2269                                 unsigned long flags;

line 2268 will be true in put_cpu_partial(), so the code goes to
unfreeze_partials(). This way the slabs in the cpu partial list can be
moved to kmem_cache_node and then freed. So it fixes (or say works around)
the problem I see here (a huge number of empty slabs staying in the cpu
partial list).
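
A minimal userspace demo of that comparison (the cpu_partial threshold is
an arbitrary example value):

#include <stdio.h>

int main(void)
{
	int pobjects = (int)0xdead0000;	/* negative as a signed int */
	unsigned int cpu_partial = 6;	/* example threshold */

	/* before e5d9998f3e09: signed compare, the poisoned value is
	 * negative and loses, so unfreeze_partials() is never reached */
	printf("signed:   %d\n", pobjects > (int)cpu_partial);
	/* after it: pobjects is promoted to unsigned and wins */
	printf("unsigned: %d\n", (unsigned int)pobjects > cpu_partial);
	return 0;
}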

thanks
wengang



2018-11-27 00:37:42

by Wei Yang

Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

On Mon, Nov 26, 2018 at 08:57:54AM -0800, Wengang Wang wrote:
>
>
>On 2018/11/25 17:59, Wei Yang wrote:
>> On Tue, Nov 20, 2018 at 10:58 AM zhong jiang <[email protected]> wrote:
>> > On 2018/11/17 9:33, Wengang Wang wrote:
>> > > The this_cpu_cmpxchg makes the do-while loop pass as long as the
>> > > s->cpu_slab->partial as the same value. It doesn't care what happened to
>> > > that slab. Interrupt is not disabled, and new alloc/free can happen in the
>> > > interrupt handlers. Theoretically, after we have a reference to the it,
>> > > stored in _oldpage_, the first slab on the partial list on this CPU can be
>> > > moved to kmem_cache_node and then moved to different kmem_cache_cpu and
>> > > then somehow can be added back as head to partial list of current
>> > > kmem_cache_cpu, though that is a very rare case. If that rare case really
>> > > happened, the reading of oldpage->pobjects may get a 0xdead0000
>> > > unexpectedly, stored in _pobjects_, if the reading happens just after
>> > > another CPU removed the slab from kmem_cache_node, setting lru.prev to
>> > > LIST_POISON2 (0xdead000000000200). The wrong _pobjects_(negative) then
>> > > prevents slabs from being moved to kmem_cache_node and being finally freed.
>> > >
>> > > We see in a vmcore, there are 375210 slabs kept in the partial list of one
>> > > kmem_cache_cpu, but only 305 in-use objects in the same list for
>> > > kmalloc-2048 cache. We see negative values for page.pobjects, the last page
>> > > with negative _pobjects_ has the value of 0xdead0004, the next page looks
>> > > good (_pobjects is 1).
>> > >
>> > > For the fix, I wanted to call this_cpu_cmpxchg_double with
>> > > oldpage->pobjects, but failed due to the size difference between
>> > > oldpage->pobjects and cpu_slab->partial. So I changed it to call
>> > > this_cpu_cmpxchg_double with _tid_. I don't insist that no alloc/free
>> > > happens in between; I just want to make sure the first slab really did
>> > > experience a remove and re-add. This patch is more a call for ideas.
>> > Have you really hit the issue, or did you just review the code?
>> >
>> > I did hit the issue, and it was unexpectedly fixed by the following upstream patch:
>> > e5d9998f3e09 ("slub: make ->cpu_partial unsigned int")
>> >
>> Zhong,
>>
>> I took a look at your upstream patch, but I am confused about how it
>> fixes this issue.
>>
>> In put_cpu_partial(), the cmpxchg compares cpu_slab->partial (a page
>> struct pointer) rather than cpu_partial (an unsigned integer). I didn't
>> get the point of this fix.
>
>I think the patch can't prevent pobjects from being set to 0xdead0000 (the
>high 4 bytes of LIST_POISON2).
>But if pobjects is treated as an unsigned integer,
>
>2266                   pobjects = oldpage->pobjects;
>2267                   pages = oldpage->pages;
>2268                   if (drain && pobjects > s->cpu_partial) {
>2269                           unsigned long flags;
>

Ehh..., you mean (0xdead0000 > 0x02)?

This is a really bad thing if it just works around the problem like this.
I strongly disagree that this is a *fix*. It is too tricky.

>line 2268 will be true in put_cpu_partial(), so the code goes to
>unfreeze_partials(). This way the slabs on the cpu partial list can be
>moved to kmem_cache_node and then freed. So it fixes (or rather works
>around) the problem I see here (a huge number of empty slabs staying on
>the cpu partial list).
>
>thanks
>wengang
>
>> > Thanks,
>> > zhong jiang

--
Wei Yang
Help you, Help me
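
A standalone illustration of the comparison being discussed here (this is
not the kernel code; the threshold value 30 is made up, and the names are
simplified) shows why commit e5d9998f3e09 changes the behavior:

    #include <stdio.h>

    int main(void)
    {
        int pobjects = (int)0xdead0004;  /* poisoned value from the vmcore */
        int cpu_partial_signed = 30;             /* before e5d9998f3e09 */
        unsigned int cpu_partial_unsigned = 30;  /* after e5d9998f3e09 */

        /* Signed compare: 0xdead0004 is negative, the drain condition
         * never fires, and empty slabs pile up on the partial list. */
        printf("signed:   drain=%d\n", pobjects > cpu_partial_signed);

        /* Unsigned compare: pobjects is converted to unsigned, the
         * poisoned value is huge, and unfreeze_partials() is reached. */
        printf("unsigned: drain=%d\n",
               (unsigned int)pobjects > cpu_partial_unsigned);
        return 0;
    }

This prints drain=0 and then drain=1, which is why the unsigned change
drains the poisoned list as a side effect rather than by design.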

2018-11-27 01:45:02

by Wengang Wang

[permalink] [raw]
Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial



On 2018/11/26 16:36, Wei Yang wrote:
> On Mon, Nov 26, 2018 at 08:57:54AM -0800, Wengang Wang wrote:
>>
>> On 2018/11/25 17:59, Wei Yang wrote:
>>> On Tue, Nov 20, 2018 at 10:58 AM zhong jiang <[email protected]> wrote:
>>>> On 2018/11/17 9:33, Wengang Wang wrote:
>>>>> The this_cpu_cmpxchg makes the do-while loop pass as long as
>>>>> s->cpu_slab->partial has the same value. It doesn't care what happened
>>>>> to that slab. Interrupts are not disabled, and new alloc/free can happen
>>>>> in interrupt handlers. Theoretically, after we have a reference to it,
>>>>> stored in _oldpage_, the first slab on the partial list of this CPU can
>>>>> be moved to kmem_cache_node, then to a different kmem_cache_cpu, and
>>>>> then somehow be added back as the head of the partial list of the
>>>>> current kmem_cache_cpu, though that is a very rare case. If that rare
>>>>> case really happens, the read of oldpage->pobjects may unexpectedly
>>>>> return 0xdead0000, stored in _pobjects_, if the read happens just after
>>>>> another CPU removed the slab from kmem_cache_node, setting lru.prev to
>>>>> LIST_POISON2 (0xdead000000000200). The wrong (negative) _pobjects_ then
>>>>> prevents slabs from being moved to kmem_cache_node and finally freed.
>>>>>
>>>>> In a vmcore, we see 375210 slabs kept on the partial list of one
>>>>> kmem_cache_cpu, but only 305 in-use objects in the same list for the
>>>>> kmalloc-2048 cache. We see negative values for page.pobjects; the last
>>>>> page with a negative _pobjects_ has the value 0xdead0004, and the next
>>>>> page looks good (_pobjects_ is 1).
>>>>>
>>>>> For the fix, I wanted to call this_cpu_cmpxchg_double with
>>>>> oldpage->pobjects, but failed due to the size difference between
>>>>> oldpage->pobjects and cpu_slab->partial. So I changed it to call
>>>>> this_cpu_cmpxchg_double with _tid_. I don't insist that no alloc/free
>>>>> happens in between; I just want to make sure the first slab really did
>>>>> experience a remove and re-add. This patch is more a call for ideas.
>>>> Have you really hit the issue, or did you just review the code?
>>>>
>>>> I did hit the issue, and it was unexpectedly fixed by the following upstream patch:
>>>> e5d9998f3e09 ("slub: make ->cpu_partial unsigned int")
>>>>
>>> Zhong,
>>>
>>> I took a look at your upstream patch, but I am confused about how it
>>> fixes this issue.
>>>
>>> In put_cpu_partial(), the cmpxchg compares cpu_slab->partial (a page
>>> struct pointer) rather than cpu_partial (an unsigned integer). I didn't
>>> get the point of this fix.
>> I think the patch can't prevent pobjects from being set to 0xdead0000 (the
>> high 4 bytes of LIST_POISON2).
>> But if pobjects is treated as an unsigned integer,
>>
>> 2266                   pobjects = oldpage->pobjects;
>> 2267                   pages = oldpage->pages;
>> 2268                   if (drain && pobjects > s->cpu_partial) {
>> 2269                           unsigned long flags;
>>
> Ehh..., you mean (0xdead0000 > 0x02)?
Yes.

> This is a really bad thing if it just works around the problem like this.
It does.
> I strongly disagree that this is a *fix*. It is too tricky.
It's tricky. I don't know what purpose the patch was meant to serve; the
commit message just said _pobjects_ should be "unsigned"...

thanks,
wengang

>> line 2268 will be true in put_cpu_partial(), so the code goes to
>> unfreeze_partials(). This way the slabs on the cpu partial list can be
>> moved to kmem_cache_node and then freed. So it fixes (or rather works
>> around) the problem I see here (a huge number of empty slabs staying on
>> the cpu partial list).
>>
>> thanks
>> wengang
>>
>>>> Thanks,
>>>> zhong jiang
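
Where the 0xdead0000 comes from can be shown with a small standalone
program. The struct below is a simplified stand-in for the struct page
union of that era (little-endian x86-64 assumed), not the real definition:

    #include <stdio.h>

    #define LIST_POISON2 ((void *)0xdead000000000200UL)

    struct fake_page {
        union {
            struct {            /* viewed as struct list_head lru */
                void *next;
                void *prev;
            } lru;
            struct {            /* viewed as the SLUB partial fields */
                void *next;     /* overlaps lru.next */
                int pages;      /* low 4 bytes of lru.prev */
                int pobjects;   /* high 4 bytes of lru.prev */
            };
        };
    };

    int main(void)
    {
        struct fake_page page = { .lru = { NULL, NULL } };

        /* another CPU's list_del() poisons lru.prev ... */
        page.lru.prev = LIST_POISON2;

        /* ... and a racing put_cpu_partial() reads the poison halves */
        printf("pages    = 0x%x\n", page.pages);    /* 0x200 */
        printf("pobjects = 0x%x\n", page.pobjects); /* 0xdead0000 */
        return 0;
    }

Read as a signed int, 0xdead0000 is negative, which matches the stuck
partial lists described above.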


2018-11-27 02:52:13

by Wei Yang

[permalink] [raw]
Subject: Re: [PATCH] mm: use this_cpu_cmpxchg_double in put_cpu_partial

On Mon, Nov 26, 2018 at 05:42:28PM -0800, Wengang Wang wrote:
>
>
>On 2018/11/26 16:36, Wei Yang wrote:
>> On Mon, Nov 26, 2018 at 08:57:54AM -0800, Wengang Wang wrote:
>> >
>> > On 2018/11/25 17:59, Wei Yang wrote:
>> > > On Tue, Nov 20, 2018 at 10:58 AM zhong jiang <[email protected]> wrote:
>> > > > On 2018/11/17 9:33, Wengang Wang wrote:
>> > > > > The this_cpu_cmpxchg makes the do-while loop pass as long as
>> > > > > s->cpu_slab->partial has the same value. It doesn't care what happened
>> > > > > to that slab. Interrupts are not disabled, and new alloc/free can happen
>> > > > > in interrupt handlers. Theoretically, after we have a reference to it,
>> > > > > stored in _oldpage_, the first slab on the partial list of this CPU can
>> > > > > be moved to kmem_cache_node, then to a different kmem_cache_cpu, and
>> > > > > then somehow be added back as the head of the partial list of the
>> > > > > current kmem_cache_cpu, though that is a very rare case. If that rare
>> > > > > case really happens, the read of oldpage->pobjects may unexpectedly
>> > > > > return 0xdead0000, stored in _pobjects_, if the read happens just after
>> > > > > another CPU removed the slab from kmem_cache_node, setting lru.prev to
>> > > > > LIST_POISON2 (0xdead000000000200). The wrong (negative) _pobjects_ then
>> > > > > prevents slabs from being moved to kmem_cache_node and finally freed.
>> > > > >
>> > > > > In a vmcore, we see 375210 slabs kept on the partial list of one
>> > > > > kmem_cache_cpu, but only 305 in-use objects in the same list for the
>> > > > > kmalloc-2048 cache. We see negative values for page.pobjects; the last
>> > > > > page with a negative _pobjects_ has the value 0xdead0004, and the next
>> > > > > page looks good (_pobjects_ is 1).
>> > > > >
>> > > > > For the fix, I wanted to call this_cpu_cmpxchg_double with
>> > > > > oldpage->pobjects, but failed due to the size difference between
>> > > > > oldpage->pobjects and cpu_slab->partial. So I changed it to call
>> > > > > this_cpu_cmpxchg_double with _tid_. I don't insist that no alloc/free
>> > > > > happens in between; I just want to make sure the first slab really did
>> > > > > experience a remove and re-add. This patch is more a call for ideas.
>> > > > Have you really hit the issue, or did you just review the code?
>> > > >
>> > > > I did hit the issue, and it was unexpectedly fixed by the following upstream patch:
>> > > > e5d9998f3e09 ("slub: make ->cpu_partial unsigned int")
>> > > >
>> > > Zhong,
>> > >
>> > > I took a look at your upstream patch, but I am confused about how it
>> > > fixes this issue.
>> > >
>> > > In put_cpu_partial(), the cmpxchg compares cpu_slab->partial (a page
>> > > struct pointer) rather than cpu_partial (an unsigned integer). I didn't
>> > > get the point of this fix.
>> > I think the patch can't prevent pobjects from being set to 0xdead0000
>> > (the high 4 bytes of LIST_POISON2).
>> > But if pobjects is treated as an unsigned integer,
>> >
>> > 2266                   pobjects = oldpage->pobjects;
>> > 2267                   pages = oldpage->pages;
>> > 2268                   if (drain && pobjects > s->cpu_partial) {
>> > 2269                           unsigned long flags;
>> >
>> Ehh..., you mean (0xdead0000 > 0x02)?
>Yes.
>
>> This is a really bad thing if it just works around the problem like this.
>It does.
>> I strongly disagree that this is a *fix*. It is too tricky.
>It's tricky. I don't know what purpose the patch was meant to serve; the
>commit message just said _pobjects_ should be "unsigned"...

Well, accepting that commit upstream is reasonable, since pobjects should
never be negative. But these are two different things.

Relating that commit to the problem here is not proper.

>
>thanks,
>wengang
>
>> > line 2268 will be true in put_cpu_partial(), so the code goes to
>> > unfreeze_partials(). This way the slabs on the cpu partial list can be
>> > moved to kmem_cache_node and then freed. So it fixes (or rather works
>> > around) the problem I see here (a huge number of empty slabs staying on
>> > the cpu partial list).
>> >
>> > thanks
>> > wengang
>> >
>> > > > Thanks,
>> > > > zhong jiang

--
Wei Yang
Help you, Help me