2016-10-26 12:52:04

by Andreas Gruenbacher

Subject: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

Hi,

CONFIG_VMAP_STACK has broken gfs2 and I'm trying to figure out what's
going on. What I'm seeing is the following: on a fresh gfs2 filesystem
created with:

mkfs.gfs2 -p lock_nolock $DEVICE

I get the following BUG with 4.9-rc2, CONFIG_VMAP_STACK and
CONFIG_DEBUG_VIRTUAL turned on:

kernel BUG at arch/x86/mm/physaddr.c:26!

Stack of kernel thread:

__phys_addr(x)
bit_waitqueue(word, bit)
wake_up_bit(word = &gh->gh_iflags, bit = HIF_WAIT)
gfs2_holder_wake(gh)

The gh here is on the stack of another kernel thread:

static int fill_super(struct super_block *sb,
                      struct gfs2_args *args, int silent)
{
	struct gfs2_holder mount_gh;
	...
}

Which is waiting on the bit with:

wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
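
For reference, the waiter/waker pairing here looks roughly like this (a
simplified sketch, not the literal gfs2 code):

	/* waiter side: gh lives on fill_super()'s stack */
	set_bit(HIF_WAIT, &gh->gh_iflags);
	...
	wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE);

	/* waker side: gfs2_holder_wake(gh), called from another thread */
	clear_bit(HIF_WAIT, &gh->gh_iflags);
	smp_mb__after_atomic();
	/* BUGs here: bit_waitqueue() does virt_to_page() on a
	   vmalloc'ed stack address */
	wake_up_bit(&gh->gh_iflags, HIF_WAIT);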

Is accessing a struct on another kernel thread's stack no longer working?

Thanks,
Andreas


2016-10-26 15:51:53

by Andy Lutomirski

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 5:51 AM, Andreas Gruenbacher
<[email protected]> wrote:
> Hi,
>
> CONFIG_VMAP_STACK has broken gfs2 and I'm trying to figure out what's
> going on. What I'm seeing is the following: on a fresh gfs2 filesystem
> created with:
>
> mkfs.gfs2 -p lock_nolock $DEVICE
>
> I get the following BUG with 4.9-rc2, CONFIG_VMAP_STACK and
> CONFIG_DEBUG_VIRTUAL turned on:
>
> kernel BUG at arch/x86/mm/physaddr.c:26!
>
> Stack of kernel thread:
>
> __phys_addr(x)
> bit_waitqueue(word, bit)
> wake_up_bit(word = &gh->gh_iflags, bit = HIF_WAIT)
> gfs2_holder_wake(gh)

It's this:

const struct zone *zone = page_zone(virt_to_page(word));

If the stack is vmalloced, then you can't find the page's zone like
that. We could look it up the slow way (ick!), but maybe another
solution would be to do:

wait_queue_head_t *wait_table;

if (virt_addr_valid(word))
	wait_table = page_zone(virt_to_page(word))->wait_table;
else
	wait_table = funny_wait_table;

where funny_wait_table is an extra wait table just for funny addresses.

This will scale poorly on very large NUMA systems if on-stack wait_bit
words are in heavy simultaneous use, since the fallback table is shared
across all nodes, but I suspect this is a very rare use case.

>
> Is accessing a struct on another kernel thread's stack no longer working?

That part should be fine.

2016-10-26 16:32:21

by Linus Torvalds

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 8:51 AM, Andy Lutomirski <[email protected]> wrote:
>>
>> I get the following BUG with 4.9-rc2, CONFIG_VMAP_STACK and
>> CONFIG_DEBUG_VIRTUAL turned on:
>>
>> kernel BUG at arch/x86/mm/physaddr.c:26!
>
> const struct zone *zone = page_zone(virt_to_page(word));
>
> If the stack is vmalloced, then you can't find the page's zone like
> that. We could look it up the slow way (ick!), but maybe another
> solution would be to do:

Christ. It's that damn bit-wait craziness again with the idiotic zone lookup.

I complained about it a couple of weeks ago for entirely unrelated
reasons: it absolutely sucks donkey ass through a straw from a cache
standpoint too. It makes the page_waitqueue() thing very expensive, to
the point where it shows up as taking up 3% of CPU time on a real
load.

PeterZ had a patch that fixed most of the performance trouble because
the page_waitqueue is actually never realistically contended, and by
making the bit-waiting use *two* bits you can avoid the slow-path cost
entirely.
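
The idea, as a rough sketch (this is the concept, not PeterZ's actual
patch; PG_waiters standing in for the second bit):

	/* unlock fast path: only touch the hashed waitqueue when a
	   waiter has announced itself by setting the second bit */
	clear_bit_unlock(PG_locked, &page->flags);
	if (test_bit(PG_waiters, &page->flags))
		wake_up_page(page, PG_locked);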

But here we have a totally different issue, namely that we want to
wait on a virtual address.

Quite frankly, I think the solution is to just rip out all the insane
zone crap. The most important use (by far) for the bit-waitqueue is
for the page locking, and with the "use a second bit to show
contention", there is absolutely no reason to try to do some crazy
per-zone thing. It's a slow-path that never matters, and rather than
make things scale well, the only thing it does is to pretty much
guarantee at least one extra cache miss.
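
Concretely, the lookup goes from a per-zone indirection to one flat
global table, roughly (sketching both sides; the new names are from the
patch in the follow-up):

	/* old: page -> zone -> wait_table, extra pointer chasing */
	zone = page_zone(page);
	wq = &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];

	/* new: one static array, indexed by a hash of the address */
	wq = bit_wait_table + hash_long(val, WAIT_TABLE_BITS);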

Adding MelG and the mm list to the cc (PeterZ was already there) here
just for the heads up.

Linus

2016-10-26 17:15:34

by Linus Torvalds

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 9:32 AM, Linus Torvalds
<[email protected]> wrote:
>
> Quite frankly, I think the solution is to just rip out all the insane
> zone crap.

IOW, something like the attached.

Advantage:

- just look at the number of garbage lines removed! 21
insertions(+), 182 deletions(-)

- it will actually speed up even the current case for all common
situations: no idiotic extra indirections that will take extra cache
misses

- because the bit_wait_table array is now denser (256 entries is
about 6kB of data on 64-bit with no spinlock debugging, so ~100
cachelines), maybe it gets fewer cache misses too

- we know how to handle the page_waitqueue contention issue, and it
has nothing to do with the stupid NUMA zones

The only case you actually get real page wait activity is IO, and I
suspect that hashing it out over ~100 cachelines will be more than
sufficient to avoid excessive contention, plus it's a cache-miss vs an
IO, so nobody sane cares.
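
(Checking the arithmetic: without spinlock debugging a
wait_queue_head_t is a spinlock plus a list head, 24 bytes on 64-bit,
so 256 entries are 256 * 24 = 6144 bytes, i.e. 6kB, and 6144 / 64 = 96
cachelines - hence the "~100" above.)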

The only reason it did that insane per-zone thing in the first place is
that right now we access those wait-queues even when we damn well
shouldn't, and we have the solution for that.

Guys, holler if you hate this, but I think it's realistically the only
sane solution to the "wait queue on stack" issue.

Oh, and the patch is obviously entirely untested. I wouldn't want to
ruin my reputation by *testing* the patches I send out. What would be
the fun in that?

Linus


Attachments:
patch.diff (10.29 kB)

2016-10-26 17:58:04

by Linus Torvalds

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 10:15 AM, Linus Torvalds
<[email protected]> wrote:
>
> Oh, and the patch is obviously entirely untested. I wouldn't want to
> ruin my reputation by *testing* the patches I send out. What would be
> the fun in that?

So I tested it. It compiles, and it actually also solves the
performance problem I was complaining about a couple of weeks ago with
"unlock_page()" having an insane 3% CPU overhead when doing lots of
small scripts ("make -j16 test" in the git tree for those who weren't
involved in the original thread three weeks ago).

So quite frankly, I'll just commit it. It should fix the new problem
with gfs2 and CONFIG_VMAP_STACK, and I see no excuse for the crazy
zone stuff considering how harmful it is to everybody else.

I expect that when the NUMA people complain about page locking (if
they ever even notice), PeterZ will stand up like the hero he is, and
say "look here, I can solve this for you".

Linus

2016-10-26 18:04:39

by Bob Peterson

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

----- Original Message -----
| On Wed, Oct 26, 2016 at 10:15 AM, Linus Torvalds
| <[email protected]> wrote:
| >
| > Oh, and the patch is obviously entirely untested. I wouldn't want to
| > ruin my reputation by *testing* the patches I send out. What would be
| > the fun in that?
|
| So I tested it. It compiles, and it actually also solves the
| performance problem I was complaining about a couple of weeks ago with
| "unlock_page()" having an insane 3% CPU overhead when doing lots of
| small scripts ("make -j16 test" in the git tree for those who weren't
| involved in the original thread three weeks ago).
|
| So quite frankly, I'll just commit it. It should fix the new problem
| with gfs2 and CONFIG_VMAP_STACK, and I see no excuse for the crazy
| zone stuff considering how harmful it is to everybody else.
|
| I expect that when the NUMA people complain about page locking (if
| they ever even notice), PeterZ will stand up like the hero he is, and
| say "look here, I can solve this for you".
|
| Linus
|
I can test it for you, if you give me about an hour.

Bob Peterson

2016-10-26 18:11:04

by Linus Torvalds

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 11:04 AM, Bob Peterson <[email protected]> wrote:
>
> I can test it for you, if you give me about an hour.

I can definitely wait an hour, it would be lovely to see more testing.
Especially if you have a NUMA machine and an interesting workload.

And if you actually have that NUMA machine and a load that shows the
page_waitqueue effects, it would also be lovely if you can then
_additionally_ test the patch that PeterZ wrote a few weeks ago, it
was on the mm list about a month ago:

Date: Thu, 29 Sep 2016 15:08:27 +0200
From: Peter Zijlstra <[email protected]>
Subject: Re: page_waitqueue() considered harmful
Message-ID: <[email protected]>

and if you don't find it I can forward it to you (Peter had a few
versions, that latest one is the one that looked best).

Linus

2016-10-26 19:12:19

by Bob Peterson

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

----- Original Message -----
| On Wed, Oct 26, 2016 at 11:04 AM, Bob Peterson <[email protected]> wrote:
| >
| > I can test it for you, if you give me about an hour.

Sorry. I guess I underestimated the time it takes to build a kernel
on my test box. It will take a little longer, but it's compiling now.

| I can definitely wait an hour, it would be lovely to see more testing.
| Especially if you have a NUMA machine and an interesting workload.

I'll see what I can cook up.

| And if you actually have that NUMA machine and a load that shows the
| page_waitqueue effects, it would also be lovely if you can then
| _additionally_ test the patch that PeterZ wrote a few weeks ago, it
| was on the mm list about a month ago:
|
| Date: Thu, 29 Sep 2016 15:08:27 +0200
| From: Peter Zijlstra <[email protected]>
| Subject: Re: page_waitqueue() considered harmful
| Message-ID: <[email protected]>
|
| and if you don't find it I can forward it to you (Peter had a few
| versions, that latest one is the one that looked best).

I'll see what I can do, but first I'll check basic functionality
and report back.

Bob Peterson

2016-10-26 20:32:05

by Mel Gorman

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 10:15:30AM -0700, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 9:32 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Quite frankly, I think the solution is to just rip out all the insane
> > zone crap.
>
> IOW, something like the attached.
>
> Advantage:
>
> - just look at the number of garbage lines removed! 21
> insertions(+), 182 deletions(-)
>
> - it will actually speed up even the current case for all common
> situations: no idiotic extra indirections that will take extra cache
> misses
>
> - because the bit_wait_table array is now denser (256 entries is
> about 6kB of data on 64-bit with no spinlock debugging, so ~100
> cachelines), maybe it gets fewer cache misses too
>
> - we know how to handle the page_waitqueue contention issue, and it
> has nothing to do with the stupid NUMA zones
>
> The only case you actually get real page wait activity is IO, and I
> suspect that hashing it out over ~100 cachelines will be more than
> sufficient to avoid excessive contention, plus it's a cache-miss vs an
> IO, so nobody sane cares.
>

IO wait activity is not all that matters. We hit the lock/unlock paths
during a lot of operations like reclaim.

False sharing is possible with either the new or old scheme so it's
irrelevant. There will be some remote NUMA cache misses which may be made
worse by false sharing. In the reclaim case, the bulk of those are hit
by kswapd. Kswapd itself doesn't care but there may be increased NUMA
traffic. By the time you hit direct reclaim, a remote cache miss is not
going to be the end of the world.

> Guys, holler if you hate this, but I think it's realistically the only
> sane solution to the "wait queue on stack" issue.
>

Hate? No.

It's not clear cut that NUMA remote accesses will be a problem. A remote
cache miss may or may not be more expensive than the old scheme's
multiple virt->page calculations and the pointer chasing to look up the
zone and the table. It's multiple potential local misses versus one
remote.

Even if NUMA conflicts are a problem, 256 entries gives 96 cache
lines. For pages only, the hash routine could partition the table space
into max(96, nr_online_nodes) partitions. It wouldn't be perfect as
wait_queue_head_t does not align well with cache line sizes, so there
would be collisions on the boundary, but it'd be close enough. It would
require that page_waitqueue use a different hashing function, so it's
not simple, but it's possible if someone is sufficiently motivated and
finds a workload that matters.

There is some question whether the sizing will lead to more collisions and
spurious wakeups. There is no way to predict how much of an issue that is
but I suspect a lot of those happen during reclaim anyway. If collisions
are a problem then the table could be dynamically sized in the similar
way the inode and dcache hash tables are.
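
Dynamic sizing would presumably follow the same pattern as the inode
and dcache hashes; a hypothetical sketch only, with assumed parameters
(this is not part of any patch in this thread):

	static wait_queue_head_t *bit_wait_table;
	static unsigned int wait_table_shift, wait_table_mask;

	static void __init bit_waitqueue_init(void)
	{
		unsigned int i;

		bit_wait_table = alloc_large_system_hash("bit-waitqueue",
					sizeof(wait_queue_head_t),
					0,	/* size from system memory */
					18,	/* assumed scale factor */
					0,
					&wait_table_shift,
					&wait_table_mask,
					0, 0);
		for (i = 0; i <= wait_table_mask; i++)
			init_waitqueue_head(bit_wait_table + i);
	}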

I didn't test this as I don't have a machine available right now. The bulk
of what you removed was related to hotplug, but the result looks hotplug
safe. So I have only two minor nits and a general caution to watch for
increased collisions and spurious wakeups, plus a minor caution about
remote access penalties.

> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7f2ae99e5daf..0f088f3a2fed 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -440,33 +440,7 @@ struct zone {
> seqlock_t span_seqlock;
> #endif
>
> - /*
> - * wait_table -- the array holding the hash table
> - * wait_table_hash_nr_entries -- the size of the hash table array
> - * wait_table_bits -- wait_table_size == (1 << wait_table_bits)
> - *
> - * The purpose of all these is to keep track of the people
> - * waiting for a page to become available and make them
> - * runnable again when possible. The trouble is that this
> - * consumes a lot of space, especially when so few things
> - * wait on pages at a given time. So instead of using
> - * per-page waitqueues, we use a waitqueue hash table.
> - *
> - * The bucket discipline is to sleep on the same queue when
> - * colliding and wake all in that wait queue when removing.
> - * When something wakes, it must check to be sure its page is
> - * truly available, a la thundering herd. The cost of a
> - * collision is great, but given the expected load of the
> - * table, they should be so rare as to be outweighed by the
> - * benefits from the saved space.
> - *
> - * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
> - * primary users of these fields, and in mm/page_alloc.c
> - * free_area_init_core() performs the initialization of them.
> - */
> - wait_queue_head_t *wait_table;
> - unsigned long wait_table_hash_nr_entries;
> - unsigned long wait_table_bits;
> + int initialized;
>
> /* Write-intensive fields used from the page allocator */
> ZONE_PADDING(_pad1_)

zone_is_initialized is mostly the domain of hotplug. A potential cleanup
is to use a page flag and shrink the size of zone slightly. Nothing to
panic over.

> @@ -546,7 +520,7 @@ static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
>
> static inline bool zone_is_initialized(struct zone *zone)
> {
> - return !!zone->wait_table;
> + return zone->initialized;
> }
>
> static inline bool zone_is_empty(struct zone *zone)
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 94732d1ab00a..42d4027f9e26 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7515,11 +7515,27 @@ static struct kmem_cache *task_group_cache __read_mostly;
> DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
> DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);
>
> +#define WAIT_TABLE_BITS 8
> +#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
> +static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
> +
> +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> +{
> + const int shift = BITS_PER_LONG == 32 ? 5 : 6;
> + unsigned long val = (unsigned long)word << shift | bit;
> +
> + return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
> +}
> +EXPORT_SYMBOL(bit_waitqueue);
> +

Minor nit that it's unfortunate this moved to the scheduler core. It
wouldn't have been a complete disaster to add a page_waitqueue_init() or
something similar after sched_init.
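
Something like this, say (hypothetical, just to illustrate the nit):

	/* keep the table with the wait code and initialize it
	   explicitly once the scheduler is up, e.g. from
	   start_kernel() right after sched_init() */
	void __init page_waitqueue_init(void)
	{
		int i;

		for (i = 0; i < WAIT_TABLE_SIZE; i++)
			init_waitqueue_head(bit_wait_table + i);
	}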

--
Mel Gorman
SUSE Labs

2016-10-26 21:01:58

by Bob Peterson

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

----- Original Message -----
| On Wed, Oct 26, 2016 at 11:04 AM, Bob Peterson <[email protected]> wrote:
| >
| > I can test it for you, if you give me about an hour.
|
| I can definitely wait an hour, it would be lovely to see more testing.
| Especially if you have a NUMA machine and an interesting workload.
|
| And if you actually have that NUMA machine and a load that shows the
| page_waitqueue effects, it would also be lovely if you can then
| _additionally_ test the patch that PeterZ wrote a few weeks ago, it
| was on the mm list about a month ago:
|
| Date: Thu, 29 Sep 2016 15:08:27 +0200
| From: Peter Zijlstra <[email protected]>
| Subject: Re: page_waitqueue() considered harmful
| Message-ID: <[email protected]>
|
| and if you don't find it I can forward it to you (Peter had a few
| versions, that latest one is the one that looked best).
|
| Linus
|

Hm. It didn't even boot, at least on my amd box in the lab.
I've made no attempt to debug this.

[ 2.368403] NetLabel: unlabeled traffic allowed by default
[ 2.374271] ------------[ cut here ]------------
[ 2.378877] kernel BUG at arch/x86/mm/physaddr.c:26!
[ 2.383829] invalid opcode: 0000 [#1] SMP
[ 2.387826] Modules linked in:
[ 2.390882] CPU: 11 PID: 1 Comm: swapper/0 Not tainted 4.9.0-rc2+ #1
[ 2.397219] Hardware name: Dell Inc. PowerEdge R815/06JC9T, BIOS 1.2.1 08/02/2010
[ 2.404683] task: ffff947136548000 task.stack: ffffb36043130000
[ 2.410588] RIP: 0010:[<ffffffffbe06848c>] [<ffffffffbe06848c>] __phys_addr+0x3c/0x50
[ 2.418500] RSP: 0018:ffffb36043133e10 EFLAGS: 00010287
[ 2.423798] RAX: fffff39132a822fc RBX: 0000000000000000 RCX: 0000000000000000
[ 2.430915] RDX: ffffffff00000001 RSI: 0000000000000000 RDI: ffff8800b2a822fc
[ 2.438032] RBP: ffffb36043133e10 R08: ffff9475364026f8 R09: 0000000000000000
[ 2.445151] R10: 0000000000000004 R11: 0000000000000000 R12: ffffffffbef8ce3d
[ 2.452269] R13: ffffffffbf0e0428 R14: ffffffffbef7a8cd R15: 0000000000000000
[ 2.459387] FS: 0000000000000000(0000) GS:ffff94773fa40000(0000) knlGS:0000000000000000
[ 2.467458] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.473188] CR2: 0000000000000000 CR3: 000000057de07000 CR4: 00000000000006e0
[ 2.480306] Stack:
[ 2.482312] ffffb36043133e38 ffffffffbef8d543 ffffb36043133e38 ffffffffbe34689b
[ 2.489729] 000000007f4214f8 ffffb36043133e48 ffffffffbef8ce79 ffffb36043133ec0
[ 2.497144] ffffffffbe002190 0000000000000000 0000000000000000 ffffffffbed42f50
[ 2.504561] Call Trace:
[ 2.507005] [<ffffffffbef8d543>] save_microcode_in_initrd_amd+0x31/0x106
[ 2.513778] [<ffffffffbe34689b>] ? debugfs_create_u64+0x2b/0x30
[ 2.519769] [<ffffffffbef8ce79>] save_microcode_in_initrd+0x3c/0x45
[ 2.526110] [<ffffffffbe002190>] do_one_initcall+0x50/0x180
[ 2.531756] [<ffffffffbef7a8cd>] ? set_debug_rodata+0x12/0x12
[ 2.537573] [<ffffffffbef7b17b>] kernel_init_freeable+0x194/0x230
[ 2.543740] [<ffffffffbe7f3430>] ? rest_init+0x80/0x80
[ 2.548952] [<ffffffffbe7f343e>] kernel_init+0xe/0x100
[ 2.554164] [<ffffffffbe800c55>] ret_from_fork+0x25/0x30
[ 2.559548] Code: 48 89 f8 72 28 48 2b 05 7b a0 dc 00 48 05 00 00 00 80 48 39 c7 72 14 0f b6 0d 6a 75 ee 00 48 89 c2 48 d3 ea 48 85 d2 75 02 5d c3 <0f> 0b 48 03 05 7b 5b da 00 48 81 ff ff ff ff 3f 76 ec 0f 0b 0f
[ 2.579022] RIP [<ffffffffbe06848c>] __phys_addr+0x3c/0x50
[ 2.584590] RSP <ffffb36043133e10>
[ 2.588117] ---[ end trace 5c9b40c31651bd33 ]---
[ 2.592745] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 2.592745]
[ 2.601900] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

Regards,

Bob Peterson

2016-10-26 21:27:03

by Linus Torvalds

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 1:31 PM, Mel Gorman <[email protected]> wrote:
>
> IO wait activity is not all that matters. We hit the lock/unlock paths
> during a lot of operations like reclaim.

I doubt we do.

Yes, we hit the lock/unlock itself, but do we hit the *contention*?

The current code is nasty, and always ends up touching the wait-queue
regardless of whether it needs to or not, but we have a fix for that.

With that fixed, do we actually get contention on a per-page basis?
Because without contention, we'd never actually look up the wait-queue
at all.

I suspect that without IO, it's really really hard to actually get
that contention, because things like reclaim end up looking at the LRU
queue etc with their own locking, so it should look at various
individual pages one at a time, not have multiple queues look at the
same page.

>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 7f2ae99e5daf..0f088f3a2fed 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -440,33 +440,7 @@ struct zone {
>> + int initialized;
>>
>> /* Write-intensive fields used from the page allocator */
>> ZONE_PADDING(_pad1_)
>
> zone_is_initialized is mostly the domain of hotplug. A potential cleanup
> is to use a page flag and shrink the size of zone slightly. Nothing to
> panic over.

I really did that to make it very obvious that there was no semantic
change. I just set the "initialized" flag in the same place where it
used to initialize the wait_table, so that this:

>> static inline bool zone_is_initialized(struct zone *zone)
>> {
>> - return !!zone->wait_table;
>> + return zone->initialized;
>> }

ends up being obviously equivalent.

Admittedly I didn't clear it when the code cleared the wait_table
pointer, because that _seemed_ a non-issue - the zone will remain
initialized until it can't be reached any more, so there didn't seem
to be an actual problem there.

>> +#define WAIT_TABLE_BITS 8
>> +#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
>> +static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
>> +
>> +wait_queue_head_t *bit_waitqueue(void *word, int bit)
>> +{
>> + const int shift = BITS_PER_LONG == 32 ? 5 : 6;
>> + unsigned long val = (unsigned long)word << shift | bit;
>> +
>> + return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
>> +}
>> +EXPORT_SYMBOL(bit_waitqueue);
>> +
>
> Minor nit that it's unfortunate this moved to the scheduler core. It
> wouldn't have been a complete disaster to add a page_waitqueue_init() or
> something similar after sched_init.

I considered that, but decided that "minimal patch" was better. Plus,
with that bit_waitqueue() actually also being used for the page
locking queues (which act _kind of_ but not quite, like a bitlock),
the bit_wait_table is actually more core than just the bit-wait code.

In fact, I considered just renaming it to "hashed_wait_queue", because
that's effectively how we use it now, rather than being particularly
specific to the bit-waiting. But again, that would have made the patch
bigger, which I wanted to avoid since this is a post-rc2 thing due to
the gfs2 breakage.

Linus

2016-10-26 21:30:52

by Linus Torvalds

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 2:01 PM, Bob Peterson <[email protected]> wrote:
>
> Hm. It didn't even boot, at least on my amd box in the lab.
> I've made no attempt to debug this.

Hmm. Looks like a completely independent issue from the patch. Did you
try booting that machine without the patch?

> [ 2.378877] kernel BUG at arch/x86/mm/physaddr.c:26!

Ok, similar issue, I think - passing a non-1:1 address to __phys_addr().

But the call trace has nothing to do with gfs2 or the bitlocks:

> [ 2.504561] Call Trace:
> [ 2.507005] save_microcode_in_initrd_amd+0x31/0x106
> [ 2.513778] save_microcode_in_initrd+0x3c/0x45
> [ 2.526110] do_one_initcall+0x50/0x180
> [ 2.531756] ? set_debug_rodata+0x12/0x12
> [ 2.537573] kernel_init_freeable+0x194/0x230
> [ 2.543740] ? rest_init+0x80/0x80
> [ 2.548952] kernel_init+0xe/0x100
> [ 2.554164] ret_from_fork+0x25/0x30

I think this might be the

cont = __pa(container);

line in save_microcode_in_initrd_amd().

I see that Borislav is busy with some x86/microcode patches, I suspect
he already hit this. Adding Borislav to the cc.

Can you re-try without the AMD microcode driver for now? This seems to
be a separate issue from the gfs2 one.

Linus

2016-10-26 22:03:49

by Mel Gorman

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 02:26:57PM -0700, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 1:31 PM, Mel Gorman <[email protected]> wrote:
> >
> > IO wait activity is not all that matters. We hit the lock/unlock paths
> > during a lot of operations like reclaim.
>
> I doubt we do.
>
> Yes, we hit the lock/unlock itself, but do we hit the *contention*?
>
> The current code is nasty, and always ends up touching the wait-queue
> regardless of whether it needs to or not, but we have a fix for that.
>

To be clear, are you referring to PeterZ's patch that avoids the lookup? If
so, I see your point.

> With that fixed, do we actually get contention on a per-page basis?

Reclaim would have to running parallel to migrations, faults, clearing
write-protect etc. I can't think of a situation where a normal workload
would hit it regularly and/or for long durations.

> Because without contention, we'd never actually look up the wait-queue
> at all.
>
> I suspect that without IO, it's really really hard to actually get
> that contention, because things like reclaim end up looking at the LRU
> queue etc with their own locking, so it should look at various
> individual pages one at a time, not have multiple queues look at the
> same page.
>

Except many direct reclaimers on small LRUs while a system is thrashing --
not a case that really matters, you've already lost.

> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 7f2ae99e5daf..0f088f3a2fed 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -440,33 +440,7 @@ struct zone {
> >> + int initialized;
> >>
> >> /* Write-intensive fields used from the page allocator */
> >> ZONE_PADDING(_pad1_)
> >
> > zone_is_initialized is mostly the domain of hotplug. A potential cleanup
> > is to use a page flag and shrink the size of zone slightly. Nothing to
> > panic over.
>
> I really did that to make it very obvious that there was no semantic
> change. I just set the "initialized" flag in the same place where it
> used to initialize the wait_table, so that this:
>
> >> static inline bool zone_is_initialized(struct zone *zone)
> >> {
> >> - return !!zone->wait_table;
> >> + return zone->initialized;
> >> }
>
> ends up being obviously equivalent.
>

No problem with that.

> >> +#define WAIT_TABLE_BITS 8
> >> +#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
> >> +static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
> >> +
> >> +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> >> +{
> >> + const int shift = BITS_PER_LONG == 32 ? 5 : 6;
> >> + unsigned long val = (unsigned long)word << shift | bit;
> >> +
> >> + return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
> >> +}
> >> +EXPORT_SYMBOL(bit_waitqueue);
> >> +
> >
> > Minor nit that it's unfortunate this moved to the scheduler core. It
> > wouldn't have been a complete disaster to add a page_waitqueue_init() or
> > something similar after sched_init.
>
> I considered that, but decided that "minimal patch" was better. Plus,
> with that bit_waitqueue() actually also being used for the page
> locking queues (which act _kind of_ but not quite, like a bitlock),
> the bit_wait_table is actually more core than just the bit-wait code.
>
> In fact, I considered just renaming it to "hashed_wait_queue", because
> that's effectively how we use it now, rather than being particularly
> specific to the bit-waiting. But again, that would have made the patch
> bigger, which I wanted to avoid since this is a post-rc2 thing due to
> the gfs2 breakage.
>

No objection. Shuffling it around does not make it obviously better in
any way.

In the meantime, a machine freed up. FWIW, it survived booting on a 2-socket
and about 20 minutes of bashing on reclaim paths from multiple processes
to beat on lock/unlock. I didn't do a performance comparison or gather
profile data but I wouldn't expect anything interesting from profiles
other than some cycles saved.

--
Mel Gorman
SUSE Labs

2016-10-26 22:09:58

by Linus Torvalds

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 3:03 PM, Mel Gorman <[email protected]> wrote:
>
> To be clear, are you referring to PeterZ's patch that avoids the lookup? If
> so, I see your point.

Yup, that's the one. I think you tested it. In fact, I'm sure you did,
because I remember seeing performance numbers from you ;)

So yes, I'd expect my patch on its own to quite possibly regress on
NUMA systems (although I wonder how much), but I consider PeterZ's
patch the fix to that, so I wouldn't worry about it.

Linus

2016-10-26 22:46:02

by Borislav Petkov

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 02:30:49PM -0700, Linus Torvalds wrote:
> Ok, similar issue, I think - passing a non-1:1 address to __phys_addr().
>
> But the call trace has nothing to do with gfs2 or the bitlocks:
>
> > [ 2.504561] Call Trace:
> > [ 2.507005] save_microcode_in_initrd_amd+0x31/0x106
> > [ 2.513778] save_microcode_in_initrd+0x3c/0x45
> > [ 2.526110] do_one_initcall+0x50/0x180
> > [ 2.531756] ? set_debug_rodata+0x12/0x12
> > [ 2.537573] kernel_init_freeable+0x194/0x230
> > [ 2.543740] ? rest_init+0x80/0x80
> > [ 2.548952] kernel_init+0xe/0x100
> > [ 2.554164] ret_from_fork+0x25/0x30
>
> I think this might be the
>
> cont = __pa(container);
>
> line in save_microcode_in_initrd_amd().
>
> I see that Borislav is busy with some x86/microcode patches, I suspect
> he already hit this. Adding Borislav to the cc.

Hmm, I guess that fires because that container thing is a static pointer
so it is >= PAGE_OFFSET. But I might be wrong, it is too late here for
brain to work.

In any case, looking at his Code:

0: 48 89 f8 mov %rdi,%rax
3: 72 28 jb 0x2d
5: 48 2b 05 7b a0 dc 00 sub 0xdca07b(%rip),%rax # 0xdca087
c: 48 05 00 00 00 80 add $0xffffffff80000000,%rax
12: 48 39 c7 cmp %rax,%rdi
^^^^^^^^^^^^^^^^

it could be this comparison here:

RAX: fffff39132a822fc, RDI: ffff8800b2a822fc

15: 72 14 jb 0x2b

... which sends us to the UD2.

17: 0f b6 0d 6a 75 ee 00 movzbl 0xee756a(%rip),%ecx # 0xee7588

We might end up at 0x2b from here too - that's !phys_addr_valid(x) - but
ECX is 0 while it should be 36...

1e: 48 89 c2 mov %rax,%rdx
21: 48 d3 ea shr %cl,%rdx
24: 48 85 d2 test %rdx,%rdx
27: 75 02 jne 0x2b
29: 5d pop %rbp
2a: c3 retq
2b:* 0f 0b ud2 <-- trapping instruction
2d: 48 03 05 7b 5b da 00 add 0xda5b7b(%rip),%rax # 0xda5baf
34: 48 81 ff ff ff ff 3f cmp $0x3fffffff,%rdi
3b: 76 ec jbe 0x29
3d: 0f 0b ud2
3f: 0f .byte 0xf

But again, I could be already sleeping and this could be me talking in
my sleep so don't take it too seriously.

In any case, this code was flaky and fragile for many reasons and it is
why this whole wankery is gone in the microcode loader now.

> Can you re-try without the AMD microcode driver for now?

Yeah, just boot with "dis_ucode_ldr".

--
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
--

2016-10-26 23:07:39

by Mel Gorman

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 03:09:41PM -0700, Linus Torvalds wrote:
> On Wed, Oct 26, 2016 at 3:03 PM, Mel Gorman <[email protected]> wrote:
> >
> > To be clear, are you referring to PeterZ's patch that avoids the lookup? If
> > so, I see your point.
>
> Yup, that's the one. I think you tested it. In fact, I'm sure you did,
> because I remember seeing performance numbers from you ;)
>

Yeah and the figures were fine. IIRC, 32-bit support was the main thing
that was missing but who cares, 32-bit is not going to have the NUMA issues
in any way that matters.

> So yes, I'd expect my patch on its own to quite possibly regress on
> NUMA systems (although I wonder how much),

I doubt it's a lot. Even if it does, it doesn't matter because it's a
functional fix.

> but I consider PeterZ's
> patch the fix to that, so I wouldn't worry about it.
>

Agreed. Peter, do you plan to finish that patch?

--
Mel Gorman
SUSE Labs

2016-10-26 23:14:04

by Borislav Petkov

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 05:01:24PM -0400, Bob Peterson wrote:
> Hm. It didn't even boot, at least on my amd box in the lab.
> I've made no attempt to debug this.

Btw, can you send me your .config so that I can try to reproduce?

I'm assuming you're booting latest Linus' tree on it?

I'd need to take care of this for 4.9.

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-10-27 00:38:43

by Bob Peterson

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

----- Original Message -----
| Btw, can you send me your .config so that I can try to reproduce?

Attached, but as Linus suggested, I turned off the AMD microcode driver,
so it should be the same if you turn it back on. If you want, I can
do it and re-send so you have a more pristine .config. Let me know.

| I'm assuming you're booting latest Linus' tree on it?

Yes.

Without the AMD microcode driver, the kernel boots just fine (I knew
it wasn't GFS2 because my root partition is ext4).

I'm doing some basic smoke tests on GFS2, but it seems to be running
just fine with Linus's aforementioned patch. I'll see if I can push it
a little harder.

Bob Peterson


Attachments:
.config (175.25 kB)

2016-10-27 09:18:10

by Mel Gorman

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Thu, Oct 27, 2016 at 10:08:52AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 27, 2016 at 12:07:26AM +0100, Mel Gorman wrote:
> > > but I consider PeterZ's
> > > patch the fix to that, so I wouldn't worry about it.
> > >
> >
> > Agreed. Peter, do you plan to finish that patch?
>
> I was waiting for you guys to hash out the 32bit issue. But if we're now
> OK with having this for 64bit only, I can certainly look at doing a new
> version.
>

I've no problem with it being 64-bit only.

> I'll have to look at fixing Alpha's bitops for that first though,
> because as is that patch relies on atomics to the same word not needing
> ordering, but placing the contended/waiters bit in the high word for
> 64bit only sorta breaks that.
>

I see the problem assuming you're referring to the requirement that locked
and waiter bits are on the same word. Without it, you need a per-arch helper
that forces ordering or takes a spinlock. I doubt it's worth the trouble.

> Hurm, we could of course play games with the layout, the 64bit only
> flags don't _have_ to be at the end.
>
> Something like so could work I suppose, but then there's a slight
> regression in the page_unlock() path, where we now do an unconditional
> spinlock; iow. we lose the unlocked waitqueue_active() test.
>

I can't convince myself it's worthwhile. At least, I can't see a penalty
of potentially moving one of the two bits to the high word. It's the
same cache line and the same op when it matters.

> We could re-instate this with an #ifndef CONFIG_NUMA I suppose.. not
> pretty though.
>
> Also did the s/contended/waiters/ rename per popular request.
>
> ---
> include/linux/page-flags.h | 19 ++++++++
> include/linux/pagemap.h | 25 ++++++++--
> include/trace/events/mmflags.h | 7 +++
> mm/filemap.c | 94 +++++++++++++++++++++++++++++++++++++----
> 4 files changed, 130 insertions(+), 15 deletions(-)
>
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -73,6 +73,14 @@
> */
> enum pageflags {
> PG_locked, /* Page is locked. Don't touch. */
> +#ifdef CONFIG_NUMA
> + /*
> + * This bit must end up in the same word as PG_locked (or any other bit
> + * we're waiting on), as per all architectures their bitop
> + * implementations.
> + */
> + PG_waiters, /* The hashed waitqueue has waiters */
> +#endif
> PG_error,
> PG_referenced,
> PG_uptodate,

I don't see why it should be NUMA-specific even though with Linus'
patch, NUMA is a concern. Even then, you still need a 64BIT check
because 32BIT && NUMA is allowed on a number of architectures.

Otherwise, nothing jumped out at me but glancing through it looked very
similar to the previous patch.

--
Mel Gorman
SUSE Labs

2016-10-27 13:56:08

by Borislav Petkov

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Wed, Oct 26, 2016 at 08:37:25PM -0400, Bob Peterson wrote:
> Attached, but as Linus suggested, I turned off the AMD microcode driver,
> so it should be the same if you turn it back on. If you want, I can
> do it and re-send so you have a more pristine .config. Let me know.

Thanks, but I was able to reproduce in a VM.

Here's a fix which works here - I'd appreciate it if you ran it and
checked the microcode was applied correctly, i.e.:

$ dmesg | grep -i microcode

before and after the patch. Please paste that output in a mail too.

Thanks!

---
From: Borislav Petkov <[email protected]>
Date: Thu, 27 Oct 2016 14:03:59 +0200
Subject: [PATCH] x86/microcode/AMD: Fix more fallout from CONFIG_RANDOMIZE_MEMORY

We needed the physical address of the container in order to compute the
offset within the relocated ramdisk. And we did this by doing __pa() on
the virtual address.

However, __pa() checks whether the physical address is within
PAGE_OFFSET and __START_KERNEL_map - see __phys_addr() - which fail
if we have CONFIG_RANDOMIZE_MEMORY enabled: we feed a virtual address
which *doesn't* have the randomization offset into a function which uses
PAGE_OFFSET which *does* have that offset.

This makes this check fire:

VIRTUAL_BUG_ON((x > y) || !phys_addr_valid(x));
^^^^^^

due to the randomization offset.

The fix is as simple as using __pa_nodebug() because we do that
randomization offset accounting later in that function ourselves.

Reported-by: Bob Peterson <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: [email protected] # 4.9
---
arch/x86/kernel/cpu/microcode/amd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c
index 620ab06bcf45..017bda12caae 100644
--- a/arch/x86/kernel/cpu/microcode/amd.c
+++ b/arch/x86/kernel/cpu/microcode/amd.c
@@ -429,7 +429,7 @@ int __init save_microcode_in_initrd_amd(void)
* We need the physical address of the container for both bitness since
* boot_params.hdr.ramdisk_image is a physical address.
*/
- cont = __pa(container);
+ cont = __pa_nodebug(container);
cont_va = container;
#endif

--
2.10.0

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-10-27 14:05:05

by Nicholas Piggin

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Thu, 27 Oct 2016 10:08:52 +0200
Peter Zijlstra <[email protected]> wrote:

> On Thu, Oct 27, 2016 at 12:07:26AM +0100, Mel Gorman wrote:
> > > but I consider PeterZ's
> > > patch the fix to that, so I wouldn't worry about it.
> > >
> >
> > Agreed. Peter, do you plan to finish that patch?
>
> I was waiting for you guys to hash out the 32bit issue. But if we're now
> OK with having this for 64bit only, I can certainly look at doing a new
> version.
>
> I'll have to look at fixing Alpha's bitops for that first though,
> because as is that patch relies on atomics to the same word not needing
> ordering, but placing the contended/waiters bit in the high word for
> 64bit only sorta breaks that.

I got mine working too. Haven't removed the bitops barrier (that's
for another day), or sorted the page flags in this one. But the core
code is there.

It's a bit more intrusive than your patch, but I like the end result
better. It just stops using the generic bit waiter code completely
and uses its own keys. Ends up making things easier, and we
could wait for other page details too if that was ever required.

It uses the same wait bit logic for all page waiters.
It keeps PageWaiters manipulation entirely under waitqueue lock, so no
additional data races beyond existing unlocked waitqueue_active tests.
And it checks to clear the waiter bit when no waiters for that page are
in the queue, so hash collisions with long waiters don't end up dragging
us into the slowpath always.

Also didn't uninline unlock_page yet. Still causes some text expansion,
but we should revisit that.

Thanks,
Nick

---
include/linux/page-flags.h | 2 +
include/linux/pagemap.h | 23 +++---
include/trace/events/mmflags.h | 1 +
mm/filemap.c | 157 ++++++++++++++++++++++++++++++++---------
mm/swap.c | 2 +
5 files changed, 138 insertions(+), 47 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda..8059c04 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -73,6 +73,7 @@
*/
enum pageflags {
PG_locked, /* Page is locked. Don't touch. */
+ PG_waiters, /* Page has waiters, check its waitqueue */
PG_error,
PG_referenced,
PG_uptodate,
@@ -253,6 +254,7 @@ static inline int TestClearPage##uname(struct page *page) { return 0; }
TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)

__PAGEFLAG(Locked, locked, PF_NO_TAIL)
+PAGEFLAG(Waiters, waiters, PF_NO_COMPOUND) __CLEARPAGEFLAG(Waiters, waiters, PF_NO_COMPOUND)
PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND)
PAGEFLAG(Referenced, referenced, PF_HEAD)
TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index dd15d39..97f2d0b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -477,22 +477,14 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
* and for filesystems which need to wait on PG_private.
*/
extern void wait_on_page_bit(struct page *page, int bit_nr);
-
extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
-extern int wait_on_page_bit_killable_timeout(struct page *page,
- int bit_nr, unsigned long timeout);
-
-static inline int wait_on_page_locked_killable(struct page *page)
-{
- if (!PageLocked(page))
- return 0;
- return wait_on_page_bit_killable(compound_head(page), PG_locked);
-}
+extern void wake_up_page_bit(struct page *page, int bit_nr);

-extern wait_queue_head_t *page_waitqueue(struct page *page);
static inline void wake_up_page(struct page *page, int bit)
{
- __wake_up_bit(page_waitqueue(page), &page->flags, bit);
+ if (!PageWaiters(page))
+ return;
+ wake_up_page_bit(page, bit);
}

/*
@@ -508,6 +500,13 @@ static inline void wait_on_page_locked(struct page *page)
wait_on_page_bit(compound_head(page), PG_locked);
}

+static inline int wait_on_page_locked_killable(struct page *page)
+{
+ if (!PageLocked(page))
+ return 0;
+ return wait_on_page_bit_killable(compound_head(page), PG_locked);
+}
+
/*
* Wait for a page to complete writeback
*/
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5a81ab4..7ac8c0a 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -81,6 +81,7 @@

#define __def_pageflag_names \
{1UL << PG_locked, "locked" }, \
+ {1UL << PG_waiters, "waiters" }, \
{1UL << PG_error, "error" }, \
{1UL << PG_referenced, "referenced" }, \
{1UL << PG_uptodate, "uptodate" }, \
diff --git a/mm/filemap.c b/mm/filemap.c
index 849f459..cab1f87 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -788,47 +788,137 @@ EXPORT_SYMBOL(__page_cache_alloc);
* at a cost of "thundering herd" phenomena during rare hash
* collisions.
*/
-wait_queue_head_t *page_waitqueue(struct page *page)
+static wait_queue_head_t *page_waitqueue(struct page *page)
{
const struct zone *zone = page_zone(page);

return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}
-EXPORT_SYMBOL(page_waitqueue);

-void wait_on_page_bit(struct page *page, int bit_nr)
+struct wait_page_key {
+ struct page *page;
+ int bit_nr;
+ int page_match;
+};
+
+struct wait_page_queue {
+ struct page *page;
+ int bit_nr;
+ wait_queue_t wait;
+};
+
+static int wake_page_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
{
- DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+ struct wait_page_key *key = arg;
+ struct wait_page_queue *wait_page
+ = container_of(wait, struct wait_page_queue, wait);
+
+ if (wait_page->page != key->page)
+ return 0;
+ key->page_match = 1;

- if (test_bit(bit_nr, &page->flags))
- __wait_on_bit(page_waitqueue(page), &wait, bit_wait_io,
- TASK_UNINTERRUPTIBLE);
+ if (wait_page->bit_nr != key->bit_nr)
+ return 0;
+ if (test_bit(key->bit_nr, &key->page->flags))
+ return 0;
+
+ return autoremove_wake_function(wait, mode, sync, key);
}
-EXPORT_SYMBOL(wait_on_page_bit);

-int wait_on_page_bit_killable(struct page *page, int bit_nr)
+void wake_up_page_bit(struct page *page, int bit_nr)
{
- DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+ wait_queue_head_t *q = page_waitqueue(page);
+ struct wait_page_key key;
+ unsigned long flags;

- if (!test_bit(bit_nr, &page->flags))
- return 0;
+ key.page = page;
+ key.bit_nr = bit_nr;
+ key.page_match = 0;

- return __wait_on_bit(page_waitqueue(page), &wait,
- bit_wait_io, TASK_KILLABLE);
+ spin_lock_irqsave(&q->lock, flags);
+ __wake_up_locked_key(q, TASK_NORMAL, &key);
+ if (!waitqueue_active(q) || !key.page_match) {
+ ClearPageWaiters(page);
+ /*
+ * It's possible to miss clearing Waiters here, when we woke
+ * our page waiters, but the hashed waitqueue has waiters for
+ * other pages on it.
+ *
+ * That's okay, it's a rare case. The next waker will clear it.
+ */
+ }
+ spin_unlock_irqrestore(&q->lock, flags);
}
+EXPORT_SYMBOL(wake_up_page_bit);

-int wait_on_page_bit_killable_timeout(struct page *page,
- int bit_nr, unsigned long timeout)
+static inline int wait_on_page_bit_common(wait_queue_head_t *q,
+ struct page *page, int bit_nr, int state, bool lock)
{
- DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+ struct wait_page_queue wait_page;
+ wait_queue_t *wait = &wait_page.wait;
+ int ret = 0;

- wait.key.timeout = jiffies + timeout;
- if (!test_bit(bit_nr, &page->flags))
- return 0;
- return __wait_on_bit(page_waitqueue(page), &wait,
- bit_wait_io_timeout, TASK_KILLABLE);
+ init_wait(wait);
+ wait->func = wake_page_function;
+ wait_page.page = page;
+ wait_page.bit_nr = bit_nr;
+
+ for (;;) {
+ spin_lock_irq(&q->lock);
+
+ if (likely(list_empty(&wait->task_list))) {
+ if (lock)
+ __add_wait_queue_tail_exclusive(q, wait);
+ else
+ __add_wait_queue(q, wait);
+ SetPageWaiters(page);
+ }
+
+ set_current_state(state);
+
+ spin_unlock_irq(&q->lock);
+
+ if (likely(test_bit(bit_nr, &page->flags))) {
+ io_schedule();
+ if (unlikely(signal_pending_state(state, current))) {
+ ret = -EINTR;
+ break;
+ }
+ }
+
+ if (lock) {
+ if (!test_and_set_bit_lock(bit_nr, &page->flags))
+ break;
+ } else {
+ if (!test_bit(bit_nr, &page->flags))
+ break;
+ }
+ }
+
+ finish_wait(q, wait);
+
+ /*
+ * A signal could leave PageWaiters set. Clearing it here if
+ * !waitqueue_active would be possible, but still fail to catch it in
+ * the case of wait hash collision. We already can fail to clear wait
+ * hash collision cases, so don't bother with signals either.
+ */
+
+ return ret;
+}
+
+void wait_on_page_bit(struct page *page, int bit_nr)
+{
+ wait_queue_head_t *q = page_waitqueue(page);
+ wait_on_page_bit_common(q, page, bit_nr, TASK_UNINTERRUPTIBLE, false);
+}
+EXPORT_SYMBOL(wait_on_page_bit);
+
+int wait_on_page_bit_killable(struct page *page, int bit_nr)
+{
+ wait_queue_head_t *q = page_waitqueue(page);
+ return wait_on_page_bit_common(q, page, bit_nr, TASK_KILLABLE, false);
}
-EXPORT_SYMBOL_GPL(wait_on_page_bit_killable_timeout);

/**
* add_page_wait_queue - Add an arbitrary waiter to a page's wait queue
@@ -844,6 +934,7 @@ void add_page_wait_queue(struct page *page, wait_queue_t *waiter)

spin_lock_irqsave(&q->lock, flags);
__add_wait_queue(q, waiter);
+ SetPageWaiters(page);
spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL_GPL(add_page_wait_queue);
@@ -925,23 +1016,19 @@ EXPORT_SYMBOL_GPL(page_endio);
* __lock_page - get a lock on the page, assuming we need to sleep to get it
* @page: the page to lock
*/
-void __lock_page(struct page *page)
+void __lock_page(struct page *__page)
{
- struct page *page_head = compound_head(page);
- DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);
-
- __wait_on_bit_lock(page_waitqueue(page_head), &wait, bit_wait_io,
- TASK_UNINTERRUPTIBLE);
+ struct page *page = compound_head(__page);
+ wait_queue_head_t *q = page_waitqueue(page);
+ wait_on_page_bit_common(q, page, PG_locked, TASK_UNINTERRUPTIBLE, true);
}
EXPORT_SYMBOL(__lock_page);

-int __lock_page_killable(struct page *page)
+int __lock_page_killable(struct page *__page)
{
- struct page *page_head = compound_head(page);
- DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);
-
- return __wait_on_bit_lock(page_waitqueue(page_head), &wait,
- bit_wait_io, TASK_KILLABLE);
+ struct page *page = compound_head(__page);
+ wait_queue_head_t *q = page_waitqueue(page);
+ return wait_on_page_bit_common(q, page, PG_locked, TASK_KILLABLE, true);
}
EXPORT_SYMBOL_GPL(__lock_page_killable);

diff --git a/mm/swap.c b/mm/swap.c
index 4dcf852..844baed 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -69,6 +69,7 @@ static void __page_cache_release(struct page *page)
del_page_from_lru_list(page, lruvec, page_off_lru(page));
spin_unlock_irqrestore(zone_lru_lock(zone), flags);
}
+ __ClearPageWaiters(page);
mem_cgroup_uncharge(page);
}

@@ -784,6 +785,7 @@ void release_pages(struct page **pages, int nr, bool cold)

/* Clear Active bit in case of parallel mark_page_accessed */
__ClearPageActive(page);
+ __ClearPageWaiters(page);

list_add(&page->lru, &pages_to_free);
}
--
2.9.3

2016-10-27 14:41:59

by Peter Zijlstra

Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Thu, Oct 27, 2016 at 12:07:26AM +0100, Mel Gorman wrote:
> > but I consider PeterZ's
> > patch the fix to that, so I wouldn't worry about it.
> >
>
> Agreed. Peter, do you plan to finish that patch?

I was waiting for you guys to hash out the 32bit issue. But if we're now
OK with having this for 64bit only, I can certainly look at doing a new
version.

I'll have to look at fixing Alpha's bitops for that first though,
because as is that patch relies on atomics to the same word not needing
ordering, but placing the contended/waiters bit in the high word for
64bit only sorta breaks that.

Hurm, we could of course play games with the layout, the 64bit only
flags don't _have_ to be at the end.

Something like so could work I suppose, but then there's a slight
regression in the page_unlock() path, where we now do an unconditional
spinlock; iow. we lose the unlocked waitqueue_active() test.

We could re-instate this with an #ifndef CONFIG_NUMA I suppose.. not
pretty though.

Also did the s/contended/waiters/ rename per popular request.

---
include/linux/page-flags.h | 19 ++++++++
include/linux/pagemap.h | 25 ++++++++--
include/trace/events/mmflags.h | 7 +++
mm/filemap.c | 94 +++++++++++++++++++++++++++++++++++++----
4 files changed, 130 insertions(+), 15 deletions(-)

--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -73,6 +73,14 @@
*/
enum pageflags {
PG_locked, /* Page is locked. Don't touch. */
+#ifdef CONFIG_NUMA
+ /*
+ * This bit must end up in the same word as PG_locked (or any other bit
+ * we're waiting on), as per all architectures their bitop
+ * implementations.
+ */
+ PG_waiters, /* The hashed waitqueue has waiters */
+#endif
PG_error,
PG_referenced,
PG_uptodate,
@@ -231,6 +239,9 @@ static __always_inline int TestClearPage
#define TESTPAGEFLAG_FALSE(uname) \
static inline int Page##uname(const struct page *page) { return 0; }

+#define TESTPAGEFLAG_TRUE(uname) \
+static inline int Page##uname(const struct page *page) { return 1; }
+
#define SETPAGEFLAG_NOOP(uname) \
static inline void SetPage##uname(struct page *page) { }

@@ -249,10 +260,18 @@ static inline int TestClearPage##uname(s
#define PAGEFLAG_FALSE(uname) TESTPAGEFLAG_FALSE(uname) \
SETPAGEFLAG_NOOP(uname) CLEARPAGEFLAG_NOOP(uname)

+#define PAGEFLAG_TRUE(uname) TESTPAGEFLAG_TRUE(uname) \
+ SETPAGEFLAG_NOOP(uname) CLEARPAGEFLAG_NOOP(uname)
+
#define TESTSCFLAG_FALSE(uname) \
TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)

__PAGEFLAG(Locked, locked, PF_NO_TAIL)
+#ifdef CONFIG_NUMA
+PAGEFLAG(Waiters, waiters, PF_NO_TAIL)
+#else
+PAGEFLAG_TRUE(Waiters);
+#endif
PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, PF_NO_COMPOUND)
PAGEFLAG(Referenced, referenced, PF_HEAD)
TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -427,7 +427,7 @@ extern void __lock_page(struct page *pag
extern int __lock_page_killable(struct page *page);
extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
unsigned int flags);
-extern void unlock_page(struct page *page);
+extern void __unlock_page(struct page *page);

static inline int trylock_page(struct page *page)
{
@@ -458,6 +458,20 @@ static inline int lock_page_killable(str
return 0;
}

+static inline void unlock_page(struct page *page)
+{
+ page = compound_head(page);
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ clear_bit_unlock(PG_locked, &page->flags);
+ /*
+ * Since PG_locked and PG_waiters are in the same word, Program-Order
+ * ensures the load of PG_waiters must not observe a value earlier
+ * than our clear_bit() store.
+ */
+ if (PageWaiters(page))
+ __unlock_page(page);
+}
+
/*
* lock_page_or_retry - Lock the page, unless this would block and the
* caller indicated that it can handle a retry.
@@ -482,11 +496,11 @@ extern int wait_on_page_bit_killable(str
extern int wait_on_page_bit_killable_timeout(struct page *page,
int bit_nr, unsigned long timeout);

+extern int wait_on_page_lock(struct page *page, int mode);
+
static inline int wait_on_page_locked_killable(struct page *page)
{
- if (!PageLocked(page))
- return 0;
- return wait_on_page_bit_killable(compound_head(page), PG_locked);
+ return wait_on_page_lock(page, TASK_KILLABLE);
}

extern wait_queue_head_t *page_waitqueue(struct page *page);
@@ -504,8 +518,7 @@ static inline void wake_up_page(struct p
*/
static inline void wait_on_page_locked(struct page *page)
{
- if (PageLocked(page))
- wait_on_page_bit(compound_head(page), PG_locked);
+ wait_on_page_lock(page, TASK_UNINTERRUPTIBLE);
}

/*
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -55,6 +55,12 @@
__def_gfpflag_names \
) : "none"

+#ifdef CONFIG_NUMA
+#define IF_HAVE_PG_WAITERS(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_WAITERS(flag,string)
+#endif
+
#ifdef CONFIG_MMU
#define IF_HAVE_PG_MLOCK(flag,string) ,{1UL << flag, string}
#else
@@ -81,6 +87,7 @@

#define __def_pageflag_names \
{1UL << PG_locked, "locked" }, \
+IF_HAVE_PG_WAITERS(PG_waiters, "waiters" ) \
{1UL << PG_error, "error" }, \
{1UL << PG_referenced, "referenced" }, \
{1UL << PG_uptodate, "uptodate" }, \
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -860,15 +860,30 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
* The mb is necessary to enforce ordering between the clear_bit and the read
* of the waitqueue (to avoid SMP races with a parallel wait_on_page_locked()).
*/
-void unlock_page(struct page *page)
+void __unlock_page(struct page *page)
{
- page = compound_head(page);
- VM_BUG_ON_PAGE(!PageLocked(page), page);
- clear_bit_unlock(PG_locked, &page->flags);
- smp_mb__after_atomic();
- wake_up_page(page, PG_locked);
+ wait_queue_head_t *wq = page_waitqueue(page);
+ unsigned long flags;
+
+ spin_lock_irqsave(&wq->lock, flags);
+ if (waitqueue_active(wq)) {
+ struct wait_bit_key key =
+ __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked);
+
+ __wake_up_locked_key(wq, TASK_NORMAL, &key);
+ } else {
+ /*
+ * We need to do ClearPageWaiters() under wq->lock such that
+ * we serialize against prepare_to_wait() adding waiters and
+ * setting task_struct::state.
+ *
+ * See lock_page_wait().
+ */
+ ClearPageWaiters(page);
+ }
+ spin_unlock_irqrestore(&wq->lock, flags);
}
-EXPORT_SYMBOL(unlock_page);
+EXPORT_SYMBOL(__unlock_page);

/**
* end_page_writeback - end writeback against a page
@@ -921,6 +936,55 @@ void page_endio(struct page *page, bool
}
EXPORT_SYMBOL_GPL(page_endio);

+static int lock_page_wait(struct wait_bit_key *word, int mode)
+{
+ struct page *page = container_of(word->flags, struct page, flags);
+
+ /*
+ * We cannot go to sleep without having PG_waiters set. This would mean
+ * nobody would issue a wakeup and we'd be stuck.
+ */
+ if (!PageWaiters(page)) {
+
+ /*
+ * There are two orderings of importance:
+ *
+ * 1)
+ *
+ * [unlock] [wait]
+ *
+ * clear PG_locked set PG_waiters
+ * test PG_waiters test (and-set) PG_locked
+ *
+ * Since these are in the same word, and the clear/set
+ * operations are atomic, they are ordered against one another.
+ * Program-Order further constrains the CPU from speculating the
+ * later load earlier than the RmW. So this doesn't
+ * need an explicit barrier. Also see unlock_page().
+ *
+ * 2)
+ *
+ * [unlock] [wait]
+ *
+ * LOCK wq->lock LOCK wq->lock
+ * __wake_up_locked || list-add
+ * clear PG_waiters set_current_state()
+ * UNLOCK wq->lock UNLOCK wq->lock
+ * set PG_waiters
+ *
+ * Since we're added to the waitqueue, we cannot get
+ * PG_waiters cleared without also getting TASK_RUNNING set,
+ * which will then void the schedule() call and we'll loop.
+ * Here wq->lock is sufficient ordering. See __unlock_page().
+ */
+ SetPageWaiters(page);
+
+ return 0;
+ }
+
+ return bit_wait_io(word, mode);
+}
+
/**
* __lock_page - get a lock on the page, assuming we need to sleep to get it
* @page: the page to lock
@@ -930,7 +994,7 @@ void __lock_page(struct page *page)
struct page *page_head = compound_head(page);
DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);

- __wait_on_bit_lock(page_waitqueue(page_head), &wait, bit_wait_io,
+ __wait_on_bit_lock(page_waitqueue(page_head), &wait, lock_page_wait,
TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(__lock_page);
@@ -941,10 +1005,22 @@ int __lock_page_killable(struct page *pa
DEFINE_WAIT_BIT(wait, &page_head->flags, PG_locked);

return __wait_on_bit_lock(page_waitqueue(page_head), &wait,
- bit_wait_io, TASK_KILLABLE);
+ lock_page_wait, TASK_KILLABLE);
}
EXPORT_SYMBOL_GPL(__lock_page_killable);

+int wait_on_page_lock(struct page *page, int mode)
+{
+ struct page __always_unused *__page = (page = compound_head(page));
+ DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+
+ if (!PageLocked(page))
+ return 0;
+
+ return __wait_on_bit(page_waitqueue(page), &wait, lock_page_wait, mode);
+}
+EXPORT_SYMBOL(wait_on_page_lock);
+
/*
* Return values:
* 1 - page is locked; mmap_sem is still held.

2016-10-27 14:48:35

by Mel Gorman

[permalink] [raw]
Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Thu, Oct 27, 2016 at 11:44:49AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 27, 2016 at 10:07:42AM +0100, Mel Gorman wrote:
> > > Something like so could work I suppose, but then there's a slight
> > > regression in the page_unlock() path, where we now do an unconditional
> > > spinlock; iow. we lose the unlocked waitqueue_active() test.
> > >
> >
> > I can't convince myself it's worthwhile. At least, I can't see a penalty
> > of potentially moving one of the two bits to the high word. It's the
> > same cache line and the same op when it matters.
>
> I'm having trouble connecting these here two paragraphs. Or were you
> replying to something else?
>
> So the current unlock code does:
>
> wake_up_page()
> if (waitqueue_active())
> __wake_up() /* takes waitqueue spinlocks here */
>
> While the new one does:
>
> spin_lock(&q->lock);
> if (waitqueue_active()) {
> __wake_up_common()
> }
> spin_unlock(&q->lock);
>
> Which is an unconditional atomic op (these go for about 20 cycles
> each, when uncontended).
>

Ok, we were thinking about different things but I'm not sure I get your
concern. With your patch, in the uncontended case we check the waiters
bit and if there is no contention, we carry on. In the contended case,
the lock is taken. Given that contention is likely to be due to IO being
completed, I don't think the atomic op on top is going to make that much
of a difference.

About the only hazard I can think of is when unrelated pages hash to the
same queue, so there is an extra op for the "fake contended" case. I
don't think it's worth worrying about: false contention and the extra
atomic op might hurt some workload, but the common case avoids a
lookup.
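
For background on how that collision happens: the waitqueue is chosen by
hashing the page pointer into a per-zone table, so two unrelated pages can
land on the same wait queue head. A minimal sketch, modelled on the
page_waitqueue() of the 4.9-rc kernels under discussion (the per-zone
wait_table fields are specific to that era):

static wait_queue_head_t *page_waitqueue(struct page *page)
{
	const struct zone *zone = page_zone(page);

	/*
	 * Hash the page pointer into the zone's table of wait queue
	 * heads; unrelated pages can pick the same head, which is the
	 * "fake contended" case above.
	 */
	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}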

> > I don't see why it should be NUMA-specific even though with Linus'
> > patch, NUMA is a concern. Even then, you still need a 64BIT check
> > because 32BIT && NUMA is allowed on a number of architectures.
>
> Oh, I thought we killed 32bit NUMA and didn't check. I can make it
> CONFIG_64BIT and be done with it. s/CONFIG_NUMA/CONFIG_64BIT/ on the
> patch should do :-)
>

Sounds good.

--
Mel Gorman
SUSE Labs

2016-10-27 15:19:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Thu, Oct 27, 2016 at 10:07:42AM +0100, Mel Gorman wrote:
> > Something like so could work I suppose, but then there's a slight
> > regression in the page_unlock() path, where we now do an unconditional
> > spinlock; iow. we lose the unlocked waitqueue_active() test.
> >
>
> I can't convince myself it's worthwhile. At least, I can't see a penalty
> of potentially moving one of the two bits to the high word. It's the
> same cache line and the same op when it matters.

I'm having trouble connecting these here two paragraphs. Or were you
replying to something else?

So the current unlock code does:

wake_up_page()
if (waitqueue_active())
__wake_up() /* takes waitqueue spinlocks here */

While the new one does:

spin_lock(&q->lock);
if (waitqueue_active()) {
__wake_up_common()
}
spin_unlock(&q->lock);

Which is an unconditional atomic op (these go for about 20 cycles
each, when uncontended).
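
For reference, waitqueue_active() is just a lockless emptiness test, which
is what keeps the old fast path free of atomics; roughly, per the
include/linux/wait.h of that era (the list member has since been renamed):

static inline int waitqueue_active(wait_queue_head_t *q)
{
	/* no locking: callers must supply their own ordering */
	return !list_empty(&q->task_list);
}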


> > +++ b/include/linux/page-flags.h
> > @@ -73,6 +73,14 @@
> > */
> > enum pageflags {
> > PG_locked, /* Page is locked. Don't touch. */
> > +#ifdef CONFIG_NUMA
> > + /*
> > + * This bit must end up in the same word as PG_locked (or any other bit
> > + * we're waiting on), as per all architectures their bitop
> > + * implementations.
> > + */
> > + PG_waiters, /* The hashed waitqueue has waiters */
> > +#endif
> > PG_error,
> > PG_referenced,
> > PG_uptodate,
>
> I don't see why it should be NUMA-specific even though with Linus'
> patch, NUMA is a concern. Even then, you still need a 64BIT check
> because 32BIT && NUMA is allowed on a number of architectures.

Oh, I thought we killed 32bit NUMA and didn't check. I can make it
CONFIG_64BIT and be done with it. s/CONFIG_NUMA/CONFIG_64BIT/ on the
patch should do :-)

> Otherwise, nothing jumped out at me but glancing through it looked very
> similar to the previous patch.

Right, all the difference was in the bit being conditional and having a
different name.

2016-10-27 18:52:38

by Bob Peterson

[permalink] [raw]
Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

| Here's a fix which works here - I'd appreciate it if you ran it and
| checked the microcode was applied correctly, i.e.:
|
| $ dmesg | grep -i microcode
|
| before and after the patch. Please paste that output in a mail too.

Hi Borislav,

Sorry it's taken me so long. I've been having issues.
I couldn't recreate that first boot failure, even using .config.old,
and even after removing (rm -fR) my linux.git and untarring it from the
original tarball, doing a make clean, etc.
The output before and after your new patch is the same (except for the times):

# dmesg | grep -i microcode
[ 5.291679] microcode: microcode updated early to new patch_level=0x010000d9
[ 5.298761] microcode: CPU0: patch_level=0x010000d9
[ 5.303648] microcode: CPU1: patch_level=0x010000d9
[ 5.308529] microcode: CPU2: patch_level=0x010000d9
[ 5.313414] microcode: CPU3: patch_level=0x010000d9
[ 5.360834] microcode: CPU4: patch_level=0x010000d9
[ 5.365719] microcode: CPU5: patch_level=0x010000d9
[ 5.370602] microcode: CPU6: patch_level=0x010000d9
[ 5.375486] microcode: CPU7: patch_level=0x010000d9
[ 5.380372] microcode: CPU8: patch_level=0x010000d9
[ 5.385256] microcode: CPU9: patch_level=0x010000d9
[ 5.390142] microcode: CPU10: patch_level=0x010000d9
[ 5.395102] microcode: CPU11: patch_level=0x010000d9
[ 5.437813] microcode: CPU12: patch_level=0x010000d9
[ 5.442785] microcode: CPU13: patch_level=0x010000d9
[ 5.447755] microcode: CPU14: patch_level=0x010000d9
[ 5.452724] microcode: CPU15: patch_level=0x010000d9
[ 5.457756] microcode: Microcode Update Driver: v2.01 <[email protected]>, Peter Oruba
# uname -a
Linux intec2 4.9.0-rc2+ #2 SMP Thu Oct 27 14:29:32 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Regards,

Bob Peterson

2016-10-27 19:25:56

by Borislav Petkov

[permalink] [raw]
Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Thu, Oct 27, 2016 at 02:51:30PM -0400, Bob Peterson wrote:
> I couldn't recreate that first boot failure, even using .config.old,
> and even after removing (rm -fR) my linux.git and untarring it from the
> original tarball, doing a make clean, etc.

Hmm, so it could also depend on the randomized offset as it is getting
generated anew each boot. So you could try to boot a couple of times
to see if the randomized offset is generated just right for the bug
condition to match.

I mean, it would be great if you try a couple times but even if you're
unsuccessful, that's fine too - the fix is obviously correct and I've
confirmed that it boots fine in my VM here.

> The output before and after your new patch is the same (except for the times):
>
> # dmesg | grep -i microcode
> [ 5.291679] microcode: microcode updated early to new patch_level=0x010000d9

That looks good.

Thanks!

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-10-27 21:04:12

by Bob Peterson

[permalink] [raw]
Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

----- Original Message -----
| I mean, it would be great if you try a couple times but even if you're
| unsuccessful, that's fine too - the fix is obviously correct and I've
| confirmed that it boots fine in my VM here.

Hi Boris,

I rebooted the machine with and without your patch, about 15 times each,
and no failures. Not sure why I got it the first time. Must have been a one-off.

Bob Peterson

2016-10-27 21:19:35

by Borislav Petkov

[permalink] [raw]
Subject: Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

On Thu, Oct 27, 2016 at 05:03:13PM -0400, Bob Peterson wrote:
> I rebooted the machine with and without your patch, about 15 times
> each, and no failures. Not sure why I got it the first time. Must have
> been a one-off.

Ok, thanks for giving it a try!

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Subject: [tip:x86/urgent] x86/microcode/AMD: Fix more fallout from CONFIG_RANDOMIZE_MEMORY=y

Commit-ID: 1c27f646b18fb56308dff82784ca61951bad0b48
Gitweb: http://git.kernel.org/tip/1c27f646b18fb56308dff82784ca61951bad0b48
Author: Borislav Petkov <[email protected]>
AuthorDate: Thu, 27 Oct 2016 14:36:23 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 28 Oct 2016 10:29:59 +0200

x86/microcode/AMD: Fix more fallout from CONFIG_RANDOMIZE_MEMORY=y

We needed the physical address of the container in order to compute the
offset within the relocated ramdisk. And we did this by doing __pa() on
the virtual address.

However, __pa() checks whether the physical address is within
PAGE_OFFSET and __START_KERNEL_map - see __phys_addr() - and these
checks fail if we have CONFIG_RANDOMIZE_MEMORY enabled: we feed a
virtual address which *doesn't* have the randomization offset into a
function using a PAGE_OFFSET which *does* have that offset.

This makes this check fire:

VIRTUAL_BUG_ON((x > y) || !phys_addr_valid(x));
               ^^^^^^

due to the randomization offset.

The fix is as simple as using __pa_nodebug() because we do that
randomization offset accounting later in that function ourselves.
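
For the curious, a simplified sketch of the check that fires, modelled on
the arch/x86/mm/physaddr.c of that time (details vary by version and
config):

unsigned long __phys_addr(unsigned long x)
{
	unsigned long y = x - __START_KERNEL_map;

	/* use the carry to see whether x was below __START_KERNEL_map */
	if (unlikely(x > y)) {
		/* kernel image mapping: account for phys_base */
		x = y + phys_base;
		VIRTUAL_BUG_ON(y >= KERNEL_IMAGE_SIZE);
	} else {
		/* otherwise assume the direct map: subtract PAGE_OFFSET */
		x = y + (__START_KERNEL_map - PAGE_OFFSET);
		/* fires when x wasn't a valid direct-map address, e.g.
		 * when PAGE_OFFSET is randomized but x isn't */
		VIRTUAL_BUG_ON((x > y) || !phys_addr_valid(x));
	}
	return x;
}

__pa_nodebug() skips the VIRTUAL_BUG_ON() checks and only does the
arithmetic, which is safe here because the caller accounts for the
randomization offset itself.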

Reported-by: Bob Peterson <[email protected]>
Tested-by: Bob Peterson <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Andreas Gruenbacher <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steven Whitehouse <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: linux-mm <[email protected]>
Cc: [email protected] # 4.9
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/microcode/amd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c
index 620ab06..017bda1 100644
--- a/arch/x86/kernel/cpu/microcode/amd.c
+++ b/arch/x86/kernel/cpu/microcode/amd.c
@@ -429,7 +429,7 @@ int __init save_microcode_in_initrd_amd(void)
* We need the physical address of the container for both bitness since
* boot_params.hdr.ramdisk_image is a physical address.
*/
- cont = __pa(container);
+ cont = __pa_nodebug(container);
cont_va = container;
#endif