2008-11-17 09:07:58

by Jan Beulich

Subject: arch_flush_lazy_mmu_mode() in arch/x86/mm/highmem_32.c

Commit 49f19710512c825aaea73b9207b3a848027cda1d hints at the
current solution not being the final one, yet the advertised (for 2.6.22)
change apparently never happened. However, it seems to me that
flushing a potentially huge (in terms of time to completion) batch
asynchronously is not really a good idea. Instead, I'd think that adding to
the batch should be prevented in asynchronous contexts altogether, or
things should properly nest. As a positive side effect, disabling interrupts
in the batch handling - in particular around the submission of the batch -
could also be avoided, reducing interrupt latency (perhaps significantly
in some cases).
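
To illustrate what I mean, here is a minimal user-space sketch (all of
the names below are invented for the example; this is not meant to be
actual kernel code): an update issued from an asynchronous context never
touches the batch, so nothing ever has to flush a potentially large
pending batch from interrupt context.

#include <stdio.h>
#include <stdbool.h>

#define BATCH_MAX 32

struct mmu_update { unsigned long addr, val; };

static struct mmu_update batch[BATCH_MAX];
static int batch_len;
static bool lazy_mode;        /* inside a lazy MMU section */
static bool async_context;    /* stand-in for "running asynchronously" */

static void apply_now(struct mmu_update u)
{
    printf("apply %#lx <- %#lx\n", u.addr, u.val);
}

static void flush_batch(void)
{
    int i;

    for (i = 0; i < batch_len; i++)
        apply_now(batch[i]);
    batch_len = 0;
}

static void queue_update(struct mmu_update u)
{
    /* The proposal: asynchronous contexts bypass the batch entirely. */
    if (!lazy_mode || async_context) {
        apply_now(u);
        return;
    }
    batch[batch_len++] = u;
    if (batch_len == BATCH_MAX)
        flush_batch();
}

int main(void)
{
    lazy_mode = true;
    queue_update((struct mmu_update){ 0x1000, 1 }); /* batched */
    async_context = true;
    queue_update((struct mmu_update){ 0x2000, 2 }); /* applied immediately */
    async_context = false;
    flush_batch();                                  /* end of lazy section */
    return 0;
}

Ordering of updates issued this way relative to the pending batch is of
course a separate question.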

Likewise I would think that the flush out of vmalloc_sync_one() isn't
appropriate, and it should rather be avoided for the set_pmd() there to
get into the batching code altogether.

(Background to this: After adding lazy mmu mode support in our forward
ported tree, we've actually been hit by these calls out of kmap_...(), as
originally I didn't pay much attention to these and didn't care to synchronize
batch addition and flushing with asynchronous operations.)

Jan


2008-11-17 16:58:35

by Zachary Amsden

Subject: Re: arch_flush_lazy_mmu_mode() in arch/x86/mm/highmem_32.c

On Mon, 2008-11-17 at 01:08 -0800, Jan Beulich wrote:
> Commit 49f19710512c825aaea73b9207b3a848027cda1d hints at the
> current solution not being the final one, yet the advertised (for 2.6.22)
> change apparently never happened. However, it seems to me that
> flushing a potentially huge (in terms of time to completion) batch
> asynchronously is not really a good idea. Instead, I'd think that adding to

The batch size is limited by many factors; the number of possible PTEs
updated is only 64 in the COW path, and at most 1024 for the zero / remap /
mprotect paths. While 1024 is large, I don't think it qualifies
as huge, and it should really be very uncommon. In practice, the
batching under VMI is limited by the size of the hypercall queue, which
is only 256 entries.

> the batch should be prevented in asynchronous contexts altogether, or
> things should properly nest. As a positive side effect, disabling interrupts
> in the batch handling - in particular around the submission of the batch -
> could also be avoided, reducing interrupt latency (perhaps significantly
> in some cases).

Jeremy already fixed that; we don't disable interrupts, the change he
made was to flush and then immediately restart the batching.

Re: nesting properly, there is no low-level interface provided to issue
"forced" updates, that is updates which bypass any batching and take
effect immediately. We considered it, but wanted a simple design of a
strictly in-order update queue to reduce complexity. This means you
either have no batching, eliminate the scenarios which require nesting,
disable interrupts, or flush previously batched updates when nested
operations are required. The latter seems the lesser of the evils.
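
In toy form, the chosen approach looks something like this (user-space
sketch only, all names invented): a nested update first pushes out
everything already queued, so the hypervisor still sees every update in
program order, which is what we wanted to preserve.

#include <stdio.h>

#define QUEUE_MAX 32

struct mmu_update { unsigned long addr, val; };

static struct mmu_update queue[QUEUE_MAX];
static int queue_len;

static void issue(struct mmu_update u)
{
    printf("issue %#lx <- %#lx\n", u.addr, u.val);
}

static void flush_queue(void)
{
    int i;

    for (i = 0; i < queue_len; i++)
        issue(queue[i]);
    queue_len = 0;
}

/* Normal batched update within a lazy MMU section. */
static void queue_update(struct mmu_update u)
{
    if (queue_len == QUEUE_MAX)
        flush_queue();
    queue[queue_len++] = u;
}

/* A nested update that must take effect immediately: flush what is
 * pending, then issue this one, keeping everything strictly in order. */
static void immediate_update(struct mmu_update u)
{
    flush_queue();
    issue(u);
}

int main(void)
{
    queue_update((struct mmu_update){ 0x1000, 1 });
    queue_update((struct mmu_update){ 0x2000, 2 });
    immediate_update((struct mmu_update){ 0x3000, 3 }); /* e.g. from an interrupt */
    flush_queue();
    return 0;
}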

>
> Likewise I would think that the flush out of vmalloc_sync_one() isn't
> appropriate, and it should rather be avoided for the set_pmd() there to
> get into the batching code altogether.

That's impossible. The flush is needed and there is no way to avoid it.
The kernel has no general restrictions about contexts in which it is
safe vs. unsafe to touch the kernel's own vmalloc'ed memory, so you can
get a page fault due to lazy syncing of vmalloc area PDEs in non-PAE
mode. You really have to service that fault.

> (Background to this: After adding lazy mmu mode support in our forward
> ported tree, we've actually been hit by these calls out of kmap_...(), as
> originally I didn't pay much attention to these and didn't care to synchronize
> batch addition and flushing with asynchronous operations.)

What tree is that? Unclear from your message.

Thanks,

Zach

2008-11-17 18:40:27

by Jeremy Fitzhardinge

Subject: Re: arch_flush_lazy_mmu_mode() in arch/x86/mm/highmem_32.c

Zachary Amsden wrote:
> On Mon, 2008-11-17 at 01:08 -0800, Jan Beulich wrote:
>
>> Commit 49f19710512c825aaea73b9207b3a848027cda1d hints at the
>> current solution not being the final one, yet the advertised (for 2.6.22)
>> change apparently never happened. However, it seems to me that
>> flushing a potentially huge (in terms of time to completion) batch
>> asynchronously is not really a good idea. Instead, I'd think that adding to
>>
>
> The batch size is limited by many factors; the number of possible PTEs
> updated is only 64 in the COW path, and at most 1024 for the zero / remap /
> mprotect paths. While 1024 is large, I don't think it qualifies
> as huge, and it should really be very uncommon. In practice, the
> batching under VMI is limited by the size of the hypercall queue, which
> is only 256 entries.
>

And the Xen multicall queue is only 32 entries too.

Also, the lazy updates can only be deferred for as long as you're
holding the appropriate pte lock, so it's already a non-preemptible
path. The batch-executing hypercall could cause a bit of an interrupt
hiccup, but that's really a tradeoff between large batches and hypercall
overhead. I thought a 32x reduction in hypercall cost would be
reasonable in terms of increased interrupt latency.

>> the batch should be prevented in asynchronous contexts altogether, or
>> things should properly nest. As a positive side effect, disabling interrupts
>> in the batch handling - in particular around the submission of the batch -
>> could also be avoided, reducing interrupt latency (perhaps significantly
>> in some cases).
>>
>
> Jeremy already fixed that; we don't disable interrupts, the change he
> made was to flush and then immediately restart the batching.
>

Yes. The Xen code only disables interrupts temporarily while actually
constructing a new multicall list member, to stop a half-constructed
multicall from being issued by a nested flush. But that's very brief,
and cheap under Xen.
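
Schematically it is just this (invented names, not the real
multicalls.c): interrupts are masked only around building one entry, so
a flush triggered from an interrupt handler can never submit a
half-constructed entry.

#include <stdio.h>
#include <stdbool.h>

#define MC_MAX 32

struct multicall_entry { int op; unsigned long arg; };

static struct multicall_entry mc_queue[MC_MAX];
static int mc_count;
static bool irqs_enabled = true;

/* Stand-ins for local_irq_save()/local_irq_restore(). */
static bool irq_save(void)      { bool f = irqs_enabled; irqs_enabled = false; return f; }
static void irq_restore(bool f) { irqs_enabled = f; }

static void mc_flush(void)
{
    int i;

    for (i = 0; i < mc_count; i++)
        printf("hypercall op=%d arg=%#lx\n", mc_queue[i].op, mc_queue[i].arg);
    mc_count = 0;
}

static void mc_add(int op, unsigned long arg)
{
    /* The only irq-off window: a handful of stores per entry. */
    bool flags = irq_save();

    if (mc_count == MC_MAX)
        mc_flush();
    mc_queue[mc_count].op = op;
    mc_queue[mc_count].arg = arg;
    mc_count++;
    irq_restore(flags);
}

int main(void)
{
    mc_add(1, 0x1000);
    mc_add(2, 0x2000);
    mc_flush();
    return 0;
}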

> Re: nesting properly, there is no low-level interface provided to issue
> "forced" updates, that is updates which bypass any batching and take
> effect immediately. We considered it, but wanted a simple design of a
> strictly in-order update queue to reduce complexity. This means you
> either have no batching, eliminate the scenarios which require nesting,
> disable interrupts, or flush previously batched updates when nested
> operations are required. The latter seems the lesser of the evils.
>

Yeah. There's really been no call to do anything more complex here.

>> Likewise I would think that the flush out of vmalloc_sync_one() isn't
>> appropriate, and it should rather be avoided for the set_pmd() there to
>> get into the batching code altogether.
>>
>
> That's impossible. The flush is needed and there is no way to avoid it.
> The kernel has no general restrictions about contexts in which it is
> safe vs. unsafe to touch the kernel's own vmalloc'ed memory, so you can
> get a page fault due to lazy syncing of vmalloc area PDEs in non-PAE
> mode. You really have to service that fault.
>

You could do the flush in the fault handler itself, rather than
vmalloc_sync_one. If you enter the handler with outstanding updates,
then flush them and return. Hm, but that only works if you're always
going from NP->P; if you're doing P->P updates then you may just end up
with stale mappings.
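
In toy form (made-up names, user-space model), the scheme and why it
only covers NP->P:

#include <stdio.h>

#define NPAGES 4
#define PENDING_MAX 8

static unsigned long page_table[NPAGES];                 /* 0 == not present */
static struct { int idx; unsigned long val; } pending[PENDING_MAX];
static int npending;

static void flush_pending(void)
{
    int i;

    for (i = 0; i < npending; i++)
        page_table[pending[i].idx] = pending[i].val;
    npending = 0;
}

static void lazy_set(int idx, unsigned long val)
{
    pending[npending].idx = idx;
    pending[npending].val = val;
    npending++;
}

/* Service a not-present "fault" by flushing pending updates and
 * retrying; NP->P works because the stale state faults. */
static unsigned long access_page(int idx)
{
    if (page_table[idx] == 0)       /* "page fault" */
        flush_pending();            /* then retry the access */
    return page_table[idx];
}

int main(void)
{
    lazy_set(1, 0xaa);                               /* NP->P, still pending */
    printf("page 1 -> %#lx\n", access_page(1));      /* faults, flushes, sees 0xaa */

    lazy_set(1, 0xbb);                               /* P->P, still pending */
    printf("page 1 -> %#lx\n", access_page(1));      /* no fault: stale 0xaa */
    return 0;
}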

>> (Background to this: After adding lazy mmu mode support in our forward
>> ported tree, we've actually been hit by these calls out of kmap_...(), as
>> originally I didn't pay much attention to these and didn't care to synchronize
>> batch addition and flushing with asynchronous operations.)
>>
>
> What tree is that? Unclear from your message.
>

The Novell kernel tree. Jan's been doggedly forward-porting the old Xen
patches.

J

2008-11-17 19:00:22

by Zachary Amsden

Subject: Re: arch_flush_lazy_mmu_mode() in arch/x86/mm/highmem_32.c

On Mon, 2008-11-17 at 10:40 -0800, Jeremy Fitzhardinge wrote:

> Yes. The Xen code only disables interrupts temporarily while actually
> constructing a new multicall list member, to stop a half-constructed
> multicall from being issued by a nested flush. But that's very brief,
> and cheap under Xen.

We have truly magical ways of doing the same thing.

> You could do the flush in the fault handler itself, rather than
> vmalloc_sync_one. If you enter the handler with outstanding updates,
> then flush them and return. Hm, but that only works if you're always
> going from NP->P; if you're doing P->P updates then you may just end up
> with stale mappings.

vmalloc_sync_one really is just the fault handler, factored out to look
nice... in any case, the faults here will always be NP->P; once created,
the page tables handling the vmalloc area are never released, so the
PDE never makes a P->P or P->NP transition (the PTEs do).

> The Novell kernel tree. Jan's been doggedly forward-porting the old Xen
> patches.

Okay, that explains it... the patch sequence here contains a bit of
"fun" IIRC.

2008-11-18 08:02:37

by Jan Beulich

Subject: Re: arch_flush_lazy_mmu_mode() in arch/x86/mm/highmem_32.c

>>> Jeremy Fitzhardinge <[email protected]> 17.11.08 19:40 >>>
>Zachary Amsden wrote:
>> On Mon, 2008-11-17 at 01:08 -0800, Jan Beulich wrote:
>>> the batch should be prevented in asynchronous contexts altogether, or
>>> things should properly nest. As a positive side effect, disabling interrupts
>>> in the batch handling - in particular around the submission of the batch -
>>> could also be avoided, reducing interrupt latency (perhaps significantly
>>> in some cases).
>>>
>>
>> Jeremy already fixed that; we don't disable interrupts, the change he
>> made was to flush and then immediately restart the batching.
>>
>
>Yes. The Xen code only disables interrupts temporarily while actually
>constructing a new multicall list member, to stop a half-constructed
>multicall from being issued by a nested flush. But that's very brief,
>and cheap under Xen.

Where's that fixed? Even in the -tip tree I still see xen_mc_flush()
disabling interrupts (and multicalls.c didn't change for over two months)...

>>> Likewise I would think that the flush out of vmalloc_sync_one() isn't
>>> appropriate, and it should rather be avoided for the set_pmd() there to
>>> get into the batching code altogether.
>>>
>>
>> That's impossible. The flush is needed and there is no way to avoid it.
>> The kernel has no general restrictions about contexts in which it is
>> safe vs. unsafe to touch the kernel's own vmalloc'ed memory, so you can
>> get a page fault due to lazy syncing of vmalloc area PDEs in non-PAE
>> mode. You really have to service that fault.
>>
>
>You could do the flush in the fault handler itself, rather than
>vmalloc_sync_one. If you enter the handler with outstanding updates,
>then flush them and return. Hm, but that only works if you're always
>going from NP->P; if you're doing P->P updates then you may just end up
>with stale mappings.

There's no reason to do any flush at all if you suppress batching temporarily.
And it only needs (would need) explicit suppressing here because you can't
easily recognize being in the context of a page fault handler from the
batching functions (other than recognizing being in the context of an
interrupt handler, which is what would allow removing the flush calls from
highmem_32.c).
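
What I have in mind is an explicit, nestable suppress/resume pair put
around such paths; roughly (sketch only, invented names, not actual
kernel code):

#include <stdio.h>

#define BATCH_MAX 32

struct mmu_update { unsigned long addr, val; };

static struct mmu_update batch[BATCH_MAX];
static int batch_len;
static int suppress_count;    /* bumped e.g. around the fault handler */

static void apply_now(struct mmu_update u)
{
    printf("apply %#lx <- %#lx\n", u.addr, u.val);
}

static void flush_batch(void)
{
    int i;

    for (i = 0; i < batch_len; i++)
        apply_now(batch[i]);
    batch_len = 0;
}

static void suppress_batching(void) { suppress_count++; }
static void resume_batching(void)   { suppress_count--; }

static void queue_update(struct mmu_update u)
{
    /* While suppressed, apply directly: the interrupted context's
     * pending batch is neither flushed nor grown. */
    if (suppress_count) {
        apply_now(u);
        return;
    }
    batch[batch_len++] = u;
    if (batch_len == BATCH_MAX)
        flush_batch();
}

int main(void)
{
    queue_update((struct mmu_update){ 0x1000, 1 });   /* batched */
    suppress_batching();                              /* entering the fault path */
    queue_update((struct mmu_update){ 0x2000, 2 });   /* applied immediately */
    resume_batching();
    flush_batch();
    return 0;
}

The fault path then adds only its own update's worth of latency, not
that of whatever batch it happened to interrupt.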

Jan

2008-11-18 17:01:44

by Jeremy Fitzhardinge

Subject: Re: arch_flush_lazy_mmu_mode() in arch/x86/mm/highmem_32.c

Jan Beulich wrote:
>>>> Jeremy Fitzhardinge <[email protected]> 17.11.08 19:40 >>>
>>>>
>> Zachary Amsden wrote:
>>
>>> On Mon, 2008-11-17 at 01:08 -0800, Jan Beulich wrote:
>>>
>>>> the batch should be prevented in asynchronous contexts altogether, or
>>>> things should properly nest. As a positive side effect, disabling interrupts
>>>> in the batch handling - in particular around the submission of the batch -
>>>> could also be avoided, reducing interrupt latency (perhaps significantly
>>>> in some cases).
>>>>
>>>>
>>> Jeremy already fixed that; we don't disable interrupts, the change he
>>> made was to flush and then immediately restart the batching.
>>>
>>>
>> Yes. The Xen code only disables interrupts temporarily while actually
>> constructing a new multicall list member, to stop a half-constructed
>> multicall from being issued by a nested flush. But that's very brief,
>> and cheap under Xen.
>>
>
> Where's that fixed? Even in the -tip tree I still see xen_mc_flush()
> disabling interrupts (and multicalls.c didn't change for over two months)...
>

Yes, it disables interrupts while it's actually issuing the multicall. I
don't think that matters much, since the multicall itself can't be
preempted (can it?) and the rest of the code is very short. Originally
it disabled interrupts for the entire lazy section, which is obviously
worse.

> There's no reason to do any flush at all if you suppress batching temporarily.
> And it only needs (would need) explicit suppressing here because you can't
> easily recognize being in the context of a page fault handler from the
> batching functions (other than recognizing being in the context of an
> interrupt handler, which is what would allow removing the flush calls from
> highmem_32.c).

I'm not sure what your concern is here. If batching is currently
enabled, then the flush will push out anything pending immediately. If
batching is disabled, then the flush will be a noop and return immediately.

Making the batching state nest, or adding some way to push unbatched
updates past the current batch, would both work as mechanisms, but I
don't see what advantage there is to doing it, especially weighed
against the increased complexity.

J

2008-11-18 17:28:23

by Jan Beulich

Subject: Re: arch_flush_lazy_mmu_mode() in arch/x86/mm/highmem_32.c

>>> Jeremy Fitzhardinge <[email protected]> 18.11.08 18:01 >>>
>Yes, it disables interrupts while it's actually issuing the multicall. I
>don't think that matters much, since the multicall itself can't be
>preempted (can it?) and the rest of the code is very short. Originally
>it disabled interrupts for the entire lazy section, which is obviously
>worse.

If an interrupt (event) comes in, a multicall could of course be 'preempted'
in order to service the event - but that works only if event delivery
isn't disabled.

>> There's no reason to do any flush at all if you suppress batching temporarily.
>> And it only needs (would need) explicit suppressing here because you can't
>> easily recognize being in the context of a page fault handler from the
>> batching functions (other than recognizing being in the context of an
>> interrupt handler, which is what would allow removing the flush calls from
>> highmem_32.c).
>
>I'm not sure what your concern is here. If batching is currently
>enabled, then the flush will push out anything pending immediately. If
>batching is disabled, then the flush will be a noop and return immediately.

Latency, as before. The page fault shouldn't have to take longer than it
really needs to, and the flushing of a pending batch clearly doesn't belong
to the page fault itself.

Jan

2008-11-18 18:00:59

by Jeremy Fitzhardinge

Subject: Re: arch_flush_lazy_mmu_mode() in arch/x86/mm/highmem_32.c

Jan Beulich wrote:
> If an interrupt (event) comes in, a multicall could of course be 'preempted'
> in order to service the event - but that works only if event delivery
> isn't disabled.
>

Well, if we made a local copy of the multicall queue, we could enable
interrupts before doing the multicall.
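
Along the lines of (sketch only, invented names): snapshot the queue
with interrupts off, then re-enable them before the potentially long
submission; an interrupt arriving meanwhile just starts filling a fresh
queue.

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define MC_MAX 32

struct multicall_entry { int op; unsigned long arg; };

static struct multicall_entry mc_queue[MC_MAX];
static int mc_count;
static bool irqs_enabled = true;

static bool irq_save(void)      { bool f = irqs_enabled; irqs_enabled = false; return f; }
static void irq_restore(bool f) { irqs_enabled = f; }

static void mc_flush(void)
{
    struct multicall_entry local[MC_MAX];
    bool flags;
    int n, i;

    /* Only the copy-out happens with interrupts off. */
    flags = irq_save();
    n = mc_count;
    memcpy(local, mc_queue, n * sizeof(local[0]));
    mc_count = 0;
    irq_restore(flags);

    /* The "hypercall" itself runs with interrupts enabled again. */
    for (i = 0; i < n; i++)
        printf("issue op=%d arg=%#lx\n", local[i].op, local[i].arg);
}

int main(void)
{
    mc_queue[mc_count++] = (struct multicall_entry){ 1, 0x1000 };
    mc_queue[mc_count++] = (struct multicall_entry){ 2, 0x2000 };
    mc_flush();
    return 0;
}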

>>> There's no reason to do any flush at all if you suppress batching temporarily.
>>> And it only needs (would need) explicit suppressing here because you can't
>>> easily recognize being in the context of a page fault handler from the
>>> batching functions (other than recognizing being in the context of an
>>> interrupt handler, which is what would allow removing the flush calls from
>>> highmem_32.c).
>>>
>> I'm not sure what your concern is here. If batching is currently
>> enabled, then the flush will push out anything pending immediately. If
>> batching is disabled, then the flush will be a noop and return immediately.
>>
>
> Latency, as before. The page fault shouldn't have to take longer than it
> really needs to, and the flushing of a pending batch clearly doesn't belong
> to the page fault itself.
>

Yes, I can see that. But in practice the batches are pretty small; the
cap is only 32 entries to start with, and there are very few operations
which can really get close to that. Large mprotects and munmaps of
present pages are the only way to make large batches, and they're
uncommon and expensive operations anyway.

J

2008-11-18 18:07:42

by Zachary Amsden

Subject: Re: arch_flush_lazy_mmu_mode() in arch/x86/mm/highmem_32.c

On Tue, 2008-11-18 at 09:28 -0800, Jan Beulich wrote:
> >>> Jeremy Fitzhardinge <[email protected]> 18.11.08 18:01 >>>

> Latency, as before. The page fault shouldn't have to take longer than it
> really needs to, and the flushing of a pending batch clearly doesn't belong
> to the page fault itself.

Page faults for vmalloc area syncing are extremely rare to begin with,
and only happen on non-PAE kernels (although perhaps also on Xen in
PAE mode, since the PMD isn't fully shared). Latency isn't an issue
there.

Latency could be added on interrupts which somehow end up in a
kmap_atomic path, but there are very restricted uses of this; glancing
around, I see ide_io_buffers, aio, USB DMA peeking, bounce buffers,
memory sticks, NTFS, a couple SCSI drivers...

Most of these are doing things like PIO or data copies... I'm sure there
are some hot paths here, such as aio, but do you really see an issue with
potentially having to process 32 queued multicalls? The latency can't be
that high. Do you have any statistics that show this latency to be a
problem?

Our measurements show that lazy mode batching rarely collects more than
a couple of updates; every once in a while you might get a blob of 32 or
so, but the common case is only a few. I really can't imagine a realistic
scenario where it would measurably affect performance to have to issue a
typically small flush in the already rare case that you happen to take an
interrupt inside an MMU batching region...

This whole thing is already pretty tricky to get right, and one could
even say a bit fragile; it has been a problematic source of bugs in the
past. I don't see how making it more complex than it already is will
help anyone. If anything, we should be looking to simplify it.

Zach