On Mon, 2009-06-15 at 14:07 +0000, tip-bot for Peter Zijlstra wrote:
> Commit-ID: 3ff0141aa3a03ca3388b40b36167d0a37919f3fd
> Gitweb: http://git.kernel.org/tip/3ff0141aa3a03ca3388b40b36167d0a37919f3fd
> Author: Peter Zijlstra <[email protected]>
> AuthorDate: Mon, 15 Jun 2009 12:40:41 +0200
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Mon, 15 Jun 2009 15:57:52 +0200
>
> x86: Add NMI types for kmap_atomic
>
> Two new kmap_atomic slots for NMI context. And teach pte_offset_map()
> about NMI context.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> CC: Nick Piggin <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
>
>
> ---
> arch/x86/include/asm/kmap_types.h | 4 +++-
> arch/x86/include/asm/pgtable_32.h | 5 +++--
> 2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/kmap_types.h b/arch/x86/include/asm/kmap_types.h
> index 5759c16..ff00a44 100644
> --- a/arch/x86/include/asm/kmap_types.h
> +++ b/arch/x86/include/asm/kmap_types.h
> @@ -21,7 +21,9 @@ D(9) KM_IRQ0,
> D(10) KM_IRQ1,
> D(11) KM_SOFTIRQ0,
> D(12) KM_SOFTIRQ1,
> -D(13) KM_TYPE_NR
> +D(13) KM_NMI,
> +D(14) KM_NMI_PTE,
> +D(15) KM_TYPE_NR
> };
>
> #undef D
> diff --git a/arch/x86/include/asm/pgtable_32.h b/arch/x86/include/asm/pgtable_32.h
> index 31bd120..8546497 100644
> --- a/arch/x86/include/asm/pgtable_32.h
> +++ b/arch/x86/include/asm/pgtable_32.h
> @@ -49,13 +49,14 @@ extern void set_pmd_pfn(unsigned long, unsigned long, pgprot_t);
> #endif
>
> #if defined(CONFIG_HIGHPTE)
> +#define __KM_PTE (in_nmi() ? KM_NMI_PTE : KM_PTE0)
> #define pte_offset_map(dir, address) \
> - ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE0) + \
> + ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), __KM_PTE) + \
> pte_index((address)))
> #define pte_offset_map_nested(dir, address) \
> ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), KM_PTE1) + \
> pte_index((address)))
> -#define pte_unmap(pte) kunmap_atomic((pte), KM_PTE0)
> +#define pte_unmap(pte) kunmap_atomic((pte), __KM_PTE)
> #define pte_unmap_nested(pte) kunmap_atomic((pte), KM_PTE1)
> #else
> #define pte_offset_map(dir, address) \
I just realized this has a kmap_atomic bug in...
The below would fix it, but that's getting rather ugly :-/,
alternatively I would have to introduce something like
pte_offset_map_irq() which would make the irq/nmi detection and leave
the regular code paths alone, however that would mean either duplicating
the gup_fast() pagewalk or passing down a pte function pointer, which
would only duplicate the gup_pte_range() bit, neither is really
attractive...
Index: linux-2.6/arch/x86/include/asm/kmap_types.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/kmap_types.h
+++ linux-2.6/arch/x86/include/asm/kmap_types.h
@@ -19,11 +19,12 @@ D(7) KM_PTE0,
D(8) KM_PTE1,
D(9) KM_IRQ0,
D(10) KM_IRQ1,
-D(11) KM_SOFTIRQ0,
-D(12) KM_SOFTIRQ1,
-D(13) KM_NMI,
-D(14) KM_NMI_PTE,
-D(15) KM_TYPE_NR
+D(11) KM_IRQ_PTE,
+D(12) KM_SOFTIRQ0,
+D(13) KM_SOFTIRQ1,
+D(14) KM_NMI,
+D(15) KM_NMI_PTE,
+D(16) KM_TYPE_NR
};
#undef D
Index: linux-2.6/arch/x86/include/asm/pgtable_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable_32.h
+++ linux-2.6/arch/x86/include/asm/pgtable_32.h
@@ -49,7 +49,10 @@ extern void set_pmd_pfn(unsigned long, u
#endif
#if defined(CONFIG_HIGHPTE)
-#define __KM_PTE (in_nmi() ? KM_NMI_PTE : KM_PTE0)
+#define __KM_PTE \
+ (in_nmi() ? KM_NMI_PTE : \
+ in_irq() ? KM_IRQ_PTE : \
+ KM_PTE0)
#define pte_offset_map(dir, address) \
((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), __KM_PTE) + \
pte_index((address)))
On Mon, 15 Jun 2009, Peter Zijlstra wrote:
>
> The below would fix it, but that's getting rather ugly :-/,
> alternatively I would have to introduce something like
> pte_offset_map_irq() which would make the irq/nmi detection and leave
> the regular code paths alone, however that would mean either duplicating
> the gup_fast() pagewalk or passing down a pte function pointer, which
> would only duplicate the gup_pte_range() bit, neither is really
> attractive...
> Index: linux-2.6/arch/x86/include/asm/pgtable_32.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/pgtable_32.h
> +++ linux-2.6/arch/x86/include/asm/pgtable_32.h
> @@ -49,7 +49,10 @@ extern void set_pmd_pfn(unsigned long, u
> #endif
>
> #if defined(CONFIG_HIGHPTE)
> -#define __KM_PTE (in_nmi() ? KM_NMI_PTE : KM_PTE0)
> +#define __KM_PTE \
> + (in_nmi() ? KM_NMI_PTE : \
> + in_irq() ? KM_IRQ_PTE : \
> + KM_PTE0)
> #define pte_offset_map(dir, address) \
> ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), __KM_PTE) + \
> pte_index((address)))
Yes, that does look ugly!
I've not been following the background to this, but I've often
wondered if a kmap_push() and kmap_pop() could be useful,
allowing you to reuse the slot in between - any use here?
Hugh
On Mon, 2009-06-15 at 16:30 +0100, Hugh Dickins wrote:
> On Mon, 15 Jun 2009, Peter Zijlstra wrote:
> >
> > The below would fix it, but that's getting rather ugly :-/,
> > alternatively I would have to introduce something like
> > pte_offset_map_irq() which would make the irq/nmi detection and leave
> > the regular code paths alone, however that would mean either duplicating
> > the gup_fast() pagewalk or passing down a pte function pointer, which
> > would only duplicate the gup_pte_range() bit, neither is really
> > attractive...
>
> > Index: linux-2.6/arch/x86/include/asm/pgtable_32.h
> > ===================================================================
> > --- linux-2.6.orig/arch/x86/include/asm/pgtable_32.h
> > +++ linux-2.6/arch/x86/include/asm/pgtable_32.h
> > @@ -49,7 +49,10 @@ extern void set_pmd_pfn(unsigned long, u
> > #endif
> >
> > #if defined(CONFIG_HIGHPTE)
> > -#define __KM_PTE (in_nmi() ? KM_NMI_PTE : KM_PTE0)
> > +#define __KM_PTE \
> > + (in_nmi() ? KM_NMI_PTE : \
> > + in_irq() ? KM_IRQ_PTE : \
> > + KM_PTE0)
> > #define pte_offset_map(dir, address) \
> > ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), __KM_PTE) + \
> > pte_index((address)))
>
> Yes, that does look ugly!
>
> I've not been following the background to this,
We need/want to do a user-space stack walk from IRQ/NMI context. The NMI
bit means we cannot simply use __copy_from_user_inatomic() since that
will still fault (albeit not page), and the fault return path invokes
IRET which will terminate the NMI context.
Therefore I wrote a copy_from_user_nmi() variant that is based off of
__get_user_pages_fast() (a variant that doesn't fall back to the regular
GUP), but that means we get 2 kmap_atomic()s, one for HIGHPTE and one
for the user page.
So this introduces the pte map from IRQ context and one from NMI
context.
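[ For illustration, a rough sketch of what such a copy_from_user_nmi()
  could look like, built on __get_user_pages_fast() plus a kmap_atomic()
  of the user page as described above. This is a reconstruction for the
  reader, not the submitted patch; the exact signature and the KM_ type
  used are assumptions: ]

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/uaccess.h>

static unsigned long
copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
{
        unsigned long offset, addr = (unsigned long)from;
        unsigned long size, len = 0;
        struct page *page;
        void *map;
        int ret;

        do {
                /* fast GUP only: never fall back to the faulting path */
                ret = __get_user_pages_fast(addr, 1, 0, &page);
                if (!ret)
                        break;

                offset = addr & (PAGE_SIZE - 1);
                size = min(PAGE_SIZE - offset, n - len);

                /*
                 * Second kmap_atomic(): the first one is the HIGHPTE
                 * map done inside the GUP pagewalk.
                 */
                map = kmap_atomic(page, KM_NMI);
                memcpy(to, map + offset, size);
                kunmap_atomic(map, KM_NMI);
                put_page(page);

                len  += size;
                to   += size;
                addr += size;
        } while (len < n);

        return len;
}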
> but I've often
> wondered if a kmap_push() and kmap_pop() could be useful,
> allowing you to reuse the slot in between - any use here?
Yes, that would be much nicer, although we would lose some of the
type validation that lives in -mm (along with requiring a massive overhaul
of the current kmap_atomic usage).
Hmm, if we give each explicit type a level and ensure the new push()'ed
type's level <= the previous one, we'd still have the full nesting
validation and such...
I'll look into doing this.
* Peter Zijlstra <[email protected]> wrote:
> > I've not been following the background to this,
>
> We need/want to do a user-space stack walk from IRQ/NMI context.
> The NMI bit means we cannot simply use __copy_from_user_inatomic()
> since that will still fault (albeit not page), and the fault
> return path invokes IRET which will terminate the NMI context.
The goal is to allow 'perf' (see tools/perf/) to do non-flat
categorizations like the sample output in the (pending) commit (see
it attached further below). Here's the kind of output it allows:
$ perf record -g -m 512 -f -- make -j32 kernel
$ perf report -s s --syscalls | grep '\[k\]' | grep -v nmi
4.14% [k] do_page_fault
1.20% [k] sys_write
1.10% [k] sys_open
0.63% [k] sys_exit_group
0.48% [k] smp_apic_timer_interrupt
0.37% [k] sys_read
0.37% [k] sys_execve
0.20% [k] sys_mmap
0.18% [k] sys_close
0.14% [k] sys_munmap
0.13% [k] sys_poll
Note that Oprofile uses the same method of __copy_user_inatomic() in
arch/x86/oprofile/backtrace.c, but i believe that code is broken - i
doubt the call-chain support for user-space stacks ever worked in
oprofile - with perfcounters i can make this method crash under
load. (we re-enter the NMI which due to IST executes over the exact
same, still pending NMI frame. Kaboom.)
I saw you being involved with the Oprofile code 3 years ago:
| commit c34d1b4d165c67b966bca4aba026443d7ff161eb
| Author: Hugh Dickins <[email protected]>
| Date: Sat Oct 29 18:16:32 2005 -0700
|
| [PATCH] mm: kill check_user_page_readable
That method of __copy_user_inatomic(), while elegant, is subtly
wrong in an NMI context. We really must avoid taking faults there.
Ingo
------------>
From 3dfabc74c65904c9e6cf952391312d16ea772ef5 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Mon, 15 Jun 2009 11:24:38 +0200
Subject: [PATCH] perf report: Add per system call overhead histogram
Take advantage of call-graph perfcounter sampling/recording to
display a non-trivial histogram: the true, collapsed/summarized
cost measurement, on a per system call total overhead basis:
aldebaran:~/linux/linux/tools/perf> ./perf record -g -a -f ~/hackbench 10
aldebaran:~/linux/linux/tools/perf> ./perf report -s symbol --syscalls | head -10
#
# (3536 samples)
#
# Overhead Symbol
# ........ ......
#
40.75% [k] sys_write
40.21% [k] sys_read
4.44% [k] do_nmi
...
This is done by accounting each (reliable) call-chain that chains back
to a given system call to that system call function.
[ So in the above example we can see that hackbench spends about 40% of
its total time somewhere in sys_write() and 40% somewhere in
sys_read(), the rest of the time is spent in user-space. The time
is not spent in sys_write() _itself_ but in one of its many child
functions. ]
Or, a recording of a (source files are already in the page-cache) kernel build:
$ perf record -g -m 512 -f -- make -j32 kernel
$ perf report -s s --syscalls | grep '\[k\]' | grep -v nmi
4.14% [k] do_page_fault
1.20% [k] sys_write
1.10% [k] sys_open
0.63% [k] sys_exit_group
0.48% [k] smp_apic_timer_interrupt
0.37% [k] sys_read
0.37% [k] sys_execve
0.20% [k] sys_mmap
0.18% [k] sys_close
0.14% [k] sys_munmap
0.13% [k] sys_poll
0.09% [k] sys_newstat
0.07% [k] sys_clone
0.06% [k] sys_newfstat
0.05% [k] sys_access
0.05% [k] schedule
Shows the true total cost of each syscall variant that gets used
during a kernel build. This profile reveals that pagefaults are
the costliest, followed by read()/write().
An interesting detail: timer interrupts cost 0.5% - or 0.5 seconds
per 100 seconds of kernel build-time. (this was done with HZ=1000)
The summary is done in 'perf report', i.e. in the post-processing
stage - so once we have a good call-graph recording, this type of
non-trivial high-level analysis becomes possible.
Cc: Peter Zijlstra <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Pekka Enberg <[email protected]>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <[email protected]>
---
tools/perf/builtin-report.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index aebba56..1e2f5dd 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -40,6 +40,7 @@ static int dump_trace = 0;
static int verbose;
static int full_paths;
+static int collapse_syscalls;
static unsigned long page_size;
static unsigned long mmap_window = 32;
@@ -983,6 +984,15 @@ process_overflow_event(event_t *event, unsigned long offset, unsigned long head)
for (i = 0; i < chain->nr; i++)
dprintf("..... %2d: %p\n", i, (void *)chain->ips[i]);
}
+ if (collapse_syscalls) {
+ /*
+ * Find the all-but-last kernel entry
+ * amongst the call-chains - to get
+ * to the level of system calls:
+ */
+ if (chain->kernel >= 2)
+ ip = chain->ips[chain->kernel-2];
+ }
}
dprintf(" ... thread: %s:%d\n", thread->comm, thread->pid);
@@ -1343,6 +1353,8 @@ static const struct option options[] = {
"sort by key(s): pid, comm, dso, symbol. Default: pid,symbol"),
OPT_BOOLEAN('P', "full-paths", &full_paths,
"Don't shorten the pathnames taking into account the cwd"),
+ OPT_BOOLEAN('S', "syscalls", &collapse_syscalls,
+ "show per syscall summary overhead, using call graph"),
OPT_END()
};
On Mon, 15 Jun 2009, Ingo Molnar wrote:
>
> Note that Oprofile uses the same method of __copy_user_inatomic() in
> arch/x86/oprofile/backtrace.c, but i believe that code is broken - i
> doubt the call-chain support for user-space stacks ever worked in
> oprofile - with perfcounters i can make this method crash under
> load. (we re-enter the NMI which due to IST executes over the exact
> same, still pending NMI frame. Kaboom.)
>
> I saw you being involved with the Oprofile code 3 years ago:
>
> | commit c34d1b4d165c67b966bca4aba026443d7ff161eb
> | Author: Hugh Dickins <[email protected]>
> | Date: Sat Oct 29 18:16:32 2005 -0700
> |
> | [PATCH] mm: kill check_user_page_readable
>
> That method of __copy_user_inatomic(), while elegant, is subtly
> wrong in an NMI context. We really must avoid taking faults there.
Yes, I'm afraid that subtlety escaped me - thanks for explaining.
Hugh
On Mon, 2009-06-15 at 17:41 +0200, Peter Zijlstra wrote:
> On Mon, 2009-06-15 at 16:30 +0100, Hugh Dickins wrote:
> > On Mon, 15 Jun 2009, Peter Zijlstra wrote:
> > >
> > > The below would fix it, but that's getting rather ugly :-/,
> > > alternatively I would have to introduce something like
> > > pte_offset_map_irq() which would make the irq/nmi detection and leave
> > > the regular code paths alone, however that would mean either duplicating
> > > the gup_fast() pagewalk or passing down a pte function pointer, which
> > > would only duplicate the gup_pte_range() bit, neither is really
> > > attractive...
> >
> > > Index: linux-2.6/arch/x86/include/asm/pgtable_32.h
> > > ===================================================================
> > > --- linux-2.6.orig/arch/x86/include/asm/pgtable_32.h
> > > +++ linux-2.6/arch/x86/include/asm/pgtable_32.h
> > > @@ -49,7 +49,10 @@ extern void set_pmd_pfn(unsigned long, u
> > > #endif
> > >
> > > #if defined(CONFIG_HIGHPTE)
> > > -#define __KM_PTE (in_nmi() ? KM_NMI_PTE : KM_PTE0)
> > > +#define __KM_PTE \
> > > + (in_nmi() ? KM_NMI_PTE : \
> > > + in_irq() ? KM_IRQ_PTE : \
> > > + KM_PTE0)
> > > #define pte_offset_map(dir, address) \
> > > ((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), __KM_PTE) + \
> > > pte_index((address)))
> >
> > Yes, that does look ugly!
> >
> > I've not been following the background to this,
>
> We need/want to do a user-space stack walk from IRQ/NMI context. The NMI
> bit means we cannot simply use __copy_from_user_inatomic() since that
> will still fault (albeit not page), and the fault return path invokes
> IRET which will terminate the NMI context.
>
> Therefore I wrote a copy_from_user_nmi() variant that is based off of
> __get_user_pages_fast() (a variant that doesn't fall back to the regular
> GUP), but that means we get 2 kmap_atomic()s, one for HIGHPTE and one
> for the user page.
>
> So this introduces the pte map from IRQ context and one from NMI
> context.
>
> > but I've often
> > wondered if a kmap_push() and kmap_pop() could be useful,
> > allowing you to reuse the slot in between - any use here?
>
> Yes, that would be much nicer, although we would lose some of the
> type validation that lives in -mm (along with requiring a massive overhaul
> of the current kmap_atomic usage).
>
> Hmm, if we give each explicit type a level and ensure the new push()'ed
> type's level <= the previous one, we'd still have the full nesting
> validation and such..
>
> I'll look into doing this.
OK, utterly untested, but how does this look?
Not-Signed-off-by: Peter Zijlstra <[email protected]>
---
diff --git a/arch/x86/include/asm/kmap_types.h b/arch/x86/include/asm/kmap_types.h
index ff00a44..f866138 100644
--- a/arch/x86/include/asm/kmap_types.h
+++ b/arch/x86/include/asm/kmap_types.h
@@ -19,11 +19,12 @@ D(7) KM_PTE0,
D(8) KM_PTE1,
D(9) KM_IRQ0,
D(10) KM_IRQ1,
-D(11) KM_SOFTIRQ0,
-D(12) KM_SOFTIRQ1,
-D(13) KM_NMI,
-D(14) KM_NMI_PTE,
-D(15) KM_TYPE_NR
+D(11) KM_IRQ_PTE,
+D(12) KM_SOFTIRQ0,
+D(13) KM_SOFTIRQ1,
+D(14) KM_NMI,
+D(15) KM_NMI_PTE,
+D(16) KM_TYPE_NR
};
#undef D
diff --git a/arch/x86/include/asm/pgtable_32.h b/arch/x86/include/asm/pgtable_32.h
index 8546497..d31930a 100644
--- a/arch/x86/include/asm/pgtable_32.h
+++ b/arch/x86/include/asm/pgtable_32.h
@@ -49,7 +49,20 @@ extern void set_pmd_pfn(unsigned long, unsigned long, pgprot_t);
#endif
#if defined(CONFIG_HIGHPTE)
-#define __KM_PTE (in_nmi() ? KM_NMI_PTE : KM_PTE0)
+
+#ifdef CONFIG_DEBUG_VM
+/*
+ * for debug we need the right type so that we can validate context
+ * nesting
+ */
+#define __KM_PTE \
+ (in_nmi() ? KM_NMI_PTE : \
+ in_irq() ? KM_IRQ_PTE : \
+ KM_PTE0)
+#else
+#define __KM_PTE KM_PTE0
+#endif
+
#define pte_offset_map(dir, address) \
((pte_t *)kmap_atomic_pte(pmd_page(*(dir)), __KM_PTE) + \
pte_index((address)))
diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
index 58f621e..62d15f7 100644
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -19,6 +19,147 @@ void kunmap(struct page *page)
kunmap_high(page);
}
+struct kmap_atomic_state {
+ int slot;
+#ifdef CONFIG_DEBUG_VM
+ int taken[KM_TYPE_NR];
+ int context;
+#endif
+};
+
+#ifdef CONFIG_DEBUG_VM
+enum km_context {
+ KM_CTX_USER,
+ KM_CTX_SOFTIRQ,
+ KM_CTX_IRQ,
+ KM_CTX_NMI,
+
+ KM_CTX_MAX
+};
+
+static int kmap_type_to_context(enum km_type type)
+{
+ switch (type) {
+ case KM_BOUNCE_READ:
+ return KM_CTX_USER;
+ case KM_SKB_SUNRPC_DATA:
+ return KM_CTX_USER;
+ case KM_SKB_DATA_SOFTIRQ:
+ return KM_CTX_SOFTIRQ;
+ case KM_USER0:
+ return KM_CTX_USER;
+ case KM_USER1:
+ return KM_CTX_USER;
+ case KM_BIO_SRC_IRQ:
+ return KM_CTX_IRQ;
+ case KM_BIO_DST_IRQ:
+ return KM_CTX_IRQ;
+ case KM_PTE0:
+ return KM_CTX_USER;
+ case KM_PTE1:
+ return KM_CTX_USER;
+ case KM_IRQ0:
+ return KM_CTX_IRQ;
+ case KM_IRQ1:
+ return KM_CTX_IRQ;
+ case KM_SOFTIRQ0:
+ return KM_CTX_SOFTIRQ;
+ case KM_SOFTIRQ1:
+ return KM_CTX_SOFTIRQ;
+ case KM_NMI:
+ return KM_CTX_NMI;
+ case KM_NMI_PTE:
+ return KM_CTX_NMI;
+ }
+
+ return KM_CTX_MAX;
+}
+
+static void
+kmap_atomic_push_debug(struct kmap_atomic_state *state, enum km_type type)
+{
+ int context = kmap_type_to_context(type);
+ int i;
+
+ for (i = 0; i < state->slot; i++)
+ WARN_ON_ONCE(state->taken[i] == type);
+
+ WARN_ON_ONCE(state->context > context);
+
+ if (context > state->context)
+ state->context = context;
+
+ switch (context) {
+ case KM_CTX_USER:
+ WARN_ON_ONCE(in_irq());
+ WARN_ON_ONCE(in_nmi());
+ break;
+
+ case KM_CTX_SOFTIRQ:
+ WARN_ON_ONCE(in_irq());
+ WARN_ON_ONCE(in_nmi());
+ break;
+
+ case KM_CTX_IRQ:
+ WARN_ON_ONCE(in_nmi());
+ break;
+
+ case KM_CTX_NMI:
+ break;
+
+ case KM_CTX_MAX:
+ WARN_ON_ONCE(1);
+ break;
+ }
+}
+
+static void
+kmap_atomic_pop_debug(struct kmap_atomic_state *state, enum km_type type)
+{
+ WARN_ON_ONCE(state->taken[state->slot] != type);
+
+ if (!state->slot) {
+ state->context = KM_CTX_USER;
+ return;
+ }
+
+ state->context = kmap_type_to_context(state->taken[state->slot - 1]);
+}
+
+#else /* !CONFIG_DEBUG_VM */
+static inline void
+kmap_atomic_push_debug(struct kmap_atomic_state *state, enum km_type type)
+{
+}
+
+static inline void
+kmap_atomic_pop_debug(struct kmap_atomic_state *state, enum km_type type)
+{
+}
+#endif
+
+static DEFINE_PER_CPU(struct kmap_atomic_state, kmap_state);
+
+static int kmap_atomic_push(enum km_type type)
+{
+ struct kmap_atomic_state *state = &per_cpu(kmap_state);
+
+ kmap_atomic_push_debug(state, type);
+
+ return state->slot++;
+}
+
+static int kmap_atomic_pop(enum km_type type)
+{
+ struct kmap_atomic_state *state = &per_cpu(kmap_state);
+
+ state->slot--;
+
+ kmap_atomic_pop_debug(state, type);
+
+ return state->slot;
+}
+
/*
* kmap_atomic/kunmap_atomic is significantly faster than kmap/kunmap because
* no global lock is needed and because the kmap code must perform a global TLB
@@ -38,9 +179,7 @@ void *kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot)
if (!PageHighMem(page))
return page_address(page);
- debug_kmap_atomic(type);
-
- idx = type + KM_TYPE_NR*smp_processor_id();
+ idx = KM_TYPE_NR*smp_processor_id() + kmap_atomic_push(type);
vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
BUG_ON(!pte_none(*(kmap_pte-idx)));
set_pte(kmap_pte-idx, mk_pte(page, prot));
@@ -56,8 +195,9 @@ void *kmap_atomic(struct page *page, enum km_type type)
void kunmap_atomic(void *kvaddr, enum km_type type)
{
unsigned long vaddr = (unsigned long) kvaddr & PAGE_MASK;
- enum fixed_addresses idx = type + KM_TYPE_NR*smp_processor_id();
+ enum fixed_addresses idx = KM_TYPE_NR*smp_processor_id();
+ idx += kmap_atomic_pop(type);
/*
* Force other mappings to Oops if they'll try to access this pte
* without first remap it. Keeping stale mappings around is a bad idea
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 1fcb712..92123b9 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -21,18 +21,6 @@ static inline void flush_kernel_dcache_page(struct page *page)
#include <asm/kmap_types.h>
-#if defined(CONFIG_DEBUG_HIGHMEM) && defined(CONFIG_TRACE_IRQFLAGS_SUPPORT)
-
-void debug_kmap_atomic(enum km_type type);
-
-#else
-
-static inline void debug_kmap_atomic(enum km_type type)
-{
-}
-
-#endif
-
#ifdef CONFIG_HIGHMEM
#include <asm/highmem.h>
diff --git a/mm/highmem.c b/mm/highmem.c
index 68eb1d9..9101980 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -422,48 +422,3 @@ void __init page_address_init(void)
}
#endif /* defined(CONFIG_HIGHMEM) && !defined(WANT_PAGE_VIRTUAL) */
-
-#if defined(CONFIG_DEBUG_HIGHMEM) && defined(CONFIG_TRACE_IRQFLAGS_SUPPORT)
-
-void debug_kmap_atomic(enum km_type type)
-{
- static unsigned warn_count = 10;
-
- if (unlikely(warn_count == 0))
- return;
-
- if (unlikely(in_interrupt())) {
- if (in_irq()) {
- if (type != KM_IRQ0 && type != KM_IRQ1 &&
- type != KM_BIO_SRC_IRQ && type != KM_BIO_DST_IRQ &&
- type != KM_BOUNCE_READ) {
- WARN_ON(1);
- warn_count--;
- }
- } else if (!irqs_disabled()) { /* softirq */
- if (type != KM_IRQ0 && type != KM_IRQ1 &&
- type != KM_SOFTIRQ0 && type != KM_SOFTIRQ1 &&
- type != KM_SKB_SUNRPC_DATA &&
- type != KM_SKB_DATA_SOFTIRQ &&
- type != KM_BOUNCE_READ) {
- WARN_ON(1);
- warn_count--;
- }
- }
- }
-
- if (type == KM_IRQ0 || type == KM_IRQ1 || type == KM_BOUNCE_READ ||
- type == KM_BIO_SRC_IRQ || type == KM_BIO_DST_IRQ) {
- if (!irqs_disabled()) {
- WARN_ON(1);
- warn_count--;
- }
- } else if (type == KM_SOFTIRQ0 || type == KM_SOFTIRQ1) {
- if (irq_count() == 0 && !irqs_disabled()) {
- WARN_ON(1);
- warn_count--;
- }
- }
-}
-
-#endif
* Peter Zijlstra <[email protected]> wrote:
> +static int kmap_type_to_context(enum km_type type)
> +{
> + switch (type) {
> + case KM_BOUNCE_READ:
> + return KM_CTX_USER;
> + case KM_SKB_SUNRPC_DATA:
> + return KM_CTX_USER;
> + case KM_SKB_DATA_SOFTIRQ:
> + return KM_CTX_SOFTIRQ;
> + case KM_USER0:
> + return KM_CTX_USER;
> + case KM_USER1:
> + return KM_CTX_USER;
> + case KM_BIO_SRC_IRQ:
> + return KM_CTX_IRQ;
> + case KM_BIO_DST_IRQ:
> + return KM_CTX_IRQ;
> + case KM_PTE0:
> + return KM_CTX_USER;
> + case KM_PTE1:
> + return KM_CTX_USER;
> + case KM_IRQ0:
> + return KM_CTX_IRQ;
> + case KM_IRQ1:
> + return KM_CTX_IRQ;
> + case KM_SOFTIRQ0:
> + return KM_CTX_SOFTIRQ;
> + case KM_SOFTIRQ1:
> + return KM_CTX_SOFTIRQ;
> + case KM_NMI:
> + return KM_CTX_NMI;
> + case KM_NMI_PTE:
> + return KM_CTX_NMI;
> + }
> +
> + return KM_CTX_MAX;
why not do a very simple stack of atomic kmaps, like Hugh suggested?
That would mean a much simpler interface:
kaddr = kmap_atomic(page);
no index needed. The kunmap pops the entry off the stack:
kunmap_atomic(kaddr);
This becomes simpler too.
Now, a stack can be overflown by imbalance - but that's easy to
detect and existing entries are easily printed and thus the source
of the leak is easily identified.
In my book this design beats the current enumeration of kmap type
indices hands down ... It would likely be much more robust as well,
and much easier to extend.
Am i missing any subtlety?
Ingo
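[ To make this concrete: a rough, illustrative-only sketch of such a
  stack-based kmap_atomic()/kunmap_atomic() pair for 32-bit x86, where a
  per-CPU depth counter replaces the explicit KM_* index. Names and
  details are assumptions, not submitted code: ]

#include <linux/highmem.h>
#include <linux/percpu.h>
#include <linux/uaccess.h>
#include <asm/fixmap.h>

static DEFINE_PER_CPU(int, kmap_atomic_depth);

void *kmap_atomic(struct page *page)
{
        unsigned long vaddr;
        int depth, idx;

        pagefault_disable();
        if (!PageHighMem(page))
                return page_address(page);

        /* push: take the next free slot on this CPU's stack */
        depth = __get_cpu_var(kmap_atomic_depth)++;
        BUG_ON(depth >= KM_TYPE_NR);    /* unbalanced maps overflow here */

        idx = depth + KM_TYPE_NR * smp_processor_id();
        vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
        BUG_ON(!pte_none(*(kmap_pte - idx)));
        set_pte(kmap_pte - idx, mk_pte(page, kmap_prot));
        arch_flush_lazy_mmu_mode();

        return (void *)vaddr;
}

void kunmap_atomic(void *kvaddr)
{
        unsigned long vaddr = (unsigned long)kvaddr & PAGE_MASK;

        /* only highmem maps live in the fixmap area and need popping */
        if (vaddr >= __fix_to_virt(FIX_KMAP_END) &&
            vaddr <= __fix_to_virt(FIX_KMAP_BEGIN)) {
                /* pop: the address must be the top of this CPU's stack */
                int idx = --__get_cpu_var(kmap_atomic_depth) +
                          KM_TYPE_NR * smp_processor_id();

                WARN_ON_ONCE(vaddr != __fix_to_virt(FIX_KMAP_BEGIN + idx));
                kpte_clear_flush(kmap_pte - idx, vaddr);
        }
        pagefault_enable();
}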
On Mon, 2009-06-15 at 20:15 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > +static int kmap_type_to_context(enum km_type type)
> > +{
> > + switch (type) {
> > + case KM_BOUNCE_READ:
> > + return KM_CTX_USER;
> > + case KM_SKB_SUNRPC_DATA:
> > + return KM_CTX_USER;
> > + case KM_SKB_DATA_SOFTIRQ:
> > + return KM_CTX_SOFTIRQ;
> > + case KM_USER0:
> > + return KM_CTX_USER;
> > + case KM_USER1:
> > + return KM_CTX_USER;
> > + case KM_BIO_SRC_IRQ:
> > + return KM_CTX_IRQ;
> > + case KM_BIO_DST_IRQ:
> > + return KM_CTX_IRQ;
> > + case KM_PTE0:
> > + return KM_CTX_USER;
> > + case KM_PTE1:
> > + return KM_CTX_USER;
> > + case KM_IRQ0:
> > + return KM_CTX_IRQ;
> > + case KM_IRQ1:
> > + return KM_CTX_IRQ;
> > + case KM_SOFTIRQ0:
> > + return KM_CTX_SOFTIRQ;
> > + case KM_SOFTIRQ1:
> > + return KM_CTX_SOFTIRQ;
> > + case KM_NMI:
> > + return KM_CTX_NMI;
> > + case KM_NMI_PTE:
> > + return KM_CTX_NMI;
> > + }
> > +
> > + return KM_CTX_MAX;
>
> why not do a very simple stack of atomic kmaps, like Hugh suggested?
>
> That would mean a much simpler interface:
>
> kaddr = kmap_atomic(page);
>
> no index needed. The kunmap pops the entry off the stack:
>
> kunmap_atomic(kaddr);
>
> This becomes simpler too.
>
> Now, a stack can be overflown by imbalance - but that's easy to
> detect and existing entries are easily printed and thus the source
> of the leak is easily identified.
>
> In my book this design beats the current enumeration of kmap types
> indices hands down ... It would likely be much more robust as well,
> and much more easy to extend.
>
> Am i missing any subtlety?
The above is mostly debug code used to validate the kmap_atomic
conditions.
KM_CTX_NMI nests in KM_CTX_IRQ, which nests in KM_CTX_SOFTIRQ, which nests
in KM_CTX_USER.
It also validates that we are indeed in the context specified by the type.
That is, it will warn if we use KM_IRQ1 (a KM_CTX_IRQ type) from user
context.
Some of this was already captured in the old kmap debug code which I
removed.
But yes, I should write that nicer..
/me goes clean up
* Peter Zijlstra <[email protected]> wrote:
> On Mon, 2009-06-15 at 20:15 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > +static int kmap_type_to_context(enum km_type type)
> > > +{
> > > + switch (type) {
> > > + case KM_BOUNCE_READ:
> > > + return KM_CTX_USER;
> > > + case KM_SKB_SUNRPC_DATA:
> > > + return KM_CTX_USER;
> > > + case KM_SKB_DATA_SOFTIRQ:
> > > + return KM_CTX_SOFTIRQ;
> > > + case KM_USER0:
> > > + return KM_CTX_USER;
> > > + case KM_USER1:
> > > + return KM_CTX_USER;
> > > + case KM_BIO_SRC_IRQ:
> > > + return KM_CTX_IRQ;
> > > + case KM_BIO_DST_IRQ:
> > > + return KM_CTX_IRQ;
> > > + case KM_PTE0:
> > > + return KM_CTX_USER;
> > > + case KM_PTE1:
> > > + return KM_CTX_USER;
> > > + case KM_IRQ0:
> > > + return KM_CTX_IRQ;
> > > + case KM_IRQ1:
> > > + return KM_CTX_IRQ;
> > > + case KM_SOFTIRQ0:
> > > + return KM_CTX_SOFTIRQ;
> > > + case KM_SOFTIRQ1:
> > > + return KM_CTX_SOFTIRQ;
> > > + case KM_NMI:
> > > + return KM_CTX_NMI;
> > > + case KM_NMI_PTE:
> > > + return KM_CTX_NMI;
> > > + }
> > > +
> > > + return KM_CTX_MAX;
> >
> > why not do a very simple stack of atomic kmaps, like Hugh suggested?
> >
> > That would mean a much simpler interface:
> >
> > kaddr = kmap_atomic(page);
> >
> > no index needed. The kunmap pops the entry off the stack:
> >
> > kunmap_atomic(kaddr);
> >
> > This becomes simpler too.
> >
> > Now, a stack can be overflown by imbalance - but that's easy to
> > detect and existing entries are easily printed and thus the source
> > of the leak is easily identified.
> >
> > In my book this design beats the current enumeration of kmap types
> > indices hands down ... It would likely be much more robust as well,
> > and much more easy to extend.
> >
> > Am i missing any subtlety?
>
> The above is mostly debug code used to validate the kmap_atomic
> conditions.
>
> KM_CTX_NMI nests in KM_CTX_IRQ nests in KM_CTX_SOFTIRQ nests in
> KM_CTX_USER.
>
> And validate that we indeed are in the context specified by the type.
> That is, it will warn if we use KM_IRQ1 with KM_CTX_IRQ from user
> context.
>
> Some of this was already captured in the old kmap debug code which I
> removed.
>
> But yes, I should write that nicer..
but ... look at the APIs i propose above. We dont need _any_
'types'.
That type enumeration is basically an open-coded allocator. If we do
a _real_ allocator (a balanced stack of atomic kmaps) we dont need
any of those indices, and all the potential for mismatch goes away
as well - a stack nests trivially with IRQ and NMI and arbitrary
other contexts.
Ingo
On Mon, 2009-06-15 at 20:25 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
> but ... look at the APIs i propose above. We dont need _any_
> 'types'.
>
> That type enumeration is basically an open-coded allocator. If we do
> a _real_ allocator (a balanced stack of atomic kmaps) we dont need
> any of those indices, and all the potential for mismatch goes away
> as well - a stack nests trivially with IRQ and NMI and arbitrary
> other contexts.
You want types because:
- they encode the intent, and can be verified
- they help keep track of the max nesting depth
In the proposed implementation all type code basically falls away under
!CONFIG_DEBUG_VM, but is kept around for robustness.
* Peter Zijlstra <[email protected]> wrote:
> On Mon, 2009-06-15 at 20:25 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
>
> > but ... look at the APIs i propose above. We dont need _any_
> > 'types'.
> >
> > That type enumeration is basically an open-coded allocator. If we do
> > a _real_ allocator (a balanced stack of atomic kmaps) we dont need
> > any of those indices, and all the potential for mismatch goes away
> > as well - a stack nests trivially with IRQ and NMI and arbitrary
> > other contexts.
>
> You want types because:
> - they encode the intent, and can be verified
> - they help keep track of the max nesting depth
>
> In the proposed implementation all type code basically falls away
> under !CONFIG_DEBUG_VM, but is kept around for robustness.
But much of the fragility of the types (and their clumsiness - for
example in highpte ops we have to know at which level of the
pagetables we are, and use the right kind of index) is _precisely_
because we have the types ...
Unbalanced unmaps will be detected under CONFIG_DEBUG_VM: kunmap
uses 'page' as a parameter which is checked against the pte entry -
they must match.
I.e. it becomes a symmetric and expressive API:
kaddr = kmap_atomic(page);
...
kunmap_atomic(page);
Hm?
Ingo
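[ A hypothetical sketch (not actual kernel code) of the page-matching
  check suggested here, reusing the per-CPU depth counter from the
  earlier sketch: under CONFIG_DEBUG_VM the popped fixmap slot is
  verified to still map exactly the page handed back in: ]

void kunmap_atomic(struct page *page)
{
        if (PageHighMem(page)) {
                int idx = --__get_cpu_var(kmap_atomic_depth) +
                          KM_TYPE_NR * smp_processor_id();
                unsigned long vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);

#ifdef CONFIG_DEBUG_VM
                /* an unbalanced or misordered unmap trips this check */
                WARN_ON_ONCE(pte_none(*(kmap_pte - idx)) ||
                             pte_page(*(kmap_pte - idx)) != page);
#endif
                kpte_clear_flush(kmap_pte - idx, vaddr);
        }
        pagefault_enable();
}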
On Mon, 15 Jun 2009 20:04:25 +0200
Peter Zijlstra <[email protected]> wrote:
> OK, utterly untested, but how does this look?
arch/x86/include/asm/kmap_types.h | 11 +-
arch/x86/include/asm/pgtable_32.h | 15 +++
arch/x86/mm/highmem_32.c | 148 ++++++++++++++++++++++++++++++++++++--
include/linux/highmem.h | 12 ---
mm/highmem.c | 45 -----------
5 files changed, 164 insertions(+), 67 deletions(-)
Did
http://userweb.kernel.org/~akpm/mmotm/broken-out/kmap_types-make-most-arches-use-generic-header-file.patch
just die a horrid death?
On Mon, 2009-06-15 at 11:42 -0700, Andrew Morton wrote:
> On Mon, 15 Jun 2009 20:04:25 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > OK, utterly untested, but how does this look?
>
> arch/x86/include/asm/kmap_types.h | 11 +-
> arch/x86/include/asm/pgtable_32.h | 15 +++
> arch/x86/mm/highmem_32.c | 148 ++++++++++++++++++++++++++++++++++++--
> include/linux/highmem.h | 12 ---
> mm/highmem.c | 45 -----------
> 5 files changed, 164 insertions(+), 67 deletions(-)
>
> Did
> http://userweb.kernel.org/~akpm/mmotm/broken-out/kmap_types-make-most-arches-use-generic-header-file.patch
> just die a horrid death?
hehe, yep, but I can fix that up..
On Mon, 2009-06-15 at 20:42 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Mon, 2009-06-15 at 20:25 +0200, Ingo Molnar wrote:
> > > * Peter Zijlstra <[email protected]> wrote:
> >
> > > but ... look at the APIs i propose above. We dont need _any_
> > > 'types'.
> > >
> > > That type enumeration is basically an open-coded allocator. If we do
> > > a _real_ allocator (a balanced stack of atomic kmaps) we dont need
> > > any of those indices, and all the potential for mismatch goes away
> > > as well - a stack nests trivially with IRQ and NMI and arbitrary
> > > other contexts.
> >
> > You want types because:
> > - they encode the intent, and can be verified
> > - they help keep track of the max nesting depth
> >
> > In the proposed implementation all type code basically falls away
> > no ! CONFIG_DEBUG_VM, but is kept around for robustness.
>
> But much of the fragility of the types (and their clumsiness - for
> example in highpte ops we have to know at which level of the
> pagetables we are, and use the right kind of index) is _precisely_
> because we have the types ...
How will you manage the max depth?
* Peter Zijlstra <[email protected]> wrote:
> On Mon, 2009-06-15 at 20:42 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Mon, 2009-06-15 at 20:25 +0200, Ingo Molnar wrote:
> > > > * Peter Zijlstra <[email protected]> wrote:
> > >
> > > > but ... look at the APIs i propose above. We dont need _any_
> > > > 'types'.
> > > >
> > > > That type enumeration is basically an open-coded allocator. If we do
> > > > a _real_ allocator (a balanced stack of atomic kmaps) we dont need
> > > > any of those indices, and all the potential for mismatch goes away
> > > > as well - a stack nests trivially with IRQ and NMI and arbitrary
> > > > other contexts.
> > >
> > > You want types because:
> > > - they encode the intent, and can be verified
> > > - they help keep track of the max nesting depth
> > >
> > > In the proposed implementation all type code basically falls away
> > > no ! CONFIG_DEBUG_VM, but is kept around for robustness.
> >
> > But much of the fragility of the types (and their clumsiness - for
> > example in highpte ops we have to know at which level of the
> > pagetables we are, and use the right kind of index) is _precisely_
> > because we have the types ...
>
> How will you manage the max depth?
if (++depth == MAX_DEPTH) {
print_all_entries_and_nasty_warning();
/* hope we'll live long enough for the syslog to touch disk */
depth = 0;
}
unbalanced kmap is a bad bug - the easier we make it to catch, the
better. The system wouldnt survive anyway.
Ingo
On Mon, 2009-06-15 at 20:52 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Mon, 2009-06-15 at 20:42 +0200, Ingo Molnar wrote:
> > > * Peter Zijlstra <[email protected]> wrote:
> > >
> > > > On Mon, 2009-06-15 at 20:25 +0200, Ingo Molnar wrote:
> > > > > * Peter Zijlstra <[email protected]> wrote:
> > > >
> > > > > but ... look at the APIs i propose above. We dont need _any_
> > > > > 'types'.
> > > > >
> > > > > That type enumeration is basically an open-coded allocator. If we do
> > > > > a _real_ allocator (a balanced stack of atomic kmaps) we dont need
> > > > > any of those indices, and all the potential for mismatch goes away
> > > > > as well - a stack nests trivially with IRQ and NMI and arbitrary
> > > > > other contexts.
> > > >
> > > > You want types because:
> > > > - they encode the intent, and can be verified
> > > > - they help keep track of the max nesting depth
> > > >
> > > > In the proposed implementation all type code basically falls away
> > > > no ! CONFIG_DEBUG_VM, but is kept around for robustness.
> > >
> > > But much of the fragility of the types (and their clumsiness - for
> > > example in highpte ops we have to know at which level of the
> > > pagetables we are, and use the right kind of index) is _precisely_
> > > because we have the types ...
> >
> > How will you manage the max depth?
>
> if (++depth == MAX_DEPTH) {
> print_all_entries_and_nasty_warning();
> /* hope we'll live long enough for the syslog to touch disk */
> depth = 0;
> }
That will only trigger if we hit it, which will be _very_ rare.
> unbalanced kmap is a bad bug - the easier we make it to catch, the
> better. The system wouldnt survive anyway.
My proposed patch validates strict balance of types. But I can easily
add the above as well.
By removing the types it becomes very difficult to verify the max depth.
I really don't like removing them.
* Peter Zijlstra <[email protected]> wrote:
> On Mon, 2009-06-15 at 20:52 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Mon, 2009-06-15 at 20:42 +0200, Ingo Molnar wrote:
> > > > * Peter Zijlstra <[email protected]> wrote:
> > > >
> > > > > On Mon, 2009-06-15 at 20:25 +0200, Ingo Molnar wrote:
> > > > > > * Peter Zijlstra <[email protected]> wrote:
> > > > >
> > > > > > but ... look at the APIs i propose above. We dont need _any_
> > > > > > 'types'.
> > > > > >
> > > > > > That type enumeration is basically an open-coded allocator. If we do
> > > > > > a _real_ allocator (a balanced stack of atomic kmaps) we dont need
> > > > > > any of those indices, and all the potential for mismatch goes away
> > > > > > as well - a stack nests trivially with IRQ and NMI and arbitrary
> > > > > > other contexts.
> > > > >
> > > > > You want types because:
> > > > > - they encode the intent, and can be verified
> > > > > - they help keep track of the max nesting depth
> > > > >
> > > > > In the proposed implementation all type code basically falls away
> > > > > no ! CONFIG_DEBUG_VM, but is kept around for robustness.
> > > >
> > > > But much of the fragility of the types (and their clumsiness - for
> > > > example in highpte ops we have to know at which level of the
> > > > pagetables we are, and use the right kind of index) is _precisely_
> > > > because we have the types ...
> > >
> > > How will you manage the max depth?
> >
> > if (++depth == MAX_DEPTH) {
> > print_all_entries_and_nasty_warning();
> > /* hope we'll live long enough for the syslog to touch disk */
> > depth = 0;
> > }
>
> That will only trigger if we hit it, which will be _very_ rare.
>
> > unbalanced kmap is a bad bug - the easier we make it to catch,
> > the better. The system wouldnt survive anyway.
>
> My proposed patch validates strict balance of types. But I can
> easily add the above as well.
>
> By removing the types it becomes very difficult to verify the max
> depth. I really don't like removing them.
The fact that it implies an atomic section pretty much limits its
depth in practice, doesnt it?
All we need to track in the debug code is
max-{syscall,softirq,hardirq,nmi}. The sum of these 4 counts must be
smaller than the max - even if (as you are right to point out) we
dont hit that magic combo that truly maximizes the depth.
And note that in practice many of the current types are exclusive to
each other - so using the stack would _reduce_ the amount of
kmap-atomic space we need.
Ingo
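[ An illustrative-only sketch of that accounting, with all names invented
  for the example: track the largest stack depth each context has
  contributed, and warn once the worst-case sum can no longer fit in the
  per-CPU fixmap slots: ]

#include <linux/hardirq.h>
#include <linux/highmem.h>
#include <linux/percpu.h>

enum { CTX_TASK, CTX_SOFTIRQ, CTX_HARDIRQ, CTX_NMI, CTX_NR };

static DEFINE_PER_CPU(int, km_depth[CTX_NR]);
static DEFINE_PER_CPU(int, km_max_depth[CTX_NR]);

static int kmap_context(void)
{
        if (in_nmi())
                return CTX_NMI;
        if (in_irq())
                return CTX_HARDIRQ;
        if (in_softirq())       /* approximate: also true with BHs disabled */
                return CTX_SOFTIRQ;
        return CTX_TASK;
}

static void kmap_depth_push(void)
{
        int ctx = kmap_context();
        int i, sum = 0;

        if (++__get_cpu_var(km_depth)[ctx] > __get_cpu_var(km_max_depth)[ctx])
                __get_cpu_var(km_max_depth)[ctx] = __get_cpu_var(km_depth)[ctx];

        for (i = 0; i < CTX_NR; i++)
                sum += __get_cpu_var(km_max_depth)[i];

        /* the worst-case combined nesting must still fit in KM_TYPE_NR slots */
        WARN_ON_ONCE(sum > KM_TYPE_NR);
}

static void kmap_depth_pop(void)
{
        __get_cpu_var(km_depth)[kmap_context()]--;
}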
On Tue, 16 Jun 2009, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
> > On Mon, 2009-06-15 at 20:52 +0200, Ingo Molnar wrote:
> > > * Peter Zijlstra <[email protected]> wrote:
> > > > On Mon, 2009-06-15 at 20:42 +0200, Ingo Molnar wrote:
> > > > > * Peter Zijlstra <[email protected]> wrote:
> > > > > > On Mon, 2009-06-15 at 20:25 +0200, Ingo Molnar wrote:
> > > > > > > * Peter Zijlstra <[email protected]> wrote:
> > > > > >
> > > > > > > but ... look at the APIs i propose above. We dont need _any_
> > > > > > > 'types'.
> > > > > > >
> > > > > > > That type enumeration is basically an open-coded allocator. If we do
> > > > > > > a _real_ allocator (a balanced stack of atomic kmaps) we dont need
> > > > > > > any of those indices, and all the potential for mismatch goes away
> > > > > > > as well - a stack nests trivially with IRQ and NMI and arbitrary
> > > > > > > other contexts.
> > > > > >
> > > > > > You want types because:
> > > > > > - they encode the intent, and can be verified
> > > > > > - they help keep track of the max nesting depth
> > > > > >
> > > > > > In the proposed implementation all type code basically falls away
> > > > > > no ! CONFIG_DEBUG_VM, but is kept around for robustness.
> > > > >
> > > > > But much of the fragility of the types (and their clumsiness - for
> > > > > example in highpte ops we have to know at which level of the
> > > > > pagetables we are, and use the right kind of index) is _precisely_
> > > > > because we have the types ...
> > > >
> > > > How will you manage the max depth?
> > >
> > > if (++depth == MAX_DEPTH) {
> > > print_all_entries_and_nasty_warning();
> > > /* hope we'll live long enough for the syslog to touch disk */
> > > depth = 0;
> > > }
> >
> > That will only trigger if we hit it, which will be _very_ rare.
> >
> > > unbalanced kmap is a bad bug - the easier we make it to catch,
> > > the better. The system wouldnt survive anyway.
> >
> > My proposed patch validates strict balance of types. But I can
> > easily add the above as well.
> >
> > By removing the types it becomes very difficult to verify the max
> > depth. I really don't like removing them.
>
> The fact that it implies an atomic section pretty much limits its
> depth in practice, doesnt it?
>
> All we need to track in the debug code is
> max-{syscall,softirq,hardirq,nmi}. The sum of these 4 counts must be
> smaller than the max - even if (as you are right to point out) we
> dont hit that magic combo that truly maximizes the depth.
>
> And note that in practice many of the current types are exclusive to
> each other - so using the stack would _reduce_ the amount of
> kmap-atomic space we need.
I'll briefly resurface into the discussion before submerging again ;)
I like very much the direction you're taking this, Ingo.
Yes, that is how I've sometimes thought we should go - though when
making the kmap_push/kmap_pop suggestion to Peter yesterday, I wasn't
expecting him to make that revolution, just provide a way to save a
current KM_type mapping and restore it later, so he can safely use
the standard primitives like pte_offset_map() within.
I wasn't expecting in_nmi() and in_irq() tests still to be there,
even if only in the debug code. I can understand Peter's lockdep background
wanting to retain the checking and KM_types, but if we're actually
going to overhaul this area, I'd love just to get rid of them.
Yes, that should reduce the amount of kmap_atomic space needed;
though I've not thought how we keep track of the maximum needed
as the kernel goes on developing.
There might be a very few places where we expect to kmap_atomic A,
kmap_atomic B, kunmap_atomic A, kunmap_atomic B?
Something else to throw in: what if they were not just atomic,
but also replaced the current sleeping kmaps? i.e. a task context
carries around its own stack of these.
I've always rejected that as introducing a pretty terrible overhead
just where we don't want it; but maybe you're ingenious enough to
devise ways of amortizing that cost.
It would be nice to delete mm/highmem.c if we could. Ah, but there
are probably places where one task passes a kmap address to another?
Hugh
On Tue, 2009-06-16 at 10:13 +0200, Ingo Molnar wrote:
> > By removing the types it becomes very difficult to verify the max
> > depth. I really don't like removing them.
>
> The fact that it implies an atomic section pretty much limits its
> depth in practice, doesnt it?
>
> All we need to track in the debug code is
> max-{syscall,softirq,hardirq,nmi}. The sum of these 4 counts must be
> smaller than the max - even if (as you are right to point out) we
> dont hit that magic combo that truly maximizes the depth.
Right, so the thing I'd worry about is someone adding kmap_atomic() to
an interrupt context that didn't have interrupts disabled and then
managing to nest that a few times.
Suppose you put it in some IO completion handler, and someone has 4 IO
controllers installed and all 4 IO interrupts come in at the 'same'
time.
With types you could warn on that, similarly to what we do today, but
with the simple push/pop that might be a lot harder.
Anyway, with the whole cr2 fiddling bit being discussed this seems to
become redundant.
> And note that in practice many of the current types are exclusive to
> each other - so using the stack would _reduce_ the amount of
> kmap-atomic space we need.
Yeah, I did realize that.
On Tue, 2009-06-16 at 13:38 +0100, Hugh Dickins wrote:
>
> Something else to throw in: what if they were not just atomic,
> but also replaced the current sleeping kmaps? i.e. a task context
> carries around its own stack of these.
I actually did that once, but it means the task needs to be cpu-affine,
because fixmaps have different addresses between cpus. And disabling
migration for tasks has subtle side-effects so I dropped that again.
However, I recently considered the possibility of putting the fixmaps in
the new per-cpu address space so that we might use the %gs segment to
normalize the fixmap addresses between the cpus.
This would allow full preemptible kmaps (yay for -rt).
However I suspect it might greatly complicate kmaps for the !i386 world.
Hello,
Peter Zijlstra wrote:
> On Tue, 2009-06-16 at 13:38 +0100, Hugh Dickins wrote:
>> Something else to throw in: what if they were not just atomic,
>> but also replaced the current sleeping kmaps? i.e. a task context
>> carries around its own stack of these.
>
> I actually did that once, but it means the task needs to be cpu-affine,
> because fixmaps have different addresses between cpus. And disabling
> migration for tasks has subtle side-effects so I dropped that again.
>
> > However, I recently considered the possibility of putting the fixmaps in
> the new per-cpu address space so that we might use the %gs segment to
> normalize the fixmap addresses between the cpus.
>
> This would allow full preemptible kmaps (yay for -rt).
>
> However I suspect it might greatly complicate kmaps for the !i386 world.
Other archs are in the process of conversion, so once that is complete
there's no reason this should be more difficult, but it means that
kmapped addresses would have to be accessed differently from regular ones,
which we can't do. :-(
Thanks.
--
tejun
On Wed, 2009-06-17 at 17:43 +0900, Tejun Heo wrote:
> Hello,
>
> Peter Zijlstra wrote:
> > On Tue, 2009-06-16 at 13:38 +0100, Hugh Dickins wrote:
> >> Something else to throw in: what if they were not just atomic,
> >> but also replaced the current sleeping kmaps? i.e. a task context
> >> carries around its own stack of these.
> >
> > I actually did that once, but it means the task needs to be cpu-affine,
> > because fixmaps have different addresses between cpus. And disabling
> > migration for tasks has subtle side-effects so I dropped that again.
> >
> > However, I recently considered the possiblity of putting the fixmaps in
> > the new per-cpu address space so that we might use the %gs segment to
> > normalize the fixmap addresses between the cpus.
> >
> > This would allow full preemptible kmaps (yay for -rt).
> >
> > However I suspect it might greatly complicate kmaps for the !i386 world.
>
> Other archs are in the process of conversion so once that is complete,
> there's no reason this should be more difficult but it means that
> kmapped addresses should be accessed differently from regular ones
> which we can't do. :-(
Right, we'd need percpu_memset, percpu_copy_from, percpu_copy_to, etc.,
that all operate on %gs-relative addresses. Might be a tad invasive.
* Peter Zijlstra <[email protected]> wrote:
> On Tue, 2009-06-16 at 10:13 +0200, Ingo Molnar wrote:
>
> > > By removing the types it becomes very difficult to verify the max
> > > depth. I really don't like removing them.
> >
> > The fact that it implies an atomic section pretty much limits its
> > depth in practice, doesnt it?
> >
> > All we need to track in the debug code is
> > max-{syscall,softirq,hardirq,nmi}. The sum of these 4 counts
> > must be smaller than the max - even if (as you are right to
> > point out) we dont hit that magic combo that truly maximizes the
> > depth.
>
> Right, so the thing I'd worry about is someone adding
> kmap_atomic() to an interrupt context that didn't have interrupts
> disabled and then managing to nest that a few times.
>
> Suppose you put it in some IO completion handler, and someone has
> 4 IO controllers installed and all 4 IO interrupts come in at the
> 'same' time.
>
> With types you could warn on similarly to what we do today, but
> with the simple push/pop that might be a lot harder.
Yes, fixed-purpose allocations are easier to warn about - they imply
more constraints, no doubt about that.
But we could warn about using kmap_atomic in irq context with
irqs enabled, and thus exclude the case you are worried about?
> Anyway, with the whole cr2 fiddling bit being discussed this seems
> to become redundant.
It's not just the cr2 fiddling but also conversion of pagefault
returns from IRET to RET. The kmap_atomic API change is a nice
cleanup in itself.
Ingo