Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
- all available page flags are exported, and
- exported as is
2) for admins and end users
- only the more `well known' flags are exported:
11. KPF_MMAP (pseudo flag) memory mapped page
12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
13. KPF_SWAPCACHE page is in swap cache
14. KPF_SWAPBACKED page is swap/RAM backed
15. KPF_COMPOUND_HEAD (*)
16. KPF_COMPOUND_TAIL (*)
17. KPF_UNEVICTABLE page is in the unevictable LRU list
18. KPF_HWPOISON hardware detected corruption
19. KPF_NOPAGE (pseudo flag) no page frame at the address
(*) For compound pages, exporting _both_ head/tail info enables
users to tell where a compound page starts/ends, and its order.
- limit flags to their typical usage scenario, as indicated by KOSAKI:
- LRU pages: only export relevant flags
- PG_lru
- PG_unevictable
- PG_active
- PG_referenced
- page_mapped()
- PageAnon()
- PG_swapcache
- PG_swapbacked
- PG_reclaim
- no-IO pages: mask out irrelevant flags
- PG_dirty
- PG_uptodate
- PG_writeback
- SLAB pages: mask out overloaded flags:
- PG_error
- PG_active
- PG_private
- PG_reclaim: mask out the overloaded PG_readahead
- compound flags: only export huge/gigantic pages
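For reference, user space consumes this interface by pread()ing one
64-bit flags word per pfn from /proc/kpageflags and testing individual
bits. A minimal sketch of such a reader (a toy, not the page-types tool
itself; the bit numbers are the user-visible ones defined by this patch):

#define _XOPEN_SOURCE 500	/* for pread() */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define KPF_UPTODATE		3
#define KPF_LRU			5
#define KPF_MMAP		11
#define KPF_ANON		12
#define KPF_COMPOUND_HEAD	15
#define KPF_COMPOUND_TAIL	16
#define KPF_NOPAGE		19

int main(int argc, char **argv)
{
	unsigned long pfn = argc > 1 ? strtoul(argv[1], NULL, 0) : 0;
	uint64_t flags;
	int fd = open("/proc/kpageflags", O_RDONLY);

	/* one 64-bit flags word per page frame, indexed by pfn */
	if (fd < 0 || pread(fd, &flags, 8, pfn * 8) != 8) {
		perror("kpageflags");
		return 1;
	}
	if (flags & (1ULL << KPF_NOPAGE)) {
		printf("pfn %lu: no page frame\n", pfn);
		return 0;
	}
	printf("pfn %lu:%s%s%s%s%s%s\n", pfn,
	       flags & (1ULL << KPF_UPTODATE)      ? " uptodate" : "",
	       flags & (1ULL << KPF_LRU)           ? " lru" : "",
	       flags & (1ULL << KPF_MMAP)          ? " mmap" : "",
	       flags & (1ULL << KPF_ANON)          ? " anon" : "",
	       flags & (1ULL << KPF_COMPOUND_HEAD) ? " compound_head" : "",
	       flags & (1ULL << KPF_COMPOUND_TAIL) ? " compound_tail" : "");
	return 0;
}

Scanning a pfn range for KPF_COMPOUND_HEAD/KPF_COMPOUND_TAIL runs is then
enough to recover each compound page's start/end and order, per (*) above.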
Here are the admin/linus views of all page flags on a newly booted nfs-root system:
# ./page-types # for admin
flags page-count MB symbolic-flags long-symbolic-flags
0x000000000000 491174 1918 ____________________________
0x000000000020 1 0 _____l______________________ lru
0x000000000028 2543 9 ___U_l______________________ uptodate,lru
0x00000000002c 5288 20 __RU_l______________________ referenced,uptodate,lru
0x000000004060 1 0 _____lA_______b_____________ lru,active,swapbacked
0x000000004064 19 0 __R__lA_______b_____________ referenced,lru,active,swapbacked
0x000000000068 225 0 ___U_lA_____________________ uptodate,lru,active
0x00000000006c 969 3 __RU_lA_____________________ referenced,uptodate,lru,active
0x000000000080 6832 26 _______S____________________ slab
0x000000000400 576 2 __________B_________________ buddy
0x000000000828 1159 4 ___U_l_____M________________ uptodate,lru,mmap
0x00000000082c 310 1 __RU_l_____M________________ referenced,uptodate,lru,mmap
0x000000004860 2 0 _____lA____M__b_____________ lru,active,mmap,swapbacked
0x000000000868 375 1 ___U_lA____M________________ uptodate,lru,active,mmap
0x00000000086c 635 2 __RU_lA____M________________ referenced,uptodate,lru,active,mmap
0x000000005860 3831 14 _____lA____Ma_b_____________ lru,active,mmap,anonymous,swapbacked
0x000000005864 28 0 __R__lA____Ma_b_____________ referenced,lru,active,mmap,anonymous,swapbacked
total 513968 2007
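(To decode a flags value by hand: 0x828 = (1 << 3) | (1 << 5) | (1 << 11),
i.e. bits KPF_UPTODATE=3, KPF_LRU=5 and KPF_MMAP=11 - hence the
'uptodate,lru,mmap' line above.)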
# ./page-types # for linus, when CONFIG_DEBUG_KERNEL is turned on
flags page-count MB symbolic-flags long-symbolic-flags
0x000000000000 471058 1840 ____________________________
0x000100000000 19288 75 ____________________r_______ reserved
0x000000010000 1064 4 ________________T___________ compound_tail
0x000000008000 1 0 _______________H____________ compound_head
0x000000008014 1 0 __R_D__________H____________ referenced,dirty,compound_head
0x000000010014 4 0 __R_D___________T___________ referenced,dirty,compound_tail
0x000000000020 1 0 _____l______________________ lru
0x000000000028 2522 9 ___U_l______________________ uptodate,lru
0x00000000002c 5207 20 __RU_l______________________ referenced,uptodate,lru
0x000000000068 203 0 ___U_lA_____________________ uptodate,lru,active
0x00000000006c 869 3 __RU_lA_____________________ referenced,uptodate,lru,active
0x000000004078 1 0 ___UDlA_______b_____________ uptodate,dirty,lru,active,swapbacked
0x00000000407c 19 0 __RUDlA_______b_____________ referenced,uptodate,dirty,lru,active,swapbacked
0x000000000080 5989 23 _______S____________________ slab
0x000000008080 778 3 _______S_______H____________ slab,compound_head
0x000000000228 44 0 ___U_l___I__________________ uptodate,lru,reclaim
0x00000000022c 39 0 __RU_l___I__________________ referenced,uptodate,lru,reclaim
0x000000000268 12 0 ___U_lA__I__________________ uptodate,lru,active,reclaim
0x00000000026c 44 0 __RU_lA__I__________________ referenced,uptodate,lru,active,reclaim
0x000000000400 550 2 __________B_________________ buddy
0x000000000804 1 0 __R________M________________ referenced,mmap
0x000000000828 1068 4 ___U_l_____M________________ uptodate,lru,mmap
0x00000000082c 326 1 __RU_l_____M________________ referenced,uptodate,lru,mmap
0x000000000868 335 1 ___U_lA____M________________ uptodate,lru,active,mmap
0x00000000086c 599 2 __RU_lA____M________________ referenced,uptodate,lru,active,mmap
0x000000004878 2 0 ___UDlA____M__b_____________ uptodate,dirty,lru,active,mmap,swapbacked
0x000000000a28 44 0 ___U_l___I_M________________ uptodate,lru,reclaim,mmap
0x000000000a2c 12 0 __RU_l___I_M________________ referenced,uptodate,lru,reclaim,mmap
0x000000000a68 8 0 ___U_lA__I_M________________ uptodate,lru,active,reclaim,mmap
0x000000000a6c 31 0 __RU_lA__I_M________________ referenced,uptodate,lru,active,reclaim,mmap
0x000000001000 442 1 ____________a_______________ anonymous
0x000000005808 7 0 ___U_______Ma_b_____________ uptodate,mmap,anonymous,swapbacked
0x000000005868 3371 13 ___U_lA____Ma_b_____________ uptodate,lru,active,mmap,anonymous,swapbacked
0x00000000586c 28 0 __RU_lA____Ma_b_____________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
total 513968 2007
Thanks to KOSAKI and Andi for their valuable recommendations!
Cc: KOSAKI Motohiro <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Matt Mackall <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/proc/page.c | 197 +++++++++++++++++++++++++++++++++++++++--------
1 file changed, 167 insertions(+), 30 deletions(-)
--- mm.orig/fs/proc/page.c
+++ mm/fs/proc/page.c
@@ -6,6 +6,7 @@
#include <linux/mmzone.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
+#include <linux/backing-dev.h>
#include <asm/uaccess.h>
#include "internal.h"
@@ -70,19 +71,172 @@ static const struct file_operations proc
/* These macros are used to decouple internal flags from exported ones */
-#define KPF_LOCKED 0
-#define KPF_ERROR 1
-#define KPF_REFERENCED 2
-#define KPF_UPTODATE 3
-#define KPF_DIRTY 4
-#define KPF_LRU 5
-#define KPF_ACTIVE 6
-#define KPF_SLAB 7
-#define KPF_WRITEBACK 8
-#define KPF_RECLAIM 9
-#define KPF_BUDDY 10
+#define KPF_LOCKED 0
+#define KPF_ERROR 1
+#define KPF_REFERENCED 2
+#define KPF_UPTODATE 3
+#define KPF_DIRTY 4
+#define KPF_LRU 5
+#define KPF_ACTIVE 6
+#define KPF_SLAB 7
+#define KPF_WRITEBACK 8
+#define KPF_RECLAIM 9
+#define KPF_BUDDY 10
+
+/* new additions in 2.6.31 */
+#define KPF_MMAP 11
+#define KPF_ANON 12
+#define KPF_SWAPCACHE 13
+#define KPF_SWAPBACKED 14
+#define KPF_COMPOUND_HEAD 15
+#define KPF_COMPOUND_TAIL 16
+#define KPF_UNEVICTABLE 17
+#define KPF_HWPOISON 18
+#define KPF_NOPAGE 19
+
+/* kernel hacking assistances */
+#define KPF_RESERVED 32
+#define KPF_MLOCKED 33
+#define KPF_MAPPEDTODISK 34
+#define KPF_PRIVATE 35
+#define KPF_PRIVATE2 36
+#define KPF_OWNER_PRIVATE 37
+#define KPF_ARCH 38
+#define KPF_UNCACHED 39
+
+/*
+ * Kernel flags are exported faithfully to Linus and his fellow hackers.
+ * Otherwise some details are masked to avoid confusing the end user:
+ * - some kernel flags are completely invisible
+ * - some kernel flags are conditionally invisible on their odd usages
+ */
+#ifdef CONFIG_DEBUG_KERNEL
+static inline int genuine_linus(void) { return 1; }
+#else
+static inline int genuine_linus(void) { return 0; }
+#endif
+
+#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit) \
+ do { \
+ if (visible || genuine_linus()) \
+ uflags |= ((kflags >> kbit) & 1) << ubit; \
+ } while (0)
+
+/* a helper function _not_ intended for more general uses */
+static inline int page_cap_writeback_dirty(struct page *page)
+{
+ struct address_space *mapping;
+
+ if (!PageSlab(page))
+ mapping = page_mapping(page);
+ else
+ mapping = NULL;
+
+ return mapping && mapping_cap_writeback_dirty(mapping);
+}
+
+static u64 get_uflags(struct page *page)
+{
+ u64 k;
+ u64 u;
+ int io;
+ int lru;
+ int slab;
+
+ /*
+ * pseudo flag: KPF_NOPAGE
+ * it differentiates a memory hole from a page with no flags
+ */
+ if (!page)
+ return 1 << KPF_NOPAGE;
+
+ k = page->flags;
+ u = 0;
+
+ io = page_cap_writeback_dirty(page);
+ lru = k & (1 << PG_lru);
+ slab = k & (1 << PG_slab);
+
+ /*
+ * pseudo flags for the well known (anonymous) memory mapped pages
+ */
+ if (lru || genuine_linus()) {
+ if (!slab && page_mapped(page))
+ u |= 1 << KPF_MMAP;
+ if (PageAnon(page))
+ u |= 1 << KPF_ANON;
+ }
-#define kpf_copy_bit(flags, dstpos, srcpos) (((flags >> srcpos) & 1) << dstpos)
+ /*
+ * compound pages: export both head/tail info
+ * they together define a compound page's start/end pos and order
+ */
+ if (PageHuge(page) || genuine_linus()) {
+ if (PageHead(page))
+ u |= 1 << KPF_COMPOUND_HEAD;
+ if (PageTail(page))
+ u |= 1 << KPF_COMPOUND_TAIL;
+ }
+
+ kpf_copy_bit(u, k, 1, KPF_LOCKED, PG_locked);
+
+ /*
+ * Caveats on high order pages:
+ * PG_buddy will only be set on the head page; SLUB/SLQB do the same
+ * for PG_slab; SLOB won't set PG_slab at all on compound pages.
+ */
+ kpf_copy_bit(u, k, 1, KPF_SLAB, PG_slab);
+ kpf_copy_bit(u, k, 1, KPF_BUDDY, PG_buddy);
+
+ kpf_copy_bit(u, k, io, KPF_ERROR, PG_error);
+ kpf_copy_bit(u, k, io, KPF_DIRTY, PG_dirty);
+ kpf_copy_bit(u, k, io, KPF_UPTODATE, PG_uptodate);
+ kpf_copy_bit(u, k, io, KPF_WRITEBACK, PG_writeback);
+
+ kpf_copy_bit(u, k, 1, KPF_LRU, PG_lru);
+ kpf_copy_bit(u, k, lru, KPF_REFERENCED, PG_referenced);
+ kpf_copy_bit(u, k, lru, KPF_ACTIVE, PG_active);
+ kpf_copy_bit(u, k, lru, KPF_RECLAIM, PG_reclaim);
+
+ kpf_copy_bit(u, k, lru, KPF_SWAPCACHE, PG_swapcache);
+ kpf_copy_bit(u, k, lru, KPF_SWAPBACKED, PG_swapbacked);
+
+#ifdef CONFIG_MEMORY_FAILURE
+ kpf_copy_bit(u, k, 1, KPF_HWPOISON, PG_hwpoison);
+#endif
+
+#ifdef CONFIG_UNEVICTABLE_LRU
+ kpf_copy_bit(u, k, lru, KPF_UNEVICTABLE, PG_unevictable);
+ kpf_copy_bit(u, k, 0, KPF_MLOCKED, PG_mlocked);
+#endif
+
+ kpf_copy_bit(u, k, 0, KPF_RESERVED, PG_reserved);
+ kpf_copy_bit(u, k, 0, KPF_MAPPEDTODISK, PG_mappedtodisk);
+ kpf_copy_bit(u, k, 0, KPF_PRIVATE, PG_private);
+ kpf_copy_bit(u, k, 0, KPF_PRIVATE2, PG_private_2);
+ kpf_copy_bit(u, k, 0, KPF_OWNER_PRIVATE, PG_owner_priv_1);
+ kpf_copy_bit(u, k, 0, KPF_ARCH, PG_arch_1);
+
+#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
+ kpf_copy_bit(u, k, 0, KPF_UNCACHED, PG_uncached);
+#endif
+
+ if (!genuine_linus()) {
+ /*
+ * SLUB overloads some page flags which may confuse end user.
+ */
+ if (slab)
+ u &= ~((1 << KPF_ACTIVE) | (1 << KPF_ERROR));
+ /*
+ * PG_reclaim could be overloaded as PG_readahead,
+ * and we only want to export the first one.
+ */
+ if (!(u & (1 << KPF_WRITEBACK)))
+ u &= ~(1 << KPF_RECLAIM);
+ }
+
+ return u;
+}
static ssize_t kpageflags_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
@@ -92,7 +246,6 @@ static ssize_t kpageflags_read(struct fi
unsigned long src = *ppos;
unsigned long pfn;
ssize_t ret = 0;
- u64 kflags, uflags;
pfn = src / KPMSIZE;
count = min_t(unsigned long, count, (max_pfn * KPMSIZE) - src);
@@ -104,24 +257,8 @@ static ssize_t kpageflags_read(struct fi
ppage = pfn_to_page(pfn);
else
ppage = NULL;
- if (!ppage)
- kflags = 0;
- else
- kflags = ppage->flags;
-
- uflags = kpf_copy_bit(kflags, KPF_LOCKED, PG_locked) |
- kpf_copy_bit(kflags, KPF_ERROR, PG_error) |
- kpf_copy_bit(kflags, KPF_REFERENCED, PG_referenced) |
- kpf_copy_bit(kflags, KPF_UPTODATE, PG_uptodate) |
- kpf_copy_bit(kflags, KPF_DIRTY, PG_dirty) |
- kpf_copy_bit(kflags, KPF_LRU, PG_lru) |
- kpf_copy_bit(kflags, KPF_ACTIVE, PG_active) |
- kpf_copy_bit(kflags, KPF_SLAB, PG_slab) |
- kpf_copy_bit(kflags, KPF_WRITEBACK, PG_writeback) |
- kpf_copy_bit(kflags, KPF_RECLAIM, PG_reclaim) |
- kpf_copy_bit(kflags, KPF_BUDDY, PG_buddy);
- if (put_user(uflags, out)) {
+ if (put_user(get_uflags(ppage), out)) {
ret = -EFAULT;
break;
}
--
* Wu Fengguang <[email protected]> wrote:
> Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
>
> 1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
> - all available page flags are exported, and
> - exported as is
> 2) for admins and end users
> - only the more `well known' flags are exported:
> 11. KPF_MMAP (pseudo flag) memory mapped page
> 12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
> 13. KPF_SWAPCACHE page is in swap cache
> 14. KPF_SWAPBACKED page is swap/RAM backed
> 15. KPF_COMPOUND_HEAD (*)
> 16. KPF_COMPOUND_TAIL (*)
> 17. KPF_UNEVICTABLE page is in the unevictable LRU list
> 18. KPF_HWPOISON hardware detected corruption
> 19. KPF_NOPAGE (pseudo flag) no page frame at the address
>
> (*) For compound pages, exporting _both_ head/tail info enables
> users to tell where a compound page starts/ends, and its order.
>
> - limit flags to their typical usage scenario, as indicated by KOSAKI:
> - LRU pages: only export relevant flags
> - PG_lru
> - PG_unevictable
> - PG_active
> - PG_referenced
> - page_mapped()
> - PageAnon()
> - PG_swapcache
> - PG_swapbacked
> - PG_reclaim
> - no-IO pages: mask out irrelevant flags
> - PG_dirty
> - PG_uptodate
> - PG_writeback
> - SLAB pages: mask out overloaded flags:
> - PG_error
> - PG_active
> - PG_private
> - PG_reclaim: mask out the overloaded PG_readahead
> - compound flags: only export huge/gigantic pages
>
> Here are the admin/linus views of all page flags on a newly booted nfs-root system:
>
> # ./page-types # for admin
> flags page-count MB symbolic-flags long-symbolic-flags
> 0x000000000000 491174 1918 ____________________________
> 0x000000000020 1 0 _____l______________________ lru
> 0x000000000028 2543 9 ___U_l______________________ uptodate,lru
> 0x00000000002c 5288 20 __RU_l______________________ referenced,uptodate,lru
> 0x000000004060 1 0 _____lA_______b_____________ lru,active,swapbacked
I think i have to NAK this kind of ad-hoc instrumentation of kernel
internals and statistics until we clear up why such instrumentation
measures are being accepted into the MM while other, more dynamic
and more flexible MM instrumentation are being resisted by Andrew.
The above type of condensed information can be built out of dynamic
trace data too - and much more. Being able to track page state
transitions is very valuable when debugging VM problems. One such
'view' of trace data would be a summary histogram like above.
( done after a "echo 3 > /proc/sys/vm/drop_caches" to make sure all
interesting pages have been re-established and their state is
present in the trace. )
The SLAB code already has such a facility, kmemtrace: it's very
useful and successful in visualizing complex SLAB details, both
dynamically and statically.
I think the same general approach should be used for the page
allocator too (and for the page cache and some other struct page
based caches): the life-time of an object should be followed. If we
capture the important details we capture the big picture too. Pekka
already sent an RFC patch to extend kmemtrace in such a fashion. Why
is that more useful method not being pursued?
By extending upon the (existing) /proc/kpageflags hack a usecase is
taken away from the tracing based solution and a needless overlap is
created - and that's not particularly helpful IMHO. We now have all
the facilities upstream that allow us to do intelligent
instrumentation - we should make use of them.
Ingo
> I think i have to NAK this kind of ad-hoc instrumentation of kernel
> internals and statistics until we clear up why such instrumentation
I think because it has zero fast path overhead and can be used
any time without enabling anything special.
> measures are being accepted into the MM while other, more dynamic
While the dynamic instrumentation you're proposing
has non-zero fast path overhead, especially if you consider the
CPU time needed for the backend computation in user space too.
And it requires explicit tracing first and some backend
that counts the events and maintains a shadow data structure
covering all of mem_map again.
So it's clear your alternative will be much more costly, plus
have additional drawbacks (needs enabling first, cannot
take a snapshot at an arbitrary time).
Also dynamic tracing tends to have trouble with full memory
observation. I experimented with systemtap tracing for my
memory usage paper I did a couple of years ago, but ended
up with integrated counters (similar to those) because it was
impossible to do proper accounting for the pages set up
in early boot with the standard tracers.
I suspect both have their uses (there are indeed some things
that can only be done with dynamic tracing), but they're clearly
complementary and the static facility seems useful enough
on its own.
I think Fengguang is demonstrating that clearly by the great
improvements he's doing for readahead, which are enabled by these
patches.
-Andi
Hi Andi,
On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
> > I think i have to NAK this kind of ad-hoc instrumentation of kernel
> > internals and statistics until we clear up why such instrumentation
>
> I think because it has zero fast path overhead and can be used
> any time without enabling anything special.
Yes, zero overhead is important for certain things (like
CONFIG_SLUB_STATS, for example). However, putting slab allocator
specific checks in fs/proc looks pretty fragile to me. It would be nice
to have this under the "kmemtrace umbrella" so that there's just one
place that needs to be fixed up when allocators change.
Also, while you probably don't want to use tracepoints for this kind of
instrumentation, you might want to look into reusing the ftrace
reporting bits.
Pekka
> Yes, zero overhead is important for certain things (like
> CONFIG_SLUB_STATS, for example). However, putting slab allocator
> specific checks in fs/proc looks pretty fragile to me. It would be nice
Ok, perhaps that could be put into an inline in slab.h. Would
that address your concerns?
> Also, while you probably don't want to use tracepoints for this kind of
> instrumentation, you might want to look into reusing the ftrace
> reporting bits.
There's already perfectly fine code in tree for this, I don't see why it would
need another infrastructure that doesn't really fit anyways.
-Andi
--
[email protected] -- Speaking for myself only.
On Tue, Apr 28, 2009 at 08:55:07AM +0200, Ingo Molnar wrote:
>
> * Wu Fengguang <[email protected]> wrote:
>
> > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> >
> > 1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
> > - all available page flags are exported, and
> > - exported as is
> > 2) for admins and end users
> > - only the more `well known' flags are exported:
> > 11. KPF_MMAP (pseudo flag) memory mapped page
> > 12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
> > 13. KPF_SWAPCACHE page is in swap cache
> > 14. KPF_SWAPBACKED page is swap/RAM backed
> > 15. KPF_COMPOUND_HEAD (*)
> > 16. KPF_COMPOUND_TAIL (*)
> > 17. KPF_UNEVICTABLE page is in the unevictable LRU list
> > 18. KPF_HWPOISON hardware detected corruption
> > 19. KPF_NOPAGE (pseudo flag) no page frame at the address
> >
> > (*) For compound pages, exporting _both_ head/tail info enables
> > users to tell where a compound page starts/ends, and its order.
> >
> > - limit flags to their typical usage scenario, as indicated by KOSAKI:
> > - LRU pages: only export relevant flags
> > - PG_lru
> > - PG_unevictable
> > - PG_active
> > - PG_referenced
> > - page_mapped()
> > - PageAnon()
> > - PG_swapcache
> > - PG_swapbacked
> > - PG_reclaim
> > - no-IO pages: mask out irrelevant flags
> > - PG_dirty
> > - PG_uptodate
> > - PG_writeback
> > - SLAB pages: mask out overloaded flags:
> > - PG_error
> > - PG_active
> > - PG_private
> > - PG_reclaim: mask out the overloaded PG_readahead
> > - compound flags: only export huge/gigantic pages
> >
> > Here are the admin/linus views of all page flags on a newly booted nfs-root system:
> >
> > # ./page-types # for admin
> > flags page-count MB symbolic-flags long-symbolic-flags
> > 0x000000000000 491174 1918 ____________________________
> > 0x000000000020 1 0 _____l______________________ lru
> > 0x000000000028 2543 9 ___U_l______________________ uptodate,lru
> > 0x00000000002c 5288 20 __RU_l______________________ referenced,uptodate,lru
> > 0x000000004060 1 0 _____lA_______b_____________ lru,active,swapbacked
>
> I think i have to NAK this kind of ad-hoc instrumentation of kernel
> internals and statistics until we clear up why such instrumentation
> measures are being accepted into the MM while other, more dynamic
> and more flexible MM instrumentation are being resisted by Andrew.
An unexpected NAK - to throw away an orange because we are about to have an apple? ;-)
Anyway, here are the missing rationales.
1) FAST
It takes merely 0.2s to scan 4GB pages:
./page-types 0.02s user 0.20s system 99% cpu 0.216 total
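(Assuming 4KB pages, 4GB is about one million pages, i.e. just an ~8MB
sequential read of the 8-bytes-per-pfn /proc/kpageflags file, so the low
cost is no surprise.)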
2) SIMPLE
/proc/kpageflags will be a *long-standing* hack we have to live with -
it was originally introduced by Matt for shared memory accounting and
as a facility to analyze applications' memory consumption, with the hope
that it would also help kernel developers someday.
So why not extend and embrace it, in a straightforward way?
3) USE CASES
I have/will take advantage of the above page-types command in a number of ways:
- to help track down memory leaks (the recent trace/ring_buffer.c case)
- to estimate the system-wide readahead miss ratio
- Andi wants to examine the major page types in different workloads
(for the hwpoison work)
- Me too, for the fun of learning: read/write/lock/whatever a lot of pages
and examine their flags, to get an idea of some random kernel behaviors.
(the dynamic tracing tools can be more helpful, as a different view)
4) COMPLEMENTARITY
In some cases a dynamic tracing tool is not enough (or is too complex)
to rebuild the current status view.
I myself have a dynamic readahead tracing tool (very useful!).
At the same time I also use readahead accounting numbers, and the
/proc/filecache tool (frequently!), and the above page-types tool.
I simply need them all - they are handy for different cases.
Thanks,
Fengguang
> The above type of condensed information can be built out of dynamic
> trace data too - and much more. Being able to track page state
> transitions is very valuable when debugging VM problems. One such
> 'view' of trace data would be a summary histogram like above.
>
> ( done after a "echo 3 > /proc/sys/vm/drop_caches" to make sure all
> interesting pages have been re-established and their state is
> present in the trace. )
>
> The SLAB code already has such a facility, kmemtrace: it's very
> useful and successful in visualizing complex SLAB details, both
> dynamically and statically.
>
> I think the same general approach should be used for the page
> allocator too (and for the page cache and some other struct page
> based caches): the life-time of an object should be followed. If we
> capture the important details we capture the big picture too. Pekka
> already sent an RFC patch to extend kmemtrace in such a fashion. Why
> is that more useful method not being pursued?
>
> By extending upon the (existing) /proc/kpageflags hack a usecase is
> taken away from the tracing based solution and a needless overlap is
> created - and that's not particularly helpful IMHO. We now have all
> the facilities upstream that allow us to do intelligent
> instrumentation - we should make use of them.
>
> Ingo
>
Hi Andi,
On Tue, Apr 28, 2009 at 12:10 PM, Andi Kleen <[email protected]> wrote:
>> Yes, zero overhead is important for certain things (like
>> CONFIG_SLUB_STATS, for example). However, putting slab allocator
>> specific checks in fs/proc looks pretty fragile to me. It would be nice
>
> Ok, perhaps that could be put into an inline in slab.h. Would
> that address your concerns?
Yeah, I'm fine with that. Putting them in the individual
slub_def.h/slob_def.h headers might be even better.
On Tue, Apr 28, 2009 at 12:10 PM, Andi Kleen <[email protected]> wrote:
>> Also, while you probably don't want to use tracepoints for this kind of
>> instrumentation, you might want to look into reusing the ftrace
>> reporting bits.
>
> There's already perfectly fine code in tree for this, I don't see why it would
> need another infrastructure that doesn't really fit anyways.
It's just that I suspect that we want page flag printing and
zero-overhead statistics for kmemtrace at some point. But anyway, I'm
not objecting to extending /proc/kpageflags if that's what people want
to do.
Pekka
* Pekka Enberg <[email protected]> wrote:
> Hi Andi,
>
> On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
> > > I think i have to NAK this kind of ad-hoc instrumentation of kernel
> > > internals and statistics until we clear up why such instrumentation
> >
> > I think because it has zero fast path overhead and can be used
> > any time without enabling anything special.
( That's a dubious claim in any case - tracepoints are very cheap.
And they could be made even cheaper and such efforts would benefit
all the tracepoint users so it's a prime focus of interest.
Andi is a SystemTap proponent, right? I saw him oppose pretty much
everything built-in kernel tracing related. I consider that a
pretty extreme position. )
> Yes, zero overhead is important for certain things (like
> CONFIG_SLUB_STATS, for example). However, putting slab allocator
> specific checks in fs/proc looks pretty fragile to me. It would be
> nice to have this under the "kmemtrace umbrella" so that there's
> just one place that needs to be fixed up when allocators change.
>
> Also, while you probably don't want to use tracepoints for this
> kind of instrumentation, you might want to look into reusing the
> ftrace reporting bits.
Exactly - we have a tracing and statistics framework for a reason.
Ingo
Hi Ingo,
On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
>> > > I think i have to NAK this kind of ad-hoc instrumentation of kernel
>> > > internals and statistics until we clear up why such instrumentation
* Pekka Enberg <[email protected]> wrote:
>> > I think because it has zero fast path overhead and can be used
>> > any time without enabling anything special.
On Tue, Apr 28, 2009 at 12:15 PM, Ingo Molnar <[email protected]> wrote:
> ( That's a dubious claim in any case - tracepoints are very cheap.
>   And they could be made even cheaper and such efforts would benefit
>   all the tracepoint users so it's a prime focus of interest.
>   Andi is a SystemTap proponent, right? I saw him oppose pretty much
>   everything built-in kernel tracing related. I consider that a
>   pretty extreme position. )
I have no idea how expensive tracepoints are but I suspect they don't
make too much sense for this particular scenario. After all, kmemtrace
is mainly interested in _allocation patterns_ whereas this patch seems
to be more interested in "memory layout" type of things.
Pekka
On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
>>> > > I think i have to NAK this kind of ad-hoc instrumentation of kernel
>>> > > internals and statistics until we clear up why such instrumentation
* Pekka Enberg <[email protected]> wrote:
>>> > I think because it has zero fast path overhead and can be used
>>> > any time without enabling anything special.
>
> On Tue, Apr 28, 2009 at 12:15 PM, Ingo Molnar <[email protected]> wrote:
>> ( That's a dubious claim in any case - tracepoints are very cheap.
>>   And they could be made even cheaper and such efforts would benefit
>>   all the tracepoint users so it's a prime focus of interest.
>>   Andi is a SystemTap proponent, right? I saw him oppose pretty much
>>   everything built-in kernel tracing related. I consider that a
>>   pretty extreme position. )
On Tue, Apr 28, 2009 at 12:19 PM, Pekka Enberg <[email protected]> wrote:
> I have no idea how expensive tracepoints are but I suspect they don't
> make too much sense for this particular scenario. After all, kmemtrace
> is mainly interested in _allocation patterns_ whereas this patch seems
> to be more interested in "memory layout" type of things.
That said, I do foresee a need to be able to turn on more detailed
tracing after you've identified problematic areas from a kpageflags-type
overview report. And for that, you almost certainly want a
kmemtrace/tracepoints-style solution with the pid/function/whatever
regexp matching ftrace already provides.
Pekka
* Wu Fengguang <[email protected]> wrote:
> On Tue, Apr 28, 2009 at 08:55:07AM +0200, Ingo Molnar wrote:
> >
> > * Wu Fengguang <[email protected]> wrote:
> >
> > > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> > >
> > > 1) for kernel hackers (on CONFIG_DEBUG_KERNEL)
> > > - all available page flags are exported, and
> > > - exported as is
> > > 2) for admins and end users
> > > - only the more `well known' flags are exported:
> > > 11. KPF_MMAP (pseudo flag) memory mapped page
> > > 12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
> > > 13. KPF_SWAPCACHE page is in swap cache
> > > 14. KPF_SWAPBACKED page is swap/RAM backed
> > > 15. KPF_COMPOUND_HEAD (*)
> > > 16. KPF_COMPOUND_TAIL (*)
> > > 17. KPF_UNEVICTABLE page is in the unevictable LRU list
> > > 18. KPF_HWPOISON hardware detected corruption
> > > 19. KPF_NOPAGE (pseudo flag) no page frame at the address
> > >
> > > (*) For compound pages, exporting _both_ head/tail info enables
> > > users to tell where a compound page starts/ends, and its order.
> > >
> > > - limit flags to their typical usage scenario, as indicated by KOSAKI:
> > > - LRU pages: only export relevant flags
> > > - PG_lru
> > > - PG_unevictable
> > > - PG_active
> > > - PG_referenced
> > > - page_mapped()
> > > - PageAnon()
> > > - PG_swapcache
> > > - PG_swapbacked
> > > - PG_reclaim
> > > - no-IO pages: mask out irrelevant flags
> > > - PG_dirty
> > > - PG_uptodate
> > > - PG_writeback
> > > - SLAB pages: mask out overloaded flags:
> > > - PG_error
> > > - PG_active
> > > - PG_private
> > > - PG_reclaim: mask out the overloaded PG_readahead
> > > - compound flags: only export huge/gigantic pages
> > >
> > > Here are the admin/linus views of all page flags on a newly booted nfs-root system:
> > >
> > > # ./page-types # for admin
> > > flags page-count MB symbolic-flags long-symbolic-flags
> > > 0x000000000000 491174 1918 ____________________________
> > > 0x000000000020 1 0 _____l______________________ lru
> > > 0x000000000028 2543 9 ___U_l______________________ uptodate,lru
> > > 0x00000000002c 5288 20 __RU_l______________________ referenced,uptodate,lru
> > > 0x000000004060 1 0 _____lA_______b_____________ lru,active,swapbacked
> >
> > I think i have to NAK this kind of ad-hoc instrumentation of kernel
> > internals and statistics until we clear up why such instrumentation
> > measures are being accepted into the MM while other, more dynamic
> > and more flexible MM instrumentation are being resisted by Andrew.
>
> An unexpected NAK - to throw away an orange because we are about to have an apple? ;-)
>
> Anyway, here are the missing rationales.
>
> 1) FAST
>
> It takes merely 0.2s to scan 4GB pages:
>
> ./page-types 0.02s user 0.20s system 99% cpu 0.216 total
>
> 2) SIMPLE
>
> /proc/kpageflags will be a *long-standing* hack we have to live
> with - it was originally introduced by Matt for shared memory
> accounting and as a facility to analyze applications' memory
> consumption, with the hope that it would also help kernel developers
> someday.
>
> So why not extend and embrace it, in a straightforward way?
>
> 3) USE CASES
>
> I have/will take advantage of the above page-types command in a number of ways:
> - to help track down memory leaks (the recent trace/ring_buffer.c case)
> - to estimate the system-wide readahead miss ratio
> - Andi wants to examine the major page types in different workloads
> (for the hwpoison work)
> - Me too, for the fun of learning: read/write/lock/whatever a lot of pages
> and examine their flags, to get an idea of some random kernel behaviors.
> (the dynamic tracing tools can be more helpful, as a different view)
>
> 4) COMPLEMENTARITY
>
> In some cases a dynamic tracing tool is not enough (or is too complex)
> to rebuild the current status view.
>
> I myself have a dynamic readahead tracing tool (very useful!). At
> the same time I also use readahead accounting numbers, and the
> /proc/filecache tool (frequently!), and the above page-types tool.
> I simply need them all - they are handy for different cases.
Well, the main counter argument here is that statistics is _derived_
from events. In their simplest form the 'counts' are the integral of
events over time.
So if we capture all interesting events, and do that with low
overhead (and in fact can even collect and integrate them in-kernel,
today), we _dont have_ to maintain various overlapping counters all
around the kernel. This is really a general instrumentation design
observation.
Every time we add yet another /proc hack we splinter Linux
instrumentation, in a hard to reverse way.
So your single-purpose /proc hack could be made multi-purpose and
could help a much broader range of people, with just a little bit of
effort, i believe. Pekka already wrote the page tracking patch, for
example - that would be a good starting point.
Does it mean more work to do? You bet ;-)
Ingo
* Pekka Enberg <[email protected]> wrote:
> I have no idea how expensive tracepoints are but I suspect they
> don't make too much sense for this particular scenario. After all,
> kmemtrace is mainly interested in _allocation patterns_ whereas
> this patch seems to be more interested in "memory layout" type of
> things.
My point is that the allocation patterns can be derived from dynamic
events. We can build a map of everything if we know all the events
that led up to it. Doing:
echo 3 > /proc/sys/vm/drop_caches
will clear 99% of the memory allocations, so we can build a new map
from scratch just about anytime. (and if boot allocations are
interesting they can be traced too)
_And_ via this angle we'll also have access to the dynamic events,
in a different 'view' of the same tracepoints - which is obviously
very useful for different purposes.
Ingo
>
> * Pekka Enberg <[email protected]> wrote:
>
> > I have no idea how expensive tracepoints are but I suspect they
> > don't make too much sense for this particular scenario. After all,
> > kmemtrace is mainly interested in _allocation patterns_ whereas
> > this patch seems to be more interested in "memory layout" type of
> > things.
>
> My point is that the allocation patterns can be derived from dynamic
> events. We can build a map of everything if we know all the events
> that led up to it. Doing:
>
> echo 3 > /proc/sys/vm/drop_caches
>
> will clear 99% of the memory allocations, so we can build a new map
> from scratch just about anytime. (and if boot allocations are
> interesting they can be traced too)
>
> _And_ via this angle we'll also have access to the dynamic events,
> in a different 'view' of the same tracepoints - which is obviously
> very useful for different purposes.
I am one of the people who most strongly wants MM tracepoints.
But no, many customers never permit the use of drop_caches.
I believe this patch and tracepoints are _both_ necessary and useful.
On Tue, Apr 28, 2009 at 05:25:06PM +0800, Pekka Enberg wrote:
> On Tue, 2009-04-28 at 09:40 +0200, Andi Kleen wrote:
> >>> > > I think i have to NAK this kind of ad-hoc instrumentation of kernel
> >>> > > internals and statistics until we clear up why such instrumentation
>
> * Pekka Enberg <[email protected]> wrote:
> >>> > I think because it has zero fast path overhead and can be used
> >>> > any time without enabling anything special.
> >
> > On Tue, Apr 28, 2009 at 12:15 PM, Ingo Molnar <[email protected]> wrote:
> >> ( That's a dubious claim in any case - tracepoints are very cheap.
> >>   And they could be made even cheaper and such efforts would benefit
> >>   all the tracepoint users so it's a prime focus of interest.
> >>   Andi is a SystemTap proponent, right? I saw him oppose pretty much
> >>   everything built-in kernel tracing related. I consider that a
> >>   pretty extreme position. )
>
> On Tue, Apr 28, 2009 at 12:19 PM, Pekka Enberg <[email protected]> wrote:
> > I have no idea how expensive tracepoints are but I suspect they don't
> > make too much sense for this particular scenario. After all, kmemtrace
> > is mainly interested in _allocation patterns_ whereas this patch seems
> > to be more interested in "memory layout" type of things.
>
> That said, I do foresee a need to be able to turn on more detailed
> tracing after you've identified problematic areas from a kpageflags-type
> overview report. And for that, you almost certainly want a
> kmemtrace/tracepoints-style solution with the pid/function/whatever
> regexp matching ftrace already provides.
Exactly - kmemtrace is the tool I looked for when hunting down the
page flags of the leaked ring buffer memory :-)
Thanks,
Fengguang
* Pekka Enberg <[email protected]> wrote:
> > I have no idea how expensive tracepoints are but I suspect they
> > don't make too much sense for this particular scenario. After
> > all, kmemtrace is mainly interested in _allocation patterns_
> > whereas this patch seems to be more interested in "memory
> > layout" type of things.
>
> That said, I do foresee a need to be able to turn on more detailed
> tracing after you've identified problematic areas from a kpageflags-
> type overview report. And for that, you almost certainly want a
> kmemtrace/tracepoints-style solution with the pid/function/whatever
> regexp matching ftrace already provides.
yes. My point is that by having the latter, we pretty much have the
former as well!
I 'integrate' traces all the time to get summary counts. This series
of dynamic events:
allocation
page count up
page count up
page count down
page count up
page count up
page count up
page count up
integrates into: "page count is 6".
Note that "integration" can be done wholly in the kernel too,
without going to the overhead of streaming all dynamic events to
user-space, just to summarize data into counts, in-kernel. That is
what the ftrace statistics framework and various ftrace plugins are
about.
Also, it might make sense to extend the framework with a series of
'get current object state' events when tracing is turned on. A
special case of _that_ would in essence be what the /proc hack does
now - just expressed in a much more generic, and a much more usable
form.
Ingo
* KOSAKI Motohiro <[email protected]> wrote:
> >
> > * Pekka Enberg <[email protected]> wrote:
> >
> > > I have no idea how expensive tracepoints are but I suspect they
> > > don't make too much sense for this particular scenario. After all,
> > > kmemtrace is mainly interested in _allocation patterns_ whereas
> > > this patch seems to be more interested in "memory layout" type of
> > > things.
> >
> > My point is that the allocation patterns can be derived from dynamic
> > events. We can build a map of everything if we know all the events
> > that led up to it. Doing:
> >
> > echo 3 > /proc/sys/vm/drop_caches
> >
> > will clear 99% of the memory allocations, so we can build a new map
> > from scratch just about anytime. (and if boot allocations are
> > interesting they can be traced too)
> >
> > _And_ via this angle we'll also have access to the dynamic events,
> > in a different 'view' of the same tracepoints - which is obviously
> > very useful for different purposes.
>
> I am one of the people who most strongly wants MM tracepoints. But no,
> many customers never permit the use of drop_caches.
See my other mail i just sent: it would be a natural extension of
tracing to also dump all current object state when tracing is turned
on. That way no drop_caches is needed at all.
But it has to be expressed in one framework that cares about the
totality of the kernel - not just these splintered bits of
instrumentation and pieces of statistics.
Ingo
On Tue, Apr 28, 2009 at 05:38:33PM +0800, Ingo Molnar wrote:
>
> * KOSAKI Motohiro <[email protected]> wrote:
>
> > >
> > > * Pekka Enberg <[email protected]> wrote:
> > >
> > > > I have no idea how expensive tracepoints are but I suspect they
> > > > don't make too much sense for this particular scenario. After all,
> > > > kmemtrace is mainly interested in _allocation patterns_ whereas
> > > > this patch seems to be more interested in "memory layout" type of
> > > > things.
> > >
> > > My point is that the allocation patterns can be derived from dynamic
> > > events. We can build a map of everything if we know all the events
> > > that led up to it. Doing:
> > >
> > > echo 3 > /proc/sys/vm/drop_caches
> > >
> > > will clear 99% of the memory allocations, so we can build a new map
> > > from scratch just about anytime. (and if boot allocations are
> > > interesting they can be traced too)
> > >
> > > _And_ via this angle we'll also have access to the dynamic events,
> > > in a different 'view' of the same tracepoints - which is obviously
> > > very useful for different purposes.
> >
> > I am one of the people who most strongly wants MM tracepoints. But no,
> > many customers never permit the use of drop_caches.
>
> See my other mail i just sent: it would be a natural extension of
> tracing to also dump all current object state when tracing is turned
> on. That way no drop_caches is needed at all.
I can understand the merits here - I also did readahead
tracing/accounting in _one_ piece of code. Very handy.
The readahead traces are now raw printks - converting to the ftrace
framework would be a big win.
But. It's still not a fit-all solution. Imagine when full data _since_
booting is required, but the user cannot afford a reboot.
> But it has to be expressed in one framework that cares about the
> totality of the kernel - not just these splintered bits of
> instrumentation and pieces of statistics.
Though minded to push the kpageflags interface, I totally agree with
the above fine principle and discipline :-)
Thanks,
Fengguang
Hi Ingo,
On Tue, Apr 28, 2009 at 12:36 PM, Ingo Molnar <[email protected]> wrote:
> I 'integrate' traces all the time to get summary counts. This series
> of dynamic events:
>
>   allocation
>   page count up
>   page count up
>   page count down
>   page count up
>   page count up
>   page count up
>   page count up
>
> integrates into: "page count is 6".
>
> Note that "integration" can be done wholly in the kernel too,
> without going to the overhead of streaming all dynamic events to
> user-space, just to summarize data into counts, in-kernel. That is
> what the ftrace statistics framework and various ftrace plugins are
> about.
>
> Also, it might make sense to extend the framework with a series of
> 'get current object state' events when tracing is turned on. A
> special case of _that_ would in essence be what the /proc hack does
> now - just expressed in a much more generic, and a much more usable
> form.
I guess the main question here is whether this approach will scale to
something like kmalloc() or the page allocator in production
environments. For any serious workload, the frequency of events is
going to be pretty high.
Pekka
> I guess the main question here is whether this approach will scale to
> something like kmalloc() or the page allocator in production
> environments. For any serious workload, the frequency of events is
> going to be pretty high.
The Immediate Values patch series makes tracepoints zero-overhead
while they are not in use.
So, if we also implement a way to stop collecting statistics, that
restores the zero-overhead world. We don't lose any performance by
tracing.
> > > I am one of the people who most strongly wants MM tracepoints. But no,
> > > many customers never permit the use of drop_caches.
> >
> > See my other mail i just sent: it would be a natural extension of
> > tracing to also dump all current object state when tracing is turned
> > on. That way no drop_caches is needed at all.
>
> I can understand the merits here - I also did readahead
> tracing/accounting in _one_ piece of code. Very handy.
>
> The readahead traces are now raw printks - converting to the ftrace
> framework would be a big win.
>
> But. It's still not a fit-all solution. Imagine when full data _since_
> booting is required, but the user cannot afford a reboot.
>
> > But it has to be expressed in one framework that cares about the
> > totality of the kernel - not just these splintered bits of
> > instrumentation and pieces of statistics.
>
> Though minded to push the kpageflags interface, I totally agree with
> the above fine principle and discipline :-)
Yeah, I totally agree with your claim.
I'm interested in both an ftrace-based readahead tracer and this patch :)
> But it has to be expressed in one framework that cares about the
> totality of the kernel - not just these splintered bits of
Can you perhaps expand a bit on what code that framework would
provide to kpageflags? As far as I can see it only needs
a ->read callback from somewhere, and I admit it's hard for me to see
how that could share much code with anything else.
-Andi
--
[email protected] -- Speaking for myself only.
Hi!
2009/4/28 KOSAKI Motohiro <[email protected]>:
>> I guess the main question here is whether this approach will scale to
>> something like kmalloc() or the page allocator in production
>> environments. For any serious workload, the frequency of events is
>> going to be pretty high.
>
> The Immediate Values patch series makes tracepoints zero-overhead
> while they are not in use.
>
> So, if we also implement a way to stop collecting statistics, that
> restores the zero-overhead world. We don't lose any performance by
> tracing.
Sure but I meant the _enabled_ case here. kmalloc() (and the page
allocator to some extent) is very performance sensitive in many
workloads so you probably don't want to use tracepoints if you're
collecting some overall statistics (i.e. tracing all events) like we
do here.
Pekka
* Pekka Enberg <[email protected]> wrote:
> Hi!
>
> 2009/4/28 KOSAKI Motohiro <[email protected]>:
> >> I guess the main question here is whether this approach will scale to
> >> something like kmalloc() or the page allocator in production
> >> environments. For any serious workload, the frequency of events is
> >> going to be pretty high.
> >
> > The Immediate Values patch series makes tracepoints zero-overhead
> > while they are not in use.
> >
> > So, if we also implement a way to stop collecting statistics, that
> > restores the zero-overhead world. We don't lose any performance by
> > tracing.
>
> Sure but I meant the _enabled_ case here. kmalloc() (and the page
> allocator to some extent) is very performance sensitive in many
> workloads so you probably don't want to use tracepoints if you're
> collecting some overall statistics (i.e. tracing all events) like
> we do here.
That's where 'collect current state' kind of tracepoints would help
- they could be used even without enabling any of the other
tracepoints. And they'd still be in a coherent whole with the
dynamic-events tracepoints.
So i'm not arguing against these techniques at all - and we can move
on a wide scale from zero-overhead to lots-of-tracing-enabled models
- what i'm arguing against is the splintering.
Ingo
* Pekka Enberg <[email protected]> wrote:
> Hi Ingo,
>
> On Tue, Apr 28, 2009 at 12:36 PM, Ingo Molnar <[email protected]> wrote:
> > I 'integrate' traces all the time to get summary counts. This series
> > of dynamic events:
> >
> >   allocation
> >   page count up
> >   page count up
> >   page count down
> >   page count up
> >   page count up
> >   page count up
> >   page count up
> >
> > integrates into: "page count is 6".
> >
> > Note that "integration" can be done wholly in the kernel too,
> > without going to the overhead of streaming all dynamic events to
> > user-space, just to summarize data into counts, in-kernel. That
> > is what the ftrace statistics framework and various ftrace
> > plugins are about.
> >
> > Also, it might make sense to extend the framework with a series
> > of 'get current object state' events when tracing is turned on.
> > A special case of _that_ would in essence be what the /proc hack
> > does now - just expressed in a much more generic, and a much
> > more usable form.
>
> I guess the main question here is whether this approach will scale
> to something like kmalloc() or the page allocator in production
> environments. For any serious workload, the frequency of events is
> going to be pretty high.
it depends on the level of integration. If the integration is done
right at the tracepoint callback, performance overhead will be very
small. If everything is traced and then streamed to user-space then
there is going to be noticeable overhead starting somewhere around a
few hundred thousand events per second per cpu.
Note that the 'get object state' approach i outlined above in the
final paragraph has no runtime overhead at all. As long as 'object
state' only covers fields that we maintain already for normal kernel
functionality, it costs nothing to allow the passive sampling of
that state. The /proc patch is a subset of such a facility in
essence.
Ingo
* Wu Fengguang <[email protected]> wrote:
> > See my other mail i just sent: it would be a natural extension
> > of tracing to also dump all current object state when tracing is
> > turned on. That way no drop_caches is needed at all.
>
> I can understand the merits here - I also did readahead
> tracing/accounting in _one_ piece of code. Very handy.
>
> The readahead traces are now raw printks - converting to the
> ftrace framework would be a big win.
>
> But. It's still not a fit-all solution. Imagine when full data
> _since_ booting is required, but the user cannot afford a reboot.
The above 'get object state' interface (which allows passive
sampling) - integrated into the tracing framework - would serve that
goal, agreed?
Ingo
2009/4/28 Ingo Molnar <[email protected]>:
>
> * Pekka Enberg <[email protected]> wrote:
>
>> Hi!
>>
>> 2009/4/28 KOSAKI Motohiro <[email protected]>:
>> >> I guess the main question here is whether this approach will scale to
>> >> something like kmalloc() or the page allocator in production
>> >> environments. For any serious workload, the frequency of events is
>> >> going to be pretty high.
>> >
>> > The Immediate Values patch series makes tracepoints zero-overhead
>> > while they are not in use.
>> >
>> > So, if we also implement a way to stop collecting statistics, that
>> > restores the zero-overhead world. We don't lose any performance by
>> > tracing.
>>
>> Sure but I meant the _enabled_ case here. kmalloc() (and the page
>> allocator to some extent) is very performance sensitive in many
>> workloads so you probably don't want to use tracepoints if you're
>> collecting some overall statistics (i.e. tracing all events) like
>> we do here.
>
> That's where 'collect current state' kind of tracepoints would help
> - they could be used even without enabling any of the other
> tracepoints. And they'd still be in a coherent whole with the
> dynamic-events tracepoints.
>
> So i'm not arguing against these techniques at all - and we can move
> on a wide scale from zero-overhead to lots-of-tracing-enabled models
> - what i'm arguing against is the splintering.
Hmm, I guess Pekka and you are talking about different things.
If a tracepoint is ON, it makes one function call, but a few hot spots
can't tolerate even one function call of overhead.
Scheduler stats and slab stats are good examples of this, I think.
I really don't want to convert slab_stat and sched_stat to ftrace-based
statistics. Currently they don't need an extra function call and only
touch per-cpu variables, so the overhead is extremely small.
Unfortunately, tracepoints still don't reach that extreme level of
performance.
On Tue, Apr 28, 2009 at 07:05:53PM +0800, Ingo Molnar wrote:
>
> * Wu Fengguang <[email protected]> wrote:
>
> > > See my other mail i just sent: it would be a natural extension
> > > of tracing to also dump all current object state when tracing is
> > > turned on. That way no drop_caches is needed at all.
> >
> > I can understand the merits here - I also did readahead
> > tracing/accounting in _one_ piece of code. Very handy.
> >
> > The readahead traces are now raw printks - converting to the
> > ftrace framework would be a big win.
> >
> > But. It's still not a fit-all solution. Imagine when full data
> > _since_ booting is required, but the user cannot afford a reboot.
>
> The above 'get object state' interface (which allows passive
> sampling) - integrated into the tracing framework - would serve that
> goal, agreed?
Agreed. That could in theory be a good complement to dynamic tracing.
Then what will be the canonical form for all the 'get object state'
interfaces - "object.attr=value", or whatever? I'm afraid we will have
to sacrifice either efficiency or human readability to get a normalized
form. Or should we define two standard forms - one "key value" form and
one "value1 value2 value3..." form?
Thanks,
Fengguang
* Wu Fengguang <[email protected]> wrote:
> > The above 'get object state' interface (which allows passive
> > sampling) - integrated into the tracing framework - would serve
> > that goal, agreed?
>
> Agreed. That could in theory be a good complement to dynamic
> tracing.
>
> Then what will be the canonical form for all the 'get object
> state' interfaces - "object.attr=value", or whatever? [...]
Lemme outline what i'm thinking of.
I'd call the feature "object collection tracing", which would live
in /debug/tracing, accessed via such files:
/debug/tracing/objects/mm/pages/
/debug/tracing/objects/mm/pages/format
/debug/tracing/objects/mm/pages/filter
/debug/tracing/objects/mm/pages/trace_pipe
/debug/tracing/objects/mm/pages/stats
/debug/tracing/objects/mm/pages/events/
here's the (proposed) semantics of those files:
1) /debug/tracing/objects/mm/pages/
There's a subsystem / object basic directory structure to make it
easy and intuitive to find our way around there.
2) /debug/tracing/objects/mm/pages/format
the format file:
/debug/tracing/objects/mm/pages/format
Would reuse the existing dynamic-tracepoint structured-logging
descriptor format and code (this is upstream already):
[root@phoenix sched_signal_send]# pwd
/debug/tracing/events/sched/sched_signal_send
[root@phoenix sched_signal_send]# cat format
name: sched_signal_send
ID: 24
format:
field:unsigned short common_type; offset:0; size:2;
field:unsigned char common_flags; offset:2; size:1;
field:unsigned char common_preempt_count; offset:3; size:1;
field:int common_pid; offset:4; size:4;
field:int common_tgid; offset:8; size:4;
field:int sig; offset:12; size:4;
field:char comm[TASK_COMM_LEN]; offset:16; size:16;
field:pid_t pid; offset:32; size:4;
print fmt: "sig: %d task %s:%d", REC->sig, REC->comm, REC->pid
These format descriptors enumerate fields, types and sizes, in a
structured way that user-space tools can parse easily. (The binary
records that come from the trace_pipe file follow this format
description.)
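To illustrate how easily tools can consume these descriptors, here is a
minimal user-space sketch (purely illustrative, not an existing tool)
that extracts the name/offset/size triples from the quoted format file
with nothing more than sscanf():

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/debug/tracing/events/sched/sched_signal_send/format", "r");
	char line[256];

	if (!f) {
		perror("format");
		return 1;
	}
	/* each field line: "field:<C declaration>; offset:<n>; size:<n>;" */
	while (fgets(line, sizeof(line), f)) {
		char decl[128];
		int offset, size;

		if (sscanf(line, " field:%127[^;]; offset:%d; size:%d;",
			   decl, &offset, &size) == 3)
			printf("%-40s @%-3d (%d bytes)\n", decl, offset, size);
	}
	fclose(f);
	return 0;
}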
3) /debug/tracing/objects/mm/pages/filter
This is the tracing filter that can be set based on the 'format'
descriptor. So with the above (signal-send tracepoint) you can
define such filter expressions:
echo "(sig == 10 && comm == bash) || sig == 13" > filter
To restrict the 'scope' of the object collection along pretty much
any key or combination of keys. (Or you can leave it as it is and
dump all objects and do keying in user-space.)
[ Using in-kernel filtering is obviously faster than streaming it
out to user-space - but there might be details and types of
visualization you want to do in user-space - so we don't want to
restrict things here. ]
For the mm object collection tracepoint i could imagine such filter
expressions:
echo "type == shared && file == /sbin/init" > filter
To dump all shared pages that are mapped to /sbin/init.
4) /debug/tracing/objects/mm/pages/trace_pipe
The 'trace_pipe' file can be used to dump all objects in the
collection, which match the filter ('all objects' by default). The
record format is described in 'format'.
trace_pipe would be a reuse of the existing trace_pipe code: it is a
modern, poll()-able, read()-able, splice()-able pipe abstraction.
5) /debug/tracing/objects/mm/pages/stats
The 'stats' file would be a reuse of the existing histogram code of
the tracing code. We already make use of it for the branch tracers
and for the workqueue tracer - it could be extended to be applicable
to object collections as well.
The advantage there would be that there's no dumping at all - all
the integration is done straight in the kernel. ( The 'filter'
condition is listened to - increasing flexibility. The filter file
could perhaps also act as a default histogram key. )
6) /debug/tracing/objects/mm/pages/events/
The 'events' directory offers links back to existing dynamic
tracepoints that are under /debug/tracing/events/. This would serve
as an additional coherent force that keeps dynamic tracepoints
collected by subsystem and by object type as well. (Tools could make
use of this information as well - without being aware of actual
object semantics.)
There would be a number of other object collections we could
enumerate:
tasks:
/debug/tracing/objects/sched/tasks/
active inodes known to the kernel:
/debug/tracing/objects/fs/inodes/
interrupts:
/debug/tracing/objects/hw/irqs/
etc.
These would use the same 'object collection' framework. Once done we
can use it for many other things too.
Note how organically integrated it all is with the tracing
framework. You could start from an 'object view' to get an overview
and then go towards a more dynamic view of specific object
attributes (or specific objects), as you drill down on a specific
problem you want to analyze.
How does this all sound to you?
Can you see any conceptual holes in the scheme, any use-case that
/proc/kpageflags supports but the object collection approach does
not?
Would you be interested in seeing something like this, if we tried
to implement it in the tracing tree? The majority of the code
already exists, we just need interest from the MM side and we have
to hook it all up. (it is by no means trivial to do - but looks like
a very exciting feature.)
Thanks,
Ingo
* KOSAKI Motohiro <[email protected]> wrote:
> 2009/4/28 Ingo Molnar <[email protected]>:
> >
> > * Pekka Enberg <[email protected]> wrote:
> >
> >> Hi!
> >>
> >> 2009/4/28 KOSAKI Motohiro <[email protected]>:
> >> >> I guess the main question here is whether this approach will scale to
> >> >> something like kmalloc() or the page allocator in production
> >> >> environments. For any serious workload, the frequency of events is
> >> >> going to be pretty high.
> >> >
> >> > Immediate Values patch series makes zero-overhead to tracepoint
> >> > while it's not used.
> >> >
> >> > So, we have to implement a way to stop collecting statistics;
> >> > that restores the zero-overhead world.
> >> > We don't lose any performance by tracing.
> >>
> >> Sure but I meant the _enabled_ case here. kmalloc() (and the page
> >> allocator to some extent) is very performance sensitive in many
> >> workloads so you probably don't want to use tracepoints if you're
> >> collecting some overall statistics (i.e. tracing all events) like
> >> we do here.
> >
> > That's where 'collect current state' kind of tracepoints would help
> > - they could be used even without enabling any of the other
> > tracepoints. And they'd still be in a coherent whole with the
> > dynamic-events tracepoints.
> >
> > So i'm not arguing against these techniques at all - and we can move
> > on a wide scale from zero-overhead to lots-of-tracing-enabled models
> > - what i'm arguing against is the splintering.
>
> umm.
> I guess Pekka and you are talking about different things.
>
> if a tracepoint is ON, it makes one function call. but a few
> hot spots can't tolerate even one function call of overhead.
>
> scheduler stat and slab stat are good examples, I think.
>
> I really don't want to convert slab_stat and sched_stat to
> ftrace-based statistics. Currently they don't need an extra
> function call and only touch per-cpu variables, so the overhead
> is extremely small.
>
> Unfortunately, tracepoints still don't reach this extreme
> performance.
I understand that - please see my "[rfc] object collection tracing"
reply in this thread, for a more detailed description about what i
meant under 'object state tracing'.
Ingo
On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> [...]
> How does this all sound to you?
Great! I see much opportunity to adapt the not-yet-submitted
/proc/filecache interface to the proposed framework.
Its basic form is:
# ino size cached cached% refcnt state age accessed process dev file
[snip]
320 1 4 100 1 D- 50443 1085 udevd 00:11(tmpfs) /.udev/uevent_seqnum
460725 123 124 100 35 -- 50444 6795 touch 08:02(sda2) /lib/libpthread-2.9.so
460727 31 32 100 14 -- 50444 2007 touch 08:02(sda2) /lib/librt-2.9.so
458865 97 80 82 1 -- 50444 49 mount 08:02(sda2) /lib/libdevmapper.so.1.02.1
460090 15 16 100 1 -- 50444 48 mount 08:02(sda2) /lib/libuuid.so.1.2
458866 46 48 100 1 -- 50444 47 mount 08:02(sda2) /lib/libblkid.so.1.0
460732 43 44 100 69 -- 50444 3581 rcS 08:02(sda2) /lib/libnss_nis-2.9.so
460739 87 88 100 73 -- 50444 3597 rcS 08:02(sda2) /lib/libnsl-2.9.so
460726 31 32 100 69 -- 50444 3581 rcS 08:02(sda2) /lib/libnss_compat-2.9.so
458804 250 252 100 11 -- 50445 8175 rcS 08:02(sda2) /lib/libncurses.so.5.6
229540 780 752 96 3 -- 50445 7594 init 08:02(sda2) /bin/bash
460735 15 16 100 89 -- 50445 17581 init 08:02(sda2) /lib/libdl-2.9.so
460721 1344 1340 99 117 -- 50445 48732 init 08:02(sda2) /lib/libc-2.9.so
458801 107 104 97 24 -- 50445 3586 init 08:02(sda2) /lib/libselinux.so.1
671870 37 24 65 1 -- 50446 1 swapper 08:02(sda2) /sbin/init
175 1 24412 100 1 -- 50446 0 swapper 00:01(rootfs) /dev/root
The patch basically does a traversal through one or more of the inode
lists to produce the output:
inode_in_use
inode_unused
sb->s_dirty
sb->s_io
sb->s_more_io
sb->s_inodes
The filtering feature is a necessity for this interface - otherwise a
full listing takes considerable time. It supports the following
filters:
{ LS_OPT_DIRTY, "dirty" },
{ LS_OPT_CLEAN, "clean" },
{ LS_OPT_INUSE, "inuse" },
{ LS_OPT_EMPTY, "empty" },
{ LS_OPT_ALL, "all" },
{ LS_OPT_DEV, "dev=%s" },
There are two possible challenges for the conversion:
- One trick it does is to select different lists to traverse on
different filter options. Will this be possible in the object
tracing framework?
- The file name lookup (last field) is the performance killer. Is it
possible to skip the file name lookup when the filter fails on the
leading fields?
Will the object tracing interface allow such flexibility?
(Sorry I'm not yet familiar with the tracing framework.)
> Can you see any conceptual holes in the scheme, any use-case that
> /proc/kpageflags supports but the object collection approach does
> not?
kpageflags is simply a big (perhaps sparse) binary array.
I'd still prefer to retain its current form - the kernel patches and
user space tools are all ready made, and I see no benefits in
converting to the tracing framework.
> Would you be interested in seeing something like this, if we tried
> to implement it in the tracing tree? The majority of the code
> already exists, we just need interest from the MM side and we have
> to hook it all up. (it is by no means trivial to do - but looks like
> a very exciting feature.)
Definitely! /proc/filecache has another 'page view':
# head /proc/filecache
# file /bin/bash
# flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback X:readahead P:private O:owner b:buffer d:dirty w:writeback
# idx len state refcnt
0 1 RAMU________ 4
3 8 RAMU________ 4
12 1 RAMU________ 4
14 5 RAMU________ 4
20 7 RAMU________ 4
27 2 RAMU________ 5
29 1 RAMU________ 4
That view is also a good candidate. However, I still need to
investigate whether it offers a considerable margin over the
mincore() syscall.
Thanks and Regards,
Fengguang
On Tue, 2009-04-28 at 11:36 +0200, Ingo Molnar wrote:
> I 'integrate' traces all the time to get summary counts. This series
> of dynamic events:
>
> allocation
> page count up
> page count up
> page count down
> page count up
> page count up
> page count up
> page count up
>
> integrates into: "page count is 6".
Perhaps you've failed calculus. The integral is 6 + C.
This is a critical distinction. Tracing is great for looking at changes,
but it completely falls down for static system-wide measurements because
it would require integrating from time=0 to get a meaningful summation.
That's completely useless for taking a measurement on a system that
already has an uptime of months.
Never mind that summing up page flag changes for every page on the
system since boot time through the trace interface is incredibly
wasteful given that we're keeping a per-page integral in the page tables
anyway.
Tracing is not the answer for everything.
--
http://selenic.com : development and support for Mercurial and Linux
On Tue, 2009-04-28 at 09:09 +0800, Wu Fengguang wrote:
> plain text document attachment (kpageflags-extending.patch)
> Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
My only concern with this patch is it knows a bit too much about SLUB
internals (and perhaps not enough about SLOB, which also overloads
flags).
--
http://selenic.com : development and support for Mercurial and Linux
On Tue, Apr 28, 2009 at 1:33 AM, Wu Fengguang <[email protected]> wrote:
> 1) FAST
>
> It takes merely 0.2s to scan 4GB pages:
>
>        ./page-types  0.02s user 0.20s system 99% cpu 0.216 total
OK on a tiny system ... but sounds painful on a big
server. 0.2s for 4G scales up to 3 minutes 25 seconds
on a 4TB system (4TB systems were being sold two
years ago ... so by now the high end will have moved
up to 8TB or perhaps 16TB).
Would the resulting output be anything but noise on
a big system (a *lot* of pages can change state in
3 minutes)?
-Tony
On Tue, 2009-04-28 at 11:11 -0700, Tony Luck wrote:
> On Tue, Apr 28, 2009 at 1:33 AM, Wu Fengguang <[email protected]> wrote:
> > 1) FAST
> >
> > It takes merely 0.2s to scan 4GB pages:
> >
> > ./page-types 0.02s user 0.20s system 99% cpu 0.216 total
>
> OK on a tiny system ... but sounds painful on a big
> server. 0.2s for 4G scales up to 3 minutes 25 seconds
> on a 4TB system (4TB systems were being sold two
> years ago ... so by now the high end will have moved
> up to 8TB or perhaps 16TB).
>
> Would the resulting output be anything but noise on
> a big system (a *lot* of pages can change state in
> 3 minutes)?
Bah. The rate of change is proportional to #cpus, not #pages. Assuming
you've got 1024 processors, you could run the scan in parallel in .2
seconds still.
It won't be an atomic snapshot, obviously. But stopping the whole
machine on a system that size is probably not what you want anyway.
--
http://selenic.com : development and support for Mercurial and Linux
On Tue, Apr 28, 2009 at 11:34 AM, Matt Mackall <[email protected]> wrote:
> Bah. The rate of change is proportional to #cpus, not #pages. Assuming
> you've got 1024 processors, you could run the scan in parallel in .2
> seconds still.
That would help ... it would also make the patch to support this
functionality a lot more complex.
-Tony
On Tue, Apr 28, 2009 at 01:47:07PM -0700, Tony Luck wrote:
> On Tue, Apr 28, 2009 at 11:34 AM, Matt Mackall <[email protected]> wrote:
> > Bah. The rate of change is proportional to #cpus, not #pages. Assuming
> > you've got 1024 processors, you could run the scan in parallel in .2
> > seconds still.
>
> That would help ... it would also make the patch to support this
> functionality a lot more complex.
I suspect 4TB memory users are used to some things running
a little slower. I'm not sure we really need to make every obscure
debugging functionality scale well to these systems too.
-Andi
--
[email protected] -- Speaking for myself only.
On Tue, 2009-04-28 at 13:47 -0700, Tony Luck wrote:
> On Tue, Apr 28, 2009 at 11:34 AM, Matt Mackall <[email protected]> wrote:
> > Bah. The rate of change is proportional to #cpus, not #pages. Assuming
> > you've got 1024 processors, you could run the scan in parallel in .2
> > seconds still.
>
> That would help ... it would also make the patch to support this
> functionality a lot more complex.
The kernel bits should handle this already today. You just need 1k
userspace threads to open /proc/kpageflags, seek() appropriately, and
read().
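Roughly like this per worker - a sketch only, with made-up PFN ranges
and no error handling; pread() stands in for seek()+read(), and bit 5
is KPF_LRU per Documentation/vm/pagemap.txt:

#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct range { uint64_t first, count; };

/* scan one PFN range of /proc/kpageflags (8 bytes per PFN) */
static void *scan(void *arg)
{
    struct range *r = arg;
    uint64_t flags, i, lru = 0;
    int fd = open("/proc/kpageflags", O_RDONLY);

    for (i = 0; i < r->count; i++) {
        if (pread(fd, &flags, 8, (r->first + i) * 8) != 8)
            break;
        if (flags & (1ULL << 5))        /* KPF_LRU */
            lru++;
    }
    close(fd);
    return (void *)(uintptr_t)lru;
}

int main(void)
{
    struct range r[2] = { { 0, 1 << 20 }, { 1 << 20, 1 << 20 } };
    pthread_t t[2];
    uint64_t total = 0;
    void *n;
    int i;

    for (i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, scan, &r[i]);
    for (i = 0; i < 2; i++) {
        pthread_join(t[i], &n);
        total += (uintptr_t)n;
    }
    printf("LRU pages: %llu\n", (unsigned long long)total);
    return 0;
}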
--
http://selenic.com : development and support for Mercurial and Linux
On Tue, 28 Apr 2009 11:11:52 -0700
Tony Luck <[email protected]> wrote:
> On Tue, Apr 28, 2009 at 1:33 AM, Wu Fengguang <[email protected]> wrote:
> > 1) FAST
> >
> > It takes merely 0.2s to scan 4GB pages:
> >
> >        ./page-types  0.02s user 0.20s system 99% cpu 0.216 total
>
> OK on a tiny system ... but sounds painful on a big
> server. 0.2s for 4G scales up to 3 minutes 25 seconds
> on a 4TB system (4TB systems were being sold two
> years ago ... so by now the high end will have moved
> up to 8TB or perhaps 16TB).
>
> Would the resulting output be anything but noise on
> a big system (a *lot* of pages can change state in
> 3 minutes)?
>
Reading the state of all of memory in this fashion would be a somewhat
peculiar thing to do. Bear in mind that kpagemap and friends are also
designed to allow userspace to inspect the state of a particular
process's memory.
Documentation/vm/pagemap.txt describes it nicely:
: The general procedure for using pagemap to find out about a process' memory
: usage goes like this:
:
: 1. Read /proc/pid/maps to determine which parts of the memory space are
: mapped to what.
: 2. Select the maps you are interested in -- all of them, or a particular
: library, or the stack or the heap, etc.
: 3. Open /proc/pid/pagemap and seek to the pages you would like to examine.
: 4. Read a u64 for each page from pagemap.
: 5. Open /proc/kpagecount and/or /proc/kpageflags. For each PFN you just
: read, seek to that entry in the file, and read the data you want.
although I expect that this is not the use case when the feature is
being used to debug/tune readahead.
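(For concreteness, steps 3-5 for a single address boil down to
something like this - a sketch only: no error handling, 4K pages
assumed, the address just an example, and the pagemap entry layout
being PFN in bits 0-54 with bit 63 set when the page is present.)

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    unsigned long vaddr = 0x400000;     /* picked from /proc/pid/maps */
    uint64_t ent, flags, pfn;
    int pm = open("/proc/self/pagemap", O_RDONLY);
    int kf = open("/proc/kpageflags", O_RDONLY);

    /* steps 3+4: seek to this page's entry and read one u64 */
    pread(pm, &ent, 8, (vaddr / 4096) * 8);
    if (ent & (1ULL << 63)) {           /* page present? */
        pfn = ent & ((1ULL << 55) - 1);
        /* step 5: index kpageflags by PFN */
        pread(kf, &flags, 8, pfn * 8);
        printf("pfn %llu flags %#llx\n",
               (unsigned long long)pfn, (unsigned long long)flags);
    }
    close(pm);
    close(kf);
    return 0;
}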
But yes, if you have huge amounts of memory and you decide to write an
application which inspects the state of every physical page in the
machine, you can expect it to take a long time!
Of course, the VM does also accumulate bulk aggregated page statistics
and presents them in /proc/meminfo, /proc/vmstat and probably other
places. These numbers are maintained at runtime and the cost of doing
this is significant.
I don't _think_ there are presently any such counters which are
accumulated simply for instrumentation purposes - the kernel needs to
maintain them anyway for various reasons and it's a simple (and useful)
matter to make them available to userspace.
Generally, I think that pagemap is another of those things where we've
failed on the follow-through. There's a nice and powerful interface
for inspecting the state of a process's VM, but nobody knows about it
and there are no tools for accessing it and nobody is using it.
(Or maybe I'm wrong about that - I expect I'd have bugged Matt about
this and I expect that he'd have done something. Brain failed).
Either way, I think we'd serve the world better if we were to have some
nice little userspace tools which users could use to access this
information. Documentation/vm already has a Makefile!
Fengguang, you mention an executable called "page-types". Perhaps you
could "productise" that sometime?
A model here is Documentation/accounting/getdelays.c - that proved
quite useful and successful in the development of taskstats and I know
that several people are actually using getdelays.c as-is in serious
production environments. If we hadn't provided and maintained that
code in the kernel tree, it's unlikely that taskstats would have proved
as useful to users.
On Tue, 28 Apr 2009 09:09:12 +0800
Wu Fengguang <[email protected]> wrote:
> +/*
> + * Kernel flags are exported faithfully to Linus and his fellow hackers.
> + * Otherwise some details are masked to avoid confusing the end user:
> + * - some kernel flags are completely invisible
> + * - some kernel flags are conditionally invisible on their odd usages
> + */
> +#ifdef CONFIG_DEBUG_KERNEL
> +static inline int genuine_linus(void) { return 1; }
Although he's a fine chap, the use of the "_linus" tag isn't terribly
clear (to me). I think what you're saying here is that this enables
kernel-developer-only features, yes?
If so, perhaps we could come up with an identifier which expresses that
more clearly.
But I'd expect that everyone and all distros enable CONFIG_DEBUG_KERNEL
for _some_ reason, so what's the point?
It is preferable that we always implement the same interface for all
Kconfig settings. If this exposes information which is confusing or
not useful to end-users then so be it - we should be able to cover that
in supporting documentation.
Also, as mentioned in the other email, it would be good if we were to
publish a little userspace app which people can use to access this raw
data. We could give that application an `--i-am-a-kernel-developer'
option!
> +#else
> +static inline int genuine_linus(void) { return 0; }
> +#endif
This isn't an appropriate use of CONFIG_DEBUG_KERNEL.
DEBUG_KERNEL is a Kconfig-only construct which is used to enable _other_
debugging features. The way you've used it here, if the person who is
configuring the kernel wants to enable any other completely-unrelated
debug feature, they have to enable DEBUG_KERNEL first. But when they
do that, they unexpectedly alter the behaviour of pagemap!
There are two other places where CONFIG_DEBUG_KERNEL affects code
generation in .c files: arch/parisc/mm/init.c and
arch/powerpc/kernel/sysfs.c. These are both wrong, and need slapping ;)
> +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit) \
> + do { \
> + if (visible || genuine_linus()) \
> + uflags |= ((kflags >> kbit) & 1) << ubit; \
> + } while (0);
Did this have to be implemented as a macro?
It's bad, because it might or might not reference its argument, so if
someone passes it an expression-with-side-effects, the end result is
unpredictable. A C function is almost always preferable if possible.
> +/* a helper function _not_ intended for more general uses */
> +static inline int page_cap_writeback_dirty(struct page *page)
> +{
> + struct address_space *mapping;
> +
> + if (!PageSlab(page))
> + mapping = page_mapping(page);
> + else
> + mapping = NULL;
> +
> + return mapping && mapping_cap_writeback_dirty(mapping);
> +}
If the page isn't locked then page->mapping can be concurrently removed
and freed. This actually happened to me in real-life testing several
years ago.
On Tue, 2009-04-28 at 14:17 -0700, Andrew Morton wrote:
> [...]
>
> Reading the state of all of memory in this fashion would be a somewhat
> peculiar thing to do.
Not entirely. If you've got, say, a large NUMA box, it could be
incredibly illustrative to see that "oh, this node is entirely dominated
by SLAB allocations". Or on a smaller machine "oh, this is fragmented to
hell and there's no way I'm going to get a huge page". Things you're not
going to get from individual stats.
> Generally, I think that pagemap is another of those things where we've
> failed on the follow-through. There's a nice and powerful interface
> for inspecting the state of a process's VM, but nobody knows about it
> and there are no tools for accessing it and nobody is using it.
People keep finding bugs in the thing by exercising it in new ways, so I
presume people are writing their own tools. My hope was that my original
tools would inspire someone to take it and run with it - I really have
no stomach for writing GUI tools.
However, I've recently gone and written a pretty generically useful
command-line tool that hopefully will get more traction:
http://www.selenic.com/smem/
I'm expecting it to get written up on LWN shortly, so I haven't spent
much time doing my own advertising.
--
http://selenic.com : development and support for Mercurial and Linux
On Tue, 2009-04-28 at 14:32 -0700, Andrew Morton wrote:
> > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit) \
> > + do { \
> > + if (visible || genuine_linus()) \
> > + uflags |= ((kflags >> kbit) & 1) << ubit; \
> > + } while (0);
>
> Did this have to be implemented as a macro?
I'm mostly to blame for that. I seem to recall the optimizer doing a
better job on this as a macro.
> It's bad, because it might or might not reference its argument, so if
> someone passes it an expression-with-side-effects, the end result is
> unpredictable. A C function is almost always preferable if possible.
I don't think there's any use case for it outside of its one user?
> > +/* a helper function _not_ intended for more general uses */
> > +static inline int page_cap_writeback_dirty(struct page *page)
> > +{
> > + struct address_space *mapping;
> > +
> > + if (!PageSlab(page))
> > + mapping = page_mapping(page);
> > + else
> > + mapping = NULL;
> > +
> > + return mapping && mapping_cap_writeback_dirty(mapping);
> > +}
>
> If the page isn't locked then page->mapping can be concurrently removed
> and freed. This actually happened to me in real-life testing several
> years ago.
We certainly don't want to be taking locks per page to build the flags
data here. As we don't have any pretense of being atomic, it's ok if we
can find a way to do the test that's inaccurate when a race occurs, so
long as it doesn't dereference null.
But if there's not an obvious way to do that, we should probably just
drop this flag bit for this iteration.
--
http://selenic.com : development and support for Mercurial and Linux
On Tue, 28 Apr 2009 17:46:34 -0500
Matt Mackall <[email protected]> wrote:
> > > +/* a helper function _not_ intended for more general uses */
> > > +static inline int page_cap_writeback_dirty(struct page *page)
> > > +{
> > > + struct address_space *mapping;
> > > +
> > > + if (!PageSlab(page))
> > > + mapping = page_mapping(page);
> > > + else
> > > + mapping = NULL;
> > > +
> > > + return mapping && mapping_cap_writeback_dirty(mapping);
> > > +}
> >
> > If the page isn't locked then page->mapping can be concurrently removed
> > and freed. This actually happened to me in real-life testing several
> > years ago.
>
> We certainly don't want to be taking locks per page to build the flags
> data here. As we don't have any pretense of being atomic, it's ok if we
> can find a way to do the test that's inaccurate when a race occurs, so
> long as it doesn't dereference null.
>
> But if there's not an obvious way to do that, we should probably just
> drop this flag bit for this iteration.
trylock_page() could be used here, perhaps.
Then again, why _not_ just do lock_page()? After all, few pages are
ever locked. There will be latency if the caller stumbles across a
page which is under read I/O, but so be it?
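Something like this, perhaps - a sketch only, which simply reports
"no" whenever the trylock fails, so the result can be wrong under
contention:

static int page_cap_writeback_dirty(struct page *page)
{
    struct address_space *mapping;
    int ret;

    /* hold the page lock so page->mapping can't be freed under us;
     * claim "no" if the lock is contended */
    if (PageSlab(page) || !trylock_page(page))
        return 0;
    mapping = page_mapping(page);
    ret = mapping && mapping_cap_writeback_dirty(mapping);
    unlock_page(page);
    return ret;
}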
On Tue, 2009-04-28 at 16:02 -0700, Andrew Morton wrote:
> [...]
>
> trylock_page() could be used here, perhaps.
>
> Then again, why _not_ just do lock_page()? After all, few pages are
> ever locked. There will be latency if the caller stumbles across a
> page which is under read I/O, but so be it?
As I mentioned just a bit ago, it's really not an unreasonable use case
to want to do this on every page in the system back to back. So per page
overhead matters. And the odds of stalling on a locked page when
visiting 1M pages while under load are probably not negligible.
Our lock primitives are pretty low overhead in the fast path, but every
cycle counts. The new tests and branches this code already adds are a
bit worrisome, but on balance probably worth it.
--
http://selenic.com : development and support for Mercurial and Linux
On Tue, 28 Apr 2009 18:31:09 -0500
Matt Mackall <[email protected]> wrote:
> [...]
> >
> > trylock_page() could be used here, perhaps.
> >
> > Then again, why _not_ just do lock_page()? After all, few pages are
> > ever locked. There will be latency if the caller stumbles across a
> > page which is under read I/O, but so be it?
>
> As I mentioned just a bit ago, it's really not an unreasonable use case
> to want to do this on every page in the system back to back. So per page
> overhead matters. And the odds of stalling on a locked page when
> visiting 1M pages while under load are probably not negligible.
The chances of stalling on a locked page are pretty good, and the
duration of the stall might be long indeed. Perhaps a trylock is a
decent compromise - it depends on the value of this metric, and I've
forgotten what we're talking about ;)
umm, seems that this flag is needed to enable PG_error, PG_dirty,
PG_uptodate and PG_writeback reporting. So simply removing this code
would put a huge hole in the patchset, no?
> Our lock primitives are pretty low overhead in the fast path, but every
> cycle counts. The new tests and branches this code already adds are a
> bit worrisome, but on balance probably worth it.
That should be easy to quantify (hint).
On Tue, 2009-04-28 at 16:42 -0700, Andrew Morton wrote:
> [...]
>
> The chances of stalling on a locked page are pretty good, and the
> duration of the stall might be long indeed. Perhaps a trylock is a
> decent compromise - it depends on the value of this metric, and I've
> forgotten what we're talking about ;)
>
> umm, seems that this flag is needed to enable PG_error, PG_dirty,
> PG_uptodate and PG_writeback reporting. So simply removing this code
> would put a huge hole in the patchset, no?
We can report those bits anyway. But this patchset does something
clever: it filters irrelevant (and possibly overloaded) bits in various
contexts.
> > Our lock primitives are pretty low overhead in the fast path, but every
> > cycle counts. The new tests and branches this code already adds are a
> > bit worrisome, but on balance probably worth it.
>
> That should be easy to quantify (hint).
I'll let Fengguang address both these points.
--
http://selenic.com : development and support for Mercurial and Linux
On Tue, Apr 28, 2009 at 04:49:55PM -0500, Matt Mackall wrote:
> > Reading the state of all of memory in this fashion would be a somewhat
> > peculiar thing to do.
>
> Not entirely. If you've got, say, a large NUMA box, it could be
> incredibly illustrative to see that "oh, this node is entirely dominated
> by SLAB allocations". Or on a smaller machine "oh, this is fragmented to
> hell and there's no way I'm going to get a huge page". Things you're not
> going to get from individual stats.
I have, in the past, simply used grep on
/sys/devices/system/node/node*/meminfo and gotten the individual stats
I was concerned about. Not sure how much more detail would have been
needed or useful. I don't think I can recall a time where I needed to
write another tool.
Thanks,
Robin
On Wed, Apr 29, 2009 at 05:32:44AM +0800, Andrew Morton wrote:
> On Tue, 28 Apr 2009 09:09:12 +0800
> Wu Fengguang <[email protected]> wrote:
>
> > +/*
> > + * Kernel flags are exported faithfully to Linus and his fellow hackers.
> > + * Otherwise some details are masked to avoid confusing the end user:
> > + * - some kernel flags are completely invisible
> > + * - some kernel flags are conditionally invisible on their odd usages
> > + */
> > +#ifdef CONFIG_DEBUG_KERNEL
> > +static inline int genuine_linus(void) { return 1; }
>
> Although he's a fine chap, the use of the "_linus" tag isn't terribly
> clear (to me). I think what you're saying here is that this enables
> kernel-developer-only features, yes?
Yes.
> If so, perhaps we could come up with an identifier which expresses that
> more clearly.
>
> But I'd expect that everyone and all distros enable CONFIG_DEBUG_KERNEL
> for _some_ reason, so what's the point?
Good point! I can confirm my debian has CONFIG_DEBUG_KERNEL=Y!
> It is preferable that we always implement the same interface for all
> Kconfig settings. If this exposes information which is confusing or
> not useful to end-users then so be it - we should be able to cover that
> in supporting documentation.
My original patch takes that straightforward approach - and I still like it.
I would be very glad to move the filtering code from kernel to user space.
The use of more obscure flags could be discouraged by _not_ documenting
them. A really curious user is encouraged to refer to the code for the
exact meaning (and perhaps become a kernel developer ;-)
> Also, as mentioned in the other email, it would be good if we were to
> publish a little userspace app which people can use to access this raw
> data. We could give that application an `--i-am-a-kernel-developer'
> option!
OK. I'll include page-types.c in the next take.
> > +#else
> > +static inline int genuine_linus(void) { return 0; }
> > +#endif
>
> This isn't an appropriate use of CONFIG_DEBUG_KERNEL.
>
> DEBUG_KERNEL is a Kconfig-only construct which is used to enable _other_
> debugging features. The way you've used it here, if the person who is
> configuring the kernel wants to enable any other completely-unrelated
> debug feature, they have to enable DEBUG_KERNEL first. But when they
> do that, they unexpectedly alter the behaviour of pagemap!
>
> There are two other places where CONFIG_DEBUG_KERNEL affects code
> generation in .c files: arch/parisc/mm/init.c and
> arch/powerpc/kernel/sysfs.c. These are both wrong, and need slapping ;)
(add cc to related maintainers)
CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means
#ifdef CONFIG_DEBUG_KERNEL == #if 1
as the following patch demonstrates. Now it becomes obviously silly.
diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index 4356ceb..59fb910 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -368,19 +368,19 @@ static void __init setup_bootmem(void)
request_resource(&sysram_resources[0], &pdcdata_resource);
}
void free_initmem(void)
{
unsigned long addr, init_begin, init_end;
printk(KERN_INFO "Freeing unused kernel memory: ");
-#ifdef CONFIG_DEBUG_KERNEL
+#if 1
/* Attempt to catch anyone trying to execute code here
* by filling the page with BRK insns.
*
* If we disable interrupts for all CPUs, then IPI stops working.
* Kinda breaks the global cache flushing.
*/
local_irq_disable();
memset(__init_begin, 0x00,
@@ -519,19 +519,19 @@ void __init mem_init(void)
printk(KERN_INFO "Memory: %luk/%luk available (%dk kernel code, %dk reserved, %dk data, %dk init)\n",
(unsigned long)nr_free_pages() << (PAGE_SHIFT-10),
num_physpages << (PAGE_SHIFT-10),
codesize >> 10,
reservedpages << (PAGE_SHIFT-10),
datasize >> 10,
initsize >> 10
);
-#ifdef CONFIG_DEBUG_KERNEL /* double-sanity-check paranoia */
+#if 1 /* double-sanity-check paranoia */
printk("virtual kernel memory layout:\n"
" vmalloc : 0x%p - 0x%p (%4ld MB)\n"
" memory : 0x%p - 0x%p (%4ld MB)\n"
" .init : 0x%p - 0x%p (%4ld kB)\n"
" .data : 0x%p - 0x%p (%4ld kB)\n"
" .text : 0x%p - 0x%p (%4ld kB)\n",
(void*)VMALLOC_START, (void*)VMALLOC_END,
(VMALLOC_END - VMALLOC_START) >> 20,
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index f41aec8..0d54c6b 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -212,19 +212,19 @@ static SYSDEV_ATTR(purr, 0600, show_purr, store_purr);
#endif /* CONFIG_PPC64 */
#ifdef HAS_PPC_PMC_PA6T
SYSFS_PMCSETUP(pa6t_pmc0, SPRN_PA6T_PMC0);
SYSFS_PMCSETUP(pa6t_pmc1, SPRN_PA6T_PMC1);
SYSFS_PMCSETUP(pa6t_pmc2, SPRN_PA6T_PMC2);
SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
-#ifdef CONFIG_DEBUG_KERNEL
+#if 1
SYSFS_PMCSETUP(hid0, SPRN_HID0);
SYSFS_PMCSETUP(hid1, SPRN_HID1);
SYSFS_PMCSETUP(hid4, SPRN_HID4);
SYSFS_PMCSETUP(hid5, SPRN_HID5);
SYSFS_PMCSETUP(ima0, SPRN_PA6T_IMA0);
SYSFS_PMCSETUP(ima1, SPRN_PA6T_IMA1);
SYSFS_PMCSETUP(ima2, SPRN_PA6T_IMA2);
SYSFS_PMCSETUP(ima3, SPRN_PA6T_IMA3);
SYSFS_PMCSETUP(ima4, SPRN_PA6T_IMA4);
@@ -282,19 +282,19 @@ static struct sysdev_attribute classic_pmc_attrs[] = {
static struct sysdev_attribute pa6t_attrs[] = {
_SYSDEV_ATTR(mmcr0, 0600, show_mmcr0, store_mmcr0),
_SYSDEV_ATTR(mmcr1, 0600, show_mmcr1, store_mmcr1),
_SYSDEV_ATTR(pmc0, 0600, show_pa6t_pmc0, store_pa6t_pmc0),
_SYSDEV_ATTR(pmc1, 0600, show_pa6t_pmc1, store_pa6t_pmc1),
_SYSDEV_ATTR(pmc2, 0600, show_pa6t_pmc2, store_pa6t_pmc2),
_SYSDEV_ATTR(pmc3, 0600, show_pa6t_pmc3, store_pa6t_pmc3),
_SYSDEV_ATTR(pmc4, 0600, show_pa6t_pmc4, store_pa6t_pmc4),
_SYSDEV_ATTR(pmc5, 0600, show_pa6t_pmc5, store_pa6t_pmc5),
-#ifdef CONFIG_DEBUG_KERNEL
+#if 1
_SYSDEV_ATTR(hid0, 0600, show_hid0, store_hid0),
_SYSDEV_ATTR(hid1, 0600, show_hid1, store_hid1),
_SYSDEV_ATTR(hid4, 0600, show_hid4, store_hid4),
_SYSDEV_ATTR(hid5, 0600, show_hid5, store_hid5),
_SYSDEV_ATTR(ima0, 0600, show_ima0, store_ima0),
_SYSDEV_ATTR(ima1, 0600, show_ima1, store_ima1),
_SYSDEV_ATTR(ima2, 0600, show_ima2, store_ima2),
_SYSDEV_ATTR(ima3, 0600, show_ima3, store_ima3),
_SYSDEV_ATTR(ima4, 0600, show_ima4, store_ima4),
> > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit) \
> > + do { \
> > + if (visible || genuine_linus()) \
> > + uflags |= ((kflags >> kbit) & 1) << ubit; \
> > + } while (0);
>
> Did this have to be implemented as a macro?
>
> It's bad, because it might or might not reference its argument, so if
> someone passes it an expression-with-side-effects, the end result is
> unpredictable. A C function is almost always preferable if possible.
Just tried an inline function; the code size increased slightly:
text data bss dec hex filename
macro 1804 128 0 1932 78c fs/proc/page.o
inline 1828 128 0 1956 7a4 fs/proc/page.o
So I'll keep the macro, but add brackets to make it a bit safer.
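i.e. something along these lines (just a sketch of what I mean - every
argument parenthesized, and the stray semicolon after the while (0)
dropped):

#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit)            \
    do {                                                              \
        if ((visible) || genuine_linus())                             \
            (uflags) |= (((kflags) >> (kbit)) & 1) << (ubit);         \
    } while (0)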
Thanks,
Fengguang
On Wed, 29 Apr 2009 10:38:42 +0800 Wu Fengguang <[email protected]> wrote:
> > > +#define kpf_copy_bit(uflags, kflags, visible, ubit, kbit) \
> > > + do { \
> > > + if (visible || genuine_linus()) \
> > > + uflags |= ((kflags >> kbit) & 1) << ubit; \
> > > + } while (0);
> >
> > Did this have to be implemented as a macro?
> >
> > It's bad, because it might or might not reference its argument, so if
> > someone passes it an expression-with-side-effects, the end result is
> > unpredictable. A C function is almost always preferable if possible.
>
> Just tried an inline function; the code size increased slightly:
>
> text data bss dec hex filename
> macro 1804 128 0 1932 78c fs/proc/page.o
> inline 1828 128 0 1956 7a4 fs/proc/page.o
>
hm, I wonder why. Maybe it fixed a bug ;)
The code is effectively doing
if (expr1)
something();
if (expr1)
something_else();
if (expr1)
something_else2();
etc. Obviously we _hope_ that the compiler turns that into
if (expr1) {
something();
something_else();
something_else2();
}
for us, but it would be good to check...
On Wed, Apr 29, 2009 at 07:55:10AM +0800, Matt Mackall wrote:
> On Tue, 2009-04-28 at 16:42 -0700, Andrew Morton wrote:
> [...]
> > The chances of stalling on a locked page are pretty good, and the
> > duration of the stall might be long indeed. Perhaps a trylock is a
> > decent compromise - it depends on the value of this metric, and I've
> > forgotten what we're talking about ;)
> >
> > umm, seems that this flag is needed to enable PG_error, PG_dirty,
> > PG_uptodate and PG_writeback reporting. So simply removing this code
> > would put a huge hole in the patchset, no?
>
> We can report those bits anyway. But this patchset does something
> clever: it filters irrelevant (and possibly overloaded) bits in various
> contexts.
>
> > > Our lock primitives are pretty low overhead in the fast path, but every
> > > cycle counts. The new tests and branches this code already adds are a
> > > bit worrisome, but on balance probably worth it.
> >
> > That should be easy to quantify (hint).
>
> I'll let Fengguang address both these points.
A quick micro-benchmark: 100 runs on another T7300@2GHz 2GB laptop:
                 user    system     total
    no lock     0.270    22.850    23.607
    trylock     0.310    25.890    26.484
                        +13.3%    +12.2%
But anyway, the plan is to move the filtering to user space and
eliminate the complex kernel logic.
The IO filtering won't be possible in user space, but I didn't see the
error/dirty/writeback bits on this testing system. So I guess it won't
be a big loss.
The huge/gigantic page filtering is also not possible in user space,
so I'm inclined to add a KPF_HUGE flag to distinguish (hardware
supported) huge pages from normal (software) compound pages. Any
objections?
Thanks,
Fengguang
On Wed, Apr 29, 2009 at 10:55:27AM +0800, Andrew Morton wrote:
> On Wed, 29 Apr 2009 10:38:42 +0800 Wu Fengguang <[email protected]> wrote:
>
> [...]
>
> hm, I wonder why. Maybe it fixed a bug ;)
>
> The code is effectively doing
>
> if (expr1)
> something();
> if (expr1)
> something_else();
> if (expr1)
> something_else2();
>
> etc. Obviously we _hope_ that the compiler turns that into
>
> if (expr1) {
> something();
> something_else();
> something_else2();
> }
>
> for us, but it would be good to check...
By 'expr1', you mean (visible || genuine_linus())?
No, I can confirm the inefficiency does not lie here.
I simplified the kpf_copy_bit() to
#define kpf_copy_bit(uflags, kflags, ubit, kbit) \
uflags |= (((kflags) >> (kbit)) & 1) << (ubit);
or
static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
{
return (((kflags) >> (kbit)) & 1) << (ubit);
}
and double checked the differences: the gap grows unexpectedly!
text data bss dec hex filename
macro 1829 168 0 1997 7cd fs/proc/page.o
inline 1893 168 0 2061 80d fs/proc/page.o
+3.5%
(note: the larger absolute text size is due to some experimental code elsewhere.)
Thanks,
Fengguang
Wu Fengguang <[email protected]> writes:
> On Wed, Apr 29, 2009 at 05:32:44AM +0800, Andrew Morton wrote:
> [...]
>> If so, perhaps we could come up with an identifier which expresses that
>> more clearly.
>>
>> But I'd expect that everyone and all distros enable CONFIG_DEBUG_KERNEL
>> for _some_ reason, so what's the point?
At the least, it has not always been so...
>
> Good point! I can confirm my debian has CONFIG_DEBUG_KERNEL=Y!
I can confirm mine does not.
etch-i386:~# uname -a
Linux etch-i386 2.6.18-6-686 #1 SMP Fri Dec 12 16:48:28 UTC 2008 i686 GNU/Linux
etch-i386:~# grep DEBUG_KERNEL /boot/config-2.6.18-6-686
# CONFIG_DEBUG_KERNEL is not set
For what that's worth.
> [...]
>> This isn't an appropriate use of CONFIG_DEBUG_KERNEL.
>>
>> DEBUG_KERNEL is a Kconfig-only construct which is used to enable _other_
>> debugging features. The way you've used it here, if the person who is
>> configuring the kernel wants to enable any other completely-unrelated
>> debug feature, they have to enable DEBUG_KERNEL first. But when they
>> do that, they unexpectedly alter the behaviour of pagemap!
>>
>> There are two other places where CONFIG_DEBUG_KERNEL affects code
>> generation in .c files: arch/parisc/mm/init.c and
>> arch/powerpc/kernel/sysfs.c. These are both wrong, and need slapping ;)
>
> (add cc to related maintainers)
I assume I was cc'd because I've changed arch/powerpc/kernel/sysfs.c a
couple of times in the last year, but I can't claim to maintain that
code. I'm pretty sure I haven't touched the code in question in this
discussion. I've cc'd linuxppc-dev.
> CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means
>
> #ifdef CONFIG_DEBUG_KERNEL == #if 1
>
> as the following patch demos. Now it becomes obviously silly.
Sure, #if 1 is usually silly. But if the point is that DEBUG_KERNEL is
not supposed to directly affect code generation, then I see two options
for powerpc:
- remove the #ifdef CONFIG_DEBUG_KERNEL guards from
arch/powerpc/kernel/sysfs.c, unconditionally enabling the hid/ima
sysfs attributes, or
- define a new config symbol which governs whether those attributes are
enabled, and make it depend on DEBUG_KERNEL
> [...]
On Tue, 28 Apr 2009 23:41:52 -0500 Nathan Lynch <[email protected]> wrote:
> > CONFIG_DEBUG_KERNEL being enabled in distro kernels effectively means
> >
> > #ifdef CONFIG_DEBUG_KERNEL == #if 1
> >
> > as the following patch demos. Now it becomes obviously silly.
>
> Sure, #if 1 is usually silly. But if the point is that DEBUG_KERNEL is
> not supposed to directly affect code generation, then I see two options
> for powerpc:
>
> - remove the #ifdef CONFIG_DEBUG_KERNEL guards from
> arch/powerpc/kernel/sysfs.c, unconditionally enabling the hid/ima
> sysfs attributes, or
>
> - define a new config symbol which governs whether those attributes are
> enabled, and make it depend on DEBUG_KERNEL
yup.
On Wed, Apr 29, 2009 at 11:48:29AM +0800, Wu Fengguang wrote:
> [...]
> I simplified the kpf_copy_bit() to
>
> #define kpf_copy_bit(uflags, kflags, ubit, kbit) \
> uflags |= (((kflags) >> (kbit)) & 1) << (ubit);
>
> or
>
> static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
> {
> return (((kflags) >> (kbit)) & 1) << (ubit);
> }
>
> and double-checked the differences: the gap grows unexpectedly!
>
> text data bss dec hex filename
> macro 1829 168 0 1997 7cd fs/proc/page.o
> inline 1893 168 0 2061 80d fs/proc/page.o
> +3.5%
>
> (note: the larger absolute text size is due to some experimental code elsewhere.)
Wow, after the simplifications the text size goes down by 13.2%:
text data bss dec hex filename
macro 1644 8 0 1652 674 fs/proc/page.o
inline 1644 8 0 1652 674 fs/proc/page.o
Amazingly, we can now use the inline function without any code-size penalty!
Thanks,
Fengguang
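For context, the simplified helper ends up being used roughly like this (a
sketch following the names in this thread; the wrapper function below is
illustrative and the exact committed code may differ):

static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
{
	return ((kflags >> kbit) & 1) << ubit;
}

/* Build the user-visible flags word one bit at a time. */
static u64 get_uflags(struct page *page)
{
	u64 k = page->flags;
	u64 u = 0;

	u |= kpf_copy_bit(k, KPF_LOCKED,     PG_locked);
	u |= kpf_copy_bit(k, KPF_REFERENCED, PG_referenced);
	u |= kpf_copy_bit(k, KPF_UPTODATE,   PG_uptodate);
	/* ... one line per exported flag ... */
	return u;
}

With every call now unconditional, the compiler can fold the whole sequence
into straight-line shift-and-mask code, which is presumably why the macro
and inline versions compile to identical text.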
On Wed, Apr 29, 2009 at 01:49:21AM +0800, Matt Mackall wrote:
> On Tue, 2009-04-28 at 09:09 +0800, Wu Fengguang wrote:
> > plain text document attachment (kpageflags-extending.patch)
> > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
>
> My only concern with this patch is it knows a bit too much about SLUB
> internals (and perhaps not enough about SLOB, which also overloads
> flags).
Yup. PG_private=PG_slob_free is not masked because SLOB actually does
not set PG_slab at all. I wonder if it's safe to make this change:
/* SLOB */
- PG_slob_page = PG_active,
+ PG_slob_page = PG_slab,
PG_slob_free = PG_private,
In the page-types output:
flags page-count MB symbolic-flags long-symbolic-flags
0x000800000040 7113 27 ______A_________________P____ active,private
0x000000000040 66 0 ______A______________________ active
The above two lines are obviously for SLOB pages. They indicate lots of
free SLOB pages. So my question is:
- Do you have other means to get the nr_free_slobs info? (I found none in the code)
or
- Will exporting the SL*B overloaded flags help?
Thanks,
Fengguang
On Wed, 2009-04-29 at 16:05 +0800, Wu Fengguang wrote:
> On Wed, Apr 29, 2009 at 01:49:21AM +0800, Matt Mackall wrote:
> > On Tue, 2009-04-28 at 09:09 +0800, Wu Fengguang wrote:
> > > plain text document attachment (kpageflags-extending.patch)
> > > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> >
> > My only concern with this patch is it knows a bit too much about SLUB
> > internals (and perhaps not enough about SLOB, which also overloads
> > flags).
>
> Yup. PG_private=PG_slob_free is not masked because SLOB actually does
> not set PG_slab at all. I wonder if it's safe to make this change:
>
> /* SLOB */
> - PG_slob_page = PG_active,
> + PG_slob_page = PG_slab,
> PG_slob_free = PG_private,
Yep.
> In the page-types output:
>
> flags page-count MB symbolic-flags long-symbolic-flags
> 0x000800000040 7113 27 ______A_________________P____ active,private
> 0x000000000040 66 0 ______A______________________ active
>
> The above two lines are obviously for SLOB pages. They indicate lots of
> free SLOB pages. So my question is:
Free here just means partially allocated.
> - Do you have other means to get the nr_free_slobs info? (I found none in the code)
> or
> - Will exporting the SL*B overloaded flags help?
Yes, it's useful.
--
http://selenic.com : development and support for Mercurial and Linux
On Thu, Apr 30, 2009 at 03:13:56AM +0800, Matt Mackall wrote:
> On Wed, 2009-04-29 at 16:05 +0800, Wu Fengguang wrote:
> > On Wed, Apr 29, 2009 at 01:49:21AM +0800, Matt Mackall wrote:
> > > On Tue, 2009-04-28 at 09:09 +0800, Wu Fengguang wrote:
> > > > plain text document attachment (kpageflags-extending.patch)
> > > > Export 9 page flags in /proc/kpageflags, and 8 more for kernel developers.
> > >
> > > My only concern with this patch is it knows a bit too much about SLUB
> > > internals (and perhaps not enough about SLOB, which also overloads
> > > flags).
> >
> > Yup. PG_private=PG_slob_free is not masked because SLOB actually does
> > not set PG_slab at all. I wonder if it's safe to make this change:
> >
> > /* SLOB */
> > - PG_slob_page = PG_active,
> > + PG_slob_page = PG_slab,
> > PG_slob_free = PG_private,
>
> Yep.
OK. I'll do it - for consistency.
> > In the page-types output:
> >
> > flags page-count MB symbolic-flags long-symbolic-flags
> > 0x000800000040 7113 27 ______A_________________P____ active,private
> > 0x000000000040 66 0 ______A______________________ active
> >
> > The above two lines are obviously for SLOB pages. They indicate lots of
> > free SLOB pages. So my question is:
>
> Free here just means partially allocated.
Yes, I realized this when lying in bed ;-)
> > - Do you have other means to get the nr_free_slobs info? (I found none in the code)
> > or
> > - Will exporting the SL*B overloaded flags help?
>
> Yes, it's useful.
Thank you. SLUB and SLOB overload different page flags, so it's possible
for user-space tools to restore their real meanings - ugly but useful.
Thanks,
Fengguang
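As a closing illustration of that last point, a user-space tool reading
/proc/kpageflags could recover the SL*B meanings along these lines. This is
a sketch only: KPF_SLAB is the documented bit 7, but the KPF_PRIVATE
position is an assumption here, the SLOB interpretation assumes the
PG_slob_page = PG_slab change discussed above, and the tool still has to
know out of band whether the kernel was built with SLUB or SLOB.

#include <stdint.h>
#include <stdbool.h>

#define KPF_SLAB	7	/* documented /proc/kpageflags bit */
#define KPF_PRIVATE	35	/* assumed position in the kernel-hacker range */

static bool kpf_test(uint64_t kpf, int bit)
{
	return (kpf >> bit) & 1;
}

/* With PG_slob_page = PG_slab, a SLOB-managed page shows up as KPF_SLAB,
 * just like a SLUB page would on a SLUB kernel. */
static bool slob_page(uint64_t kpf)
{
	return kpf_test(kpf, KPF_SLAB);
}

/* The overloaded PG_slob_free surfaces as KPF_PRIVATE, so "slab + private"
 * on a SLOB kernel means a page that still has free space left - what the
 * thread above calls a "free" (i.e. partially allocated) SLOB page. */
static bool slob_partially_free(uint64_t kpf)
{
	return slob_page(kpf) && kpf_test(kpf, KPF_PRIVATE);
}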