LinuxLists.cc - 10.31 second kernel compile

2002-03-13 10:12:47

Subject: 10.31 second kernel compile

Let the kernel compile benchmarks continue!

hardware: 24 way logical partition, 1.1GHz POWER4, 60G RAM

kernel: 2.5.6 + ppc64 pagetable rework

kernel compiled: 2.4.18 x86 with Martin's config

compiler: gcc 2.95.3 x86 cross compiler

# MAKE="make -j14" /usr/bin/time make -j14 bzImage
...
make[1]: Leaving directory `/home/anton/intel_kernel/linux/arch/i386/boot'
130.63user 71.31system 0:10.31elapsed 1957%CPU (0avgtext+0avgdata 0maxresident)k

Due to the final link and compress stage, there is a fair amount of idle
time at the end of the run. Its going to be hard to push that number
lower by adding cpus.

The profile results below show that kernel time is dominated by the low
level ppc64 pagetable management. We are working to correct this, a lot
of the overhead in __hash_page should be gone soon. The rest of the
profile looks pretty good, do_anonymous_page and lru_cache_add show
up high as they did in Martin's results.

Thanks to Milton Miller who helped with the benchmarking, and the ppc64
team!

Anton
--
[email protected]
[email protected]

201150 total 0.0668
129051 .idled

43586 .__hash_page ppc64 specific
6714 .local_flush_tlb_range ppc64 specific
2773 .local_flush_tlb_page ppc64 specific

2203 .do_anonymous_page
2059 .lru_cache_add
1379 .__copy_tofrom_user

1220 .hpte_create_valid_pSeriesLP ppc64 LPAR specific

1039 .save_remaining_regs
871 .do_page_fault

575 .plpar_hcall ppc64 LPAR specific

554 .d_lookup
545 .rmqueue
482 .copy_page
475 .__strnlen_user
391 .__free_pages_ok
389 .zap_page_range
366 .atomic_dec_and_lock
296 .__find_get_page
287 .set_page_dirty
278 .page_cache_release
218 .handle_mm_fault
199 .__flush_dcache_icache
175 .schedule
173 .sys_brk
163 .exit_notify
156 .do_no_page
152 .lru_cache_del
147 .__wake_up
146 .copy_page_range
139 .ppc_irq_dispatch_handler
135 .find_vma
131 .__lru_cache_del
128 .fget
126 .link_path_walk
115 .do_generic_file_read
113 .pte_alloc_map
113 .filemap_nopage
109 .clear_user_page
106 .fput
104 .__alloc_pages
101 .nr_free_pages

2002-03-13 14:44:56

by Martin J. Bligh

[permalink] [raw]

Subject: Re: 10.31 second kernel compile

> make[1]: Leaving directory `/home/anton/intel_kernel/linux/arch/i386/boot'
> 130.63user 71.31system 0:10.31elapsed 1957%CPU (0avgtext+0avgdata 0maxresident)k

Wow! Is this box NUMA (what latency ratio, mem access speeds, etc?), or can you really build a straight SMP that big?

OK, now I'm going to have to build a bigger system ;-)

> Due to the final link and compress stage, there is a fair amount of idle
> time at the end of the run. Its going to be hard to push that number
> lower by adding cpus.

I think we need to fix the final phase .... anyone got any ideas
on parallelizing that?

> The profile results below show that kernel time is dominated by the low
> level ppc64 pagetable management. We are working to correct this, a lot
> of the overhead in __hash_page should be gone soon. The rest of the
> profile looks pretty good, do_anonymous_page and lru_cache_add show
> up high as they did in Martin's results.

I have some strange plans for the lru stuff, but it'll take me a while.
I'm curious as to why lru_cache_del is so much lower in your list than
add, whereas the ratio for me is about:

719 lru_cache_add 7.8152
477 lru_cache_del 21.6818

> 6714 .local_flush_tlb_range ppc64 specific
> 2773 .local_flush_tlb_page ppc64 specific

Do you know what's causing the tlb flushes? Just context switches?

> 554 .d_lookup

Did you try the dcache patches?

Can you publish lockmeter stats?

Thanks,

M.

2002-03-13 21:45:29

by Dave Hansen

[permalink] [raw]

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Martin J. Bligh wrote:
>>Due to the final link and compress stage, there is a fair amount of idle
>>time at the end of the run. Its going to be hard to push that number
>>lower by adding cpus.
>
> I think we need to fix the final phase .... anyone got any ideas
> on parallelizing that?
The final linking stage in the makefile looks like this:

vmlinux: piggy.o $(OBJECTS)
$(LD) $(ZLINKFLAGS) -o vmlinux $(OBJECTS) piggy.o

ld has a "-r" option
`--relocateable'
Generate relocatable output---i.e., generate an output
file that can in turn serve as input to `ld'. This is
often called partial linking. As a side effect, in
environments that support standard Unix magic numbers,
this option also sets the output file's magic number
to `OMAGIC'. If this option is not specified, an
absolute file is produced. When linking C++ programs,
this option will not resolve references to construc?
tors; to do that, use -Ur.

If we link in chunks, we can parallelize this.
Image 26 object files: [a-z].o

ld -r -o abcd.o [abcd].o
ld -r -o efgh.o [efgh].o
...
ld -r -o abcdefgh.o {abcd,efgh,...}.o

then, instead of the old final link stage:
$(LD) $(ZLINKFLAGS) -o vmlinux {abcdefgh,...}.o piggy.o

The final link will still take a while, but we will have at least broken
up SOME of the work. I'm going to see if this will actually work now.
Any comments?

--
Dave Hansen
[email protected]

2002-03-14 01:08:11

by Keith Owens

[permalink] [raw]

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Wed, 13 Mar 2002 13:44:43 -0800,
Dave Hansen <[email protected]> wrote:
>The final linking stage in the makefile looks like this:
>
>vmlinux: piggy.o $(OBJECTS)
> $(LD) $(ZLINKFLAGS) -o vmlinux $(OBJECTS) piggy.o
>
>If we link in chunks, we can parallelize this.
>Image 26 object files: [a-z].o
>
>ld -r -o abcd.o [abcd].o
>ld -r -o efgh.o [efgh].o
>...
>ld -r -o abcdefgh.o {abcd,efgh,...}.o
>
>then, instead of the old final link stage:
>$(LD) $(ZLINKFLAGS) -o vmlinux {abcdefgh,...}.o piggy.o
>
>The final link will still take a while, but we will have at least broken
>up SOME of the work. I'm going to see if this will actually work now.
>Any comments?

I'm sorry Dave, you can't do that ;) The init_call order is controlled
by link order, change the link order and you corrupt the kernel
initialization order, double plus ungood. The link of vmlinux requires
that $(OBJECTS) be exactly as coded.

2002-03-14 12:52:53

by Anton Blanchard

[permalink] [raw]

Subject: Re: 10.31 second kernel compile

> Wow! Is this box NUMA (what latency ratio, mem access speeds, etc?), or
> can you really build a straight SMP that big?

The system has two cpus on a chip sharing a l2 cache, 4 of these chips
built into a multichip module and up to 4 of these modules connected
together. The l3 and memory is distributed amongst the modules.

As you can imagine there is a hierarchy of latencies but the ratio is
quite low.

> OK, now I'm going to have to build a bigger system ;-)

I've got 8 more cpus in store too :)

> I have some strange plans for the lru stuff, but it'll take me a while.
> I'm curious as to why lru_cache_del is so much lower in your list than
> add, whereas the ratio for me is about:
>
> 719 lru_cache_add 7.8152
> 477 lru_cache_del 21.6818

Not sure about that.

> > 6714 .local_flush_tlb_range ppc64 specific
> > 2773 .local_flush_tlb_page ppc64 specific
>
> Do you know what's causing the tlb flushes? Just context switches?

Thats due to the way we manipulate the ppc hashed page table. Every
time we update the linux page tables we have to update the hashed
page table. There are some obvious optimisations we need to make,
hopefully then this will go away. The tlb flushes here are probably
process exits and things like COW faults.

> > 554 .d_lookup
>
> Did you try the dcache patches?

Not for this, I did do some benchmarking of the RCU dcache patches a
while ago which I should post.

> Can you publish lockmeter stats?

I didnt get a chance to run lockmeter, I tend to use the kernel profiler
and use a hacked readprofile (originally from tridge) that displays
profile hits vs assembly instruction. Thats usually good enough to work
out which spinlocks are a problem.

Anton

2002-03-14 13:13:42

by Dipankar Sarma

[permalink] [raw]

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Thu, Mar 14, 2002 at 10:27:26PM +1100, Anton Blanchard wrote:
>
> > > 554 .d_lookup
> >
> > Did you try the dcache patches?
>
> Not for this, I did do some benchmarking of the RCU dcache patches a
> while ago which I should post.

Please do ;-) This shows why we need to ease the pressure on dcache_lock.

>
> > Can you publish lockmeter stats?
>
> I didnt get a chance to run lockmeter, I tend to use the kernel profiler
> and use a hacked readprofile (originally from tridge) that displays
> profile hits vs assembly instruction. Thats usually good enough to work
> out which spinlocks are a problem.

Is this a PPC only hack ? Also, where can I get it ?

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-03-14 13:18:51

by Momchil Velikov

[permalink] [raw]

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

>>>>> "Anton" == Anton Blanchard <[email protected]> writes:
Anton> Thats due to the way we manipulate the ppc hashed page table. Every
Anton> time we update the linux page tables we have to update the hashed
Anton> page table. There are some obvious optimisations we need to make,

Out of curiousity, why there's a need to update the linux page tables ?
Doesn't pte/pmd/pgd family functions provide enough abstraction in
order to maintain _only_ the hashed page table ?

Regards,
-velco

2002-03-14 18:21:17

by Hanna Linder

[permalink] [raw]

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

--On Thursday, March 14, 2002 22:27:26 +1100 Anton Blanchard <[email protected]> wrote:

>> > 554 .d_lookup
>>
>> Did you try the dcache patches?
>
> Not for this, I did do some benchmarking of the RCU dcache patches a
> while ago which I should post.
>

There are two dcache patches. The one I wrote based on Al Viro's
suggestion for fast path walking is especially good for NUMA
systems hit by cache bouncing (d_lookup is the main culprit in
the dcache). Martin had some initial results that looked very
good.

Following is the clean 2.5.6 version which is also available at:
http://sf.net/projects/lse Under the Read Copy Update Section

Hanna Linder ([email protected])
IBM Linux Technology Center

-----------
diff -Nru -X dontdiff linux-2.5.6/fs/dcache.c linux-2.5.6-fw/fs/dcache.c
--- linux-2.5.6/fs/dcache.c Thu Mar 7 18:18:13 2002
+++ linux-2.5.6-fw/fs/dcache.c Fri Mar 8 13:50:43 2002
@@ -704,13 +704,22 @@

struct dentry * d_lookup(struct dentry * parent, struct qstr * name)
{
+ struct dentry * dentry;
+ spin_lock(&dcache_lock);
+ dentry = __d_lookup(parent,name);
+ spin_unlock(&dcache_lock);
+ return dentry;
+}
+
+struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
+{
+
unsigned int len = name->len;
unsigned int hash = name->hash;
const unsigned char *str = name->name;
struct list_head *head = d_hash(parent,hash);
struct list_head *tmp;

- spin_lock(&dcache_lock);
tmp = head->next;
for (;;) {
struct dentry * dentry = list_entry(tmp, struct dentry, d_hash);
@@ -732,10 +741,8 @@
}
__dget_locked(dentry);
dentry->d_vfs_flags |= DCACHE_REFERENCED;
- spin_unlock(&dcache_lock);
return dentry;
}
- spin_unlock(&dcache_lock);
return NULL;
}

diff -Nru -X dontdiff linux-2.5.6/fs/namei.c linux-2.5.6-fw/fs/namei.c
--- linux-2.5.6/fs/namei.c Thu Mar 7 18:18:24 2002
+++ linux-2.5.6-fw/fs/namei.c Fri Mar 8 13:56:25 2002
@@ -268,8 +268,41 @@
static struct dentry * cached_lookup(struct dentry * parent, struct qstr * name, int flags)
{
struct dentry * dentry = d_lookup(parent, name);
+
+ if (dentry && dentry->d_op && dentry->d_op->d_revalidate) {
+ if (!dentry->d_op->d_revalidate(dentry, flags) && !d_invalidate(dentry)) {
+ dput(dentry);
+ dentry = NULL;
+ }
+ }
+ return dentry;
+}

+/*for fastwalking*/
+static inline void undo_locked(struct nameidata *nd)
+{
+ if(nd->flags & LOOKUP_LOCKED){
+ dget(nd->dentry);
+ mntget(nd->mnt);
+ spin_unlock(&dcache_lock);
+ nd->flags &= ~LOOKUP_LOCKED;
+ }
+}
+
+/*
+ * For fast path lookup while holding the dcache_lock.
+ * SMP-safe
+ */
+static struct dentry * cached_lookup_nd(struct nameidata * nd, struct qstr * name, int flags)
+{
+ struct dentry * dentry = NULL;
+ if(!(nd->flags & LOOKUP_LOCKED))
+ return cached_lookup(nd->dentry, name, flags);
+
+ dentry = __d_lookup(nd->dentry, name);
+
if (dentry && dentry->d_op && dentry->d_op->d_revalidate) {
+ undo_locked(nd);
if (!dentry->d_op->d_revalidate(dentry, flags) && !d_invalidate(dentry)) {
dput(dentry);
dentry = NULL;
@@ -279,6 +312,34 @@
}

/*
+ * Short-cut version of permission(), for calling by
+ * path_walk(), when dcache lock is held. Combines parts
+ * of permission() and vfs_permission(), and tests ONLY for
+ * MAY_EXEC permission.
+ *
+ * If appropriate, check DAC only. If not appropriate, or
+ * short-cut DAC fails, then call permission() to do more
+ * complete permission check.
+ */
+static inline int exec_permission_lite(struct inode *inode)
+{
+ umode_t mode = inode->i_mode;
+
+ if ((inode->i_op && inode->i_op->permission))
+ return -EACCES;
+
+ if (current->fsuid == inode->i_uid)
+ mode >>= 6;
+ else if (in_group_p(inode->i_gid))
+ mode >>= 3;
+
+ if (mode & MAY_EXEC)
+ return 0;
+
+ return -EACCES;
+}
+
+/*
* This is called when everything else fails, and we actually have
* to go to the low-level filesystem to find out what we should do..
*
@@ -472,7 +533,9 @@
struct qstr this;
unsigned int c;

- err = permission(inode, MAY_EXEC);
+ err = exec_permission_lite(inode);
+ if(err)
+ err = permission(inode, MAY_EXEC);
dentry = ERR_PTR(err);
if (err)
break;
@@ -507,6 +570,7 @@
case 2:
if (this.name[1] != '.')
break;
+ undo_locked(nd);
follow_dotdot(nd);
inode = nd->dentry->d_inode;
/* fallthrough */
@@ -523,16 +587,20 @@
break;
}
/* This does the actual lookups.. */
- dentry = cached_lookup(nd->dentry, &this, LOOKUP_CONTINUE);
+ dentry = cached_lookup_nd(nd, &this, LOOKUP_CONTINUE);
if (!dentry) {
+ undo_locked(nd);
dentry = real_lookup(nd->dentry, &this, LOOKUP_CONTINUE);
err = PTR_ERR(dentry);
if (IS_ERR(dentry))
break;
}
/* Check mountpoints.. */
- while (d_mountpoint(dentry) && __follow_down(&nd->mnt, &dentry))
- ;
+ if(d_mountpoint(dentry)){
+ undo_locked(nd);
+ while (d_mountpoint(dentry) && __follow_down(&nd->mnt, &dentry))
+ ;
+ }

err = -ENOENT;
inode = dentry->d_inode;
@@ -543,6 +611,7 @@
goto out_dput;

if (inode->i_op->follow_link) {
+ undo_locked(nd);
err = do_follow_link(dentry, nd);
dput(dentry);
if (err)
@@ -555,7 +624,8 @@
if (!inode->i_op)
break;
} else {
- dput(nd->dentry);
+ if (!(nd->flags & LOOKUP_LOCKED))
+ dput(nd->dentry);
nd->dentry = dentry;
}
err = -ENOTDIR;
@@ -575,6 +645,7 @@
case 2:
if (this.name[1] != '.')
break;
+ undo_locked(nd);
follow_dotdot(nd);
inode = nd->dentry->d_inode;
/* fallthrough */
@@ -586,7 +657,8 @@
if (err < 0)
break;
}
- dentry = cached_lookup(nd->dentry, &this, 0);
+ dentry = cached_lookup_nd(nd, &this, 0);
+ undo_locked(nd);
if (!dentry) {
dentry = real_lookup(nd->dentry, &this, 0);
err = PTR_ERR(dentry);
@@ -626,11 +698,14 @@
else if (this.len == 2 && this.name[1] == '.')
nd->last_type = LAST_DOTDOT;
return_base:
+ undo_locked(nd);
return 0;
out_dput:
+ undo_locked(nd);
dput(dentry);
break;
}
+ undo_locked(nd);
path_release(nd);
return_err:
return err;
@@ -734,6 +809,36 @@
nd->dentry = dget(current->fs->pwd);
read_unlock(&current->fs->lock);
return 1;
+}
+
+int path_lookup(const char *name, unsigned int flags, struct nameidata *nd)
+{
+ nd->last_type = LAST_ROOT; /* if there are only slashes... */
+ nd->flags = flags;
+ if (*name=='/'){
+ read_lock(&current->fs->lock);
+ if (current->fs->altroot && !(nd->flags & LOOKUP_NOALT)) {
+ nd->mnt = mntget(current->fs->altrootmnt);
+ nd->dentry = dget(current->fs->altroot);
+ read_unlock(&current->fs->lock);
+ if (__emul_lookup_dentry(name,nd))
+ return 0;
+ read_lock(&current->fs->lock);
+ }
+ spin_lock(&dcache_lock); /*to avoid cacheline bouncing with d_count*/
+ nd->mnt = current->fs->rootmnt;
+ nd->dentry = current->fs->root;
+ read_unlock(&current->fs->lock);
+ }
+ else{
+ read_lock(&current->fs->lock);
+ spin_lock(&dcache_lock);
+ nd->mnt = current->fs->pwdmnt;
+ nd->dentry = current->fs->pwd;
+ read_unlock(&current->fs->lock);
+ }
+ nd->flags |= LOOKUP_LOCKED;
+ return (path_walk(name, nd));
}

/*
diff -Nru -X dontdiff linux-2.5.6/include/linux/dcache.h linux-2.5.6-fw/include/linux/dcache.h
--- linux-2.5.6/include/linux/dcache.h Thu Mar 7 18:18:30 2002
+++ linux-2.5.6-fw/include/linux/dcache.h Fri Mar 8 13:50:43 2002
@@ -220,6 +220,7 @@

/* appendix may either be NULL or be used for transname suffixes */
extern struct dentry * d_lookup(struct dentry *, struct qstr *);
+extern struct dentry * __d_lookup(struct dentry *, struct qstr *);

/* validate "insecure" dentry pointer */
extern int d_validate(struct dentry *, struct dentry *);
diff -Nru -X dontdiff linux-2.5.6/include/linux/fs.h linux-2.5.6-fw/include/linux/fs.h
--- linux-2.5.6/include/linux/fs.h Thu Mar 7 18:18:19 2002
+++ linux-2.5.6-fw/include/linux/fs.h Fri Mar 8 13:50:43 2002
@@ -1273,12 +1273,15 @@
* - require a directory
* - ending slashes ok even for nonexistent files
* - internal "there are more path compnents" flag
+ * - locked when lookup done with dcache_lock held
*/
#define LOOKUP_FOLLOW (1)
#define LOOKUP_DIRECTORY (2)
#define LOOKUP_CONTINUE (4)
#define LOOKUP_PARENT (16)
#define LOOKUP_NOALT (32)
+#define LOOKUP_LOCKED (64)
+
/*
* Type of the last component on LOOKUP_PARENT
*/
@@ -1309,13 +1312,7 @@
extern int FASTCALL(path_init(const char *, unsigned, struct nameidata *));
extern int FASTCALL(path_walk(const char *, struct nameidata *));
extern int FASTCALL(link_path_walk(const char *, struct nameidata *));
-static inline int path_lookup(const char *path, unsigned flags, struct nameidata *nd)
-{
- int error = 0;
- if (path_init(path, flags, nd))
- error = path_walk(path, nd);
- return error;
-}
+extern int FASTCALL(path_lookup(const char *, unsigned, struct nameidata *));
extern void path_release(struct nameidata *);
extern int follow_down(struct vfsmount **, struct dentry **);
extern int follow_up(struct vfsmount **, struct dentry **);

2002-03-14 18:39:07

by Daniel Phillips

[permalink] [raw]

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On March 14, 2002 02:21 pm, Momchil Velikov wrote:
> >>>>> "Anton" == Anton Blanchard <[email protected]> writes:
> Anton> Thats due to the way we manipulate the ppc hashed page table. Every
> Anton> time we update the linux page tables we have to update the hashed
> Anton> page table. There are some obvious optimisations we need to make,
>
> Out of curiousity, why there's a need to update the linux page tables ?
> Doesn't pte/pmd/pgd family functions provide enough abstraction in
> order to maintain _only_ the hashed page table ?

No, it's hardwired to the x86 tree view of page translation.

--
Daniel

2002-03-14 19:07:43

by Linus Torvalds

[permalink] [raw]

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

In article <[email protected]>,
Momchil Velikov <[email protected]> wrote:
>
>Out of curiousity, why there's a need to update the linux page tables ?
>Doesn't pte/pmd/pgd family functions provide enough abstraction in
>order to maintain _only_ the hashed page table ?

No. The IBM hashed page tables are not page tables at all, they are
really just a bigger 16-way set-associative in-memory TLB.

You can't actually sanely keep track of VM layout in them.

Those POWER4 machines are wonderful things, but they have a few quirks:

- it's so expensive that anybody who is slightly price-conscious gets a
farm of PC's instead. Oh, well.

- the CPU module alone is something like .5 kilowatts (translation:
don't expect it in a nice desktop factor, even if you could afford
it).

- IBM nomenclature really is broken. They call disks DASD devices, and
they call their hash table a page table, and they just confuse
themselves and everybody else for no good reason. They number bits
the wrong way around, for example (and big-endian bitordering really
_is_ clearly inferior to little-endian, unlike byte-ordering. Watch
the _same_ bits in the _same_ register change name in the 32 vs
64-bit architecture manuals, and puke)

But with all their faults, they do have this really studly setup with 8
big, fast CPU's on a single module. A few of those modules and you get
some ass-kick performance numbers. As you can see.

Linus

2002-03-15 07:13:26

by Chris Wedgwood

[permalink] [raw]

Subject: Re: 10.31 second kernel compile

On Wed, Mar 13, 2002 at 06:44:56AM -0800, Martin J. Bligh wrote:

I think we need to fix the final phase .... anyone got any ideas
on parallelizing that?

Redefine the benchmark not to include the final link :)

--cw

2002-03-15 12:17:27

by Chris Wedgwood

[permalink] [raw]

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

On Thu, Mar 14, 2002 at 07:33:40PM +0100, Daniel Phillips wrote:

On March 14, 2002 02:21 pm, Momchil Velikov wrote:

> Out of curiousity, why there's a need to update the linux page
> tables ? Doesn't pte/pmd/pgd family functions provide enough
> abstraction in order to maintain _only_ the hashed page table ?

No, it's hardwired to the x86 tree view of page translation.

What about doing soft TLB reloads then?

--cw

2002-03-15 18:22:08

by Linus Torvalds

[permalink] [raw]

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

In article <E16la2m-0000SX-00@starship>,
Daniel Phillips <[email protected]> wrote:
>On March 14, 2002 02:21 pm, Momchil Velikov wrote:
>>
>> Out of curiousity, why there's a need to update the linux page tables ?
>> Doesn't pte/pmd/pgd family functions provide enough abstraction in
>> order to maintain _only_ the hashed page table ?
>
>No, it's hardwired to the x86 tree view of page translation.

No no no.

If you think that, then you don't see the big picture.

In fact, when I did the 3-level page tables for Linux, no x86 chips that
could _use_ three levels actually existed.

The Linux MM was actually _designed_ for portability when I did the port
to alpha (oh, that's a long time ago). I even wrote my masters thesis on
why it was done the way it was done (the only actual academic use I ever
got out of the whole Linux exercise ;)

Yes a tree-based page table matches a lot of hardware architectures very
well. And it's _not_ just x86: it also matches soft-fill TLB's better
than alternatives (newer sparcs and MIPS), and matches a number of other
architecture specifications (eg alpha, m68k).

So on about 50% of architectures (and 99.9% of machines), the Linux MM
data structures can be made to map 1:1 to the hardware constructs, so
that you avoid duplicate information.

But more importantly than that, the whole point really is that the page
table tree as far as Linux is concerned is nothing but an _abstraction_
of the VM mapping hardware. It so happens that a tree format is the only
sane format to keep full VM information that works well with real loads.

Whatever the hardware actually does, Linux considers that to be noting
but an extended TLB. When you can make the MM software tree map 1:1
with the extended TLB (as on x86), you win in memory usage and in
cheaper TLB invalidates, but you _could_ (if you wanted to) just keep
two separate trees. In fact, with the rmap patches, that's exactly what
you see: the software tree is _not_ 1:1 with the hardare tree any more
(but it _is_ a proper superset, so that you can still get partial
sharing and still get the cheaper TLB updates).

Are there machines where the sharing between the software abstraction
and the hardware isn't as total? Sure. But if you actually know how
hashed page tables work on ppc, you'd be aware of the fact that they
aren't actualy able to do a full VM mapping - when a hash chain gets too
long, the hardware is no longer able to look it up ("too long" being 16
entries on a PPC, for example).

And that's a common situation with non-tree VM representations - they
aren't actually VM representations, they are just caches of what the
_real_ representation is. And what do we call such caches? Right: they
are nothing but a TLB.

So the fact is, the Linux tree-based VM has _nothing_ to do with x86
tree-basedness, and everything to do with the fact that it's the only
sane way to keep VM information.

The fact that it maps 1:1 to the x86 trees with the "folding" of the mid
layer was a design consideration, for sure. Being efficient and clever
is always good. But the basic reason for tree-ness lies elsewhere.
(The basic reasons for tree-ness is why so many architectures _do_ use a
tree-based page table - you should think of PPC and ia64 as the sick
puppies who didn't understand. Read the PPC documentation on virtual
memory, and you'll see just _how_ sick they are).

Linus

2002-03-16 05:15:27

by Anton Blanchard

[permalink] [raw]

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

> What about doing soft TLB reloads then?

ppc32 linux preloads entries into the hashed pagetable in
update_mmu_cache. Im about to commit a patch to do the same thing
in ppc64, at the moment we take two exceptions per pagefault which
is pretty ugly.

Some ppc32 hardware does allow you to take an exception for a TLB miss
(ie bypass the hashed pagetable completely).

Anton

2002-03-16 05:30:43

by Anton Blanchard

[permalink] [raw]

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Hi,

> There are two dcache patches. The one I wrote based on Al Viro's
> suggestion for fast path walking is especially good for NUMA
> systems hit by cache bouncing (d_lookup is the main culprit in
> the dcache). Martin had some initial results that looked very
> good.

I gave the patch a go, here is the before and after for the kernel
compile benchmark. As you can see d_lookup and atomic_dec_and_lock
have both dropped. Since the main bottleneck for us is still the
ppc64 mm code, we didnt see a noticable drop in wall clock time.

It would be interesting to try the patch on a large specweb run,
Ive seen the dcache lock become a problem when running 8 way specweb.

Anton

before:
155912 total 0.0550
114562 .cpu_idle
12615 .local_flush_tlb_range
8476 .local_flush_tlb_page
2576 .insert_hpte_into_group
1980 .do_anonymous_page
1813 .lru_cache_add
1390 .d_lookup
1320 .__copy_tofrom_user
1140 .save_remaining_regs
612 .rmqueue
517 .atomic_dec_and_lock
492 .do_page_fault
444 .copy_page
438 .__free_pages_ok
375 .set_page_dirty
350 .zap_page_range
314 .schedule
270 .__find_get_page
245 .page_cache_release
233 .lru_cache_del
231 .hvc_poll
215 .sys_brk

after:
152844 total 0.0539
113527 .cpu_idle
12740 .local_flush_tlb_range
7701 .local_flush_tlb_page
2564 .insert_hpte_into_group
2099 .do_anonymous_page
1780 .lru_cache_add
1230 .__copy_tofrom_user
1082 .save_remaining_regs
581 .rmqueue
486 .__free_pages_ok
479 .do_page_fault
465 .copy_page
371 .zap_page_range
333 .atomic_dec_and_lock
332 .set_page_dirty
286 .__find_get_page
275 .__d_lookup
263 .path_lookup
250 .page_cache_release
221 .lru_cache_del
218 .sys_brk
215 .__flush_dcache_icache

2002-03-16 06:16:34

by Anton Blanchard

[permalink] [raw]

Subject: 7.52 second kernel compile

> Let the kernel compile benchmarks continue!

I think Im addicted. I need help!

In this update we added 8 cpus and rewrote the ppc64 pagetable management
code to do lockless inserts and removals (there is still locking at
the pte level to avoid races).

hardware: 32 way logical partition, 1.1GHz POWER4, 60G RAM

kernel: 2.5.7-pre1 + ppc64 pagetable rework

kernel compiled: 2.4.18 x86 with Martin's config

compiler: gcc 2.95.3 x86 cross compiler

make[1]: Leaving directory `/home/anton/intel_kernel/linux/arch/i386/boot'
128.89user 40.23system 0:07.52elapsed 2246%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (437084major+572835minor)pagefaults 0swaps

7.52 seconds is not a bad result for something running under a hypervisor.
The profile looks much better now. We still spend a lot of time flushing tlb
entries but we can look into batching them.

Anton
--
[email protected]
[email protected]

155912 total 0.0550
114562 .cpu_idle

12615 .local_flush_tlb_range
8476 .local_flush_tlb_page
2576 .insert_hpte_into_group

1980 .do_anonymous_page
1813 .lru_cache_add
1390 .d_lookup
1320 .__copy_tofrom_user
1140 .save_remaining_regs
612 .rmqueue
517 .atomic_dec_and_lock
492 .do_page_fault
444 .copy_page
438 .__free_pages_ok
375 .set_page_dirty
350 .zap_page_range
314 .schedule
270 .__find_get_page
245 .page_cache_release
233 .lru_cache_del
231 .hvc_poll
215 .sys_brk

2002-03-16 06:44:28

by Gerrit Huizenga

[permalink] [raw]

Subject: Re: [Lse-tech] 7.52 second kernel compile

And this *without* the dcache_lock? Hmm. So you are saying there
may still be room for improvement?

BTW, are you doing this all out of cache/memory or do you have a
disk/controller quick enough to do the initial little file reads that
fast?

gerrit

In message <20020316061535.GA16653@krispykreme>, > : Anton Blanchard writes:
>
> > Let the kernel compile benchmarks continue!
>
> I think Im addicted. I need help!
>
> In this update we added 8 cpus and rewrote the ppc64 pagetable management
> code to do lockless inserts and removals (there is still locking at
> the pte level to avoid races).
>
> hardware: 32 way logical partition, 1.1GHz POWER4, 60G RAM
>
> kernel: 2.5.7-pre1 + ppc64 pagetable rework
>
> kernel compiled: 2.4.18 x86 with Martin's config
>
> compiler: gcc 2.95.3 x86 cross compiler
>
> make[1]: Leaving directory `/home/anton/intel_kernel/linux/arch/i386/boot'
> 128.89user 40.23system 0:07.52elapsed 2246%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (437084major+572835minor)pagefaults 0swaps
>
> 7.52 seconds is not a bad result for something running under a hypervisor.
> The profile looks much better now. We still spend a lot of time flushing tlb
> entries but we can look into batching them.
>
> Anton
> --
> [email protected]
> [email protected]
>
> 155912 total 0.0550
> 114562 .cpu_idle
>
> 12615 .local_flush_tlb_range
> 8476 .local_flush_tlb_page
> 2576 .insert_hpte_into_group
>
> 1980 .do_anonymous_page
> 1813 .lru_cache_add
> 1390 .d_lookup
> 1320 .__copy_tofrom_user
> 1140 .save_remaining_regs
> 612 .rmqueue
> 517 .atomic_dec_and_lock
> 492 .do_page_fault
> 444 .copy_page
> 438 .__free_pages_ok
> 375 .set_page_dirty
> 350 .zap_page_range
> 314 .schedule
> 270 .__find_get_page
> 245 .page_cache_release
> 233 .lru_cache_del
> 231 .hvc_poll
> 215 .sys_brk
>
> _______________________________________________
> Lse-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/lse-tech

2002-03-16 08:07:18

by Linus Torvalds

[permalink] [raw]

Subject: 10.31 second kernel compile

Subject: Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: 7.52 second kernel compile

Subject: Re: [Lse-tech] 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] 7.52 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] 7.52 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: [Lse-tech] 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: [Lse-tech] Re: 10.31 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: RE: 7.52 second kernel compile

Subject: Re: [Lse-tech] 7.52 second kernel compile

Subject: Re: [Lse-tech] 7.52 second kernel compile

Subject: some RCU dcache and ratcache results

Subject: Re: [Lse-tech] 7.52 second kernel compile

Subject: Re: [Lse-tech] 7.52 second kernel compile

Subject: Re: [Lse-tech] 7.52 second kernel compile

Subject: Re: [Lse-tech] 7.52 second kernel compile

Subject: 0.73 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile

Subject: Re: 7.52 second kernel compile