2003-02-20 04:15:55

by Martin J. Bligh

Subject: Performance of partial object-based rmap

The performance delta between 2.5.62-mjb1 and 2.5.62-mjb2 is caused
by the partial object-based rmap patch (written by Dave McCracken).
I expect this patch to have an increasing impact on workloads with
more processes, and it should give a substantial space saving as
well as a performance increase. Results from a 16x NUMA-Q system ...

Profile comparison:

before
15525 page_remove_rmap
6415 page_add_rmap

after
2055 page_add_rmap
1983 page_remove_rmap

(performance of 62 was equivalent to 61)

Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
              Elapsed     User   System      CPU
2.5.61          45.77   561.71   118.87  1486.50
2.5.62-mjb1     45.81   564.41   112.76  1478.00
2.5.62-mjb2     44.88   563.64    98.29  1474.50

Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
              Elapsed     User   System      CPU
2.5.61          47.46   565.70   144.77  1496.33
2.5.62-mjb1     47.21   569.17   139.55  1500.67
2.5.62-mjb2     46.09   568.19   121.83  1496.67

DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run results
are non-compliant and not-comparable with any published results.

Results are shown as percentages of the first set displayed (2.5.61 = 100.0%).

SDET 1 (see disclaimer)
              Throughput   Std. Dev
2.5.61            100.0%       4.5%
2.5.62-mjb1        95.2%       3.0%
2.5.62-mjb2        97.1%       4.6%

SDET 2 (see disclaimer)
              Throughput   Std. Dev
2.5.61            100.0%       4.3%
2.5.62-mjb1        93.4%       6.6%
2.5.62-mjb2        99.4%       6.1%

SDET 4 (see disclaimer)
              Throughput   Std. Dev
2.5.61            100.0%       2.4%
2.5.62-mjb1       101.9%       7.1%
2.5.62-mjb2       116.9%       1.5%

SDET 8 (see disclaimer)
              Throughput   Std. Dev
2.5.61            100.0%       7.4%
2.5.62-mjb1       111.1%       1.8%
2.5.62-mjb2       121.8%       5.9%

SDET 16 (see disclaimer)
              Throughput   Std. Dev
2.5.61            100.0%       2.0%
2.5.62-mjb1       105.0%       2.2%
2.5.62-mjb2       107.0%       2.1%

SDET 32 (see disclaimer)
              Throughput   Std. Dev
2.5.61            100.0%       7.9%
2.5.62-mjb1       112.4%       1.2%
2.5.62-mjb2       116.8%       0.9%

SDET 64 (see disclaimer)
              Throughput   Std. Dev
2.5.61            100.0%       1.2%
2.5.62-mjb1       111.0%       0.7%
2.5.62-mjb2       116.0%       0.9%


2003-02-20 04:38:41

by William Lee Irwin III

Subject: Re: Performance of partial object-based rmap

On Wed, Feb 19, 2003 at 08:25:52PM -0800, Martin J. Bligh wrote:
> Profile comparison:
> before
> 15525 page_remove_rmap
> 6415 page_add_rmap
> after
> 2055 page_add_rmap
> 1983 page_remove_rmap

Could I get a larger, multiplicative differential profile?
i.e. ratios of the fractions of profile hits?
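That is, for each function f:
ratio(f) = (hits_after(f) / total_after) / (hits_before(f) / total_before)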

If you have trouble generating such I can do so myself from
fuller profile results.


Thanks.


-- wli

2003-02-20 05:04:13

by Martin J. Bligh

Subject: Re: Performance of partial object-based rmap

>> Profile comparison:
>> before
>> 15525 page_remove_rmap
>> 6415 page_add_rmap
>> after
>> 2055 page_add_rmap
>> 1983 page_remove_rmap
>
> Could I get a larger, multiplicative differential profile?
> i.e. ratios of the fractions of profile hits?
>
> If you have trouble generating such I can do so myself from
> fuller profile results.

before:
187256 total
after:
170196 total

M.


2003-02-20 05:07:07

by William Lee Irwin III

Subject: Re: [Lse-tech] Re: Performance of partial object-based rmap

At some point in the past, I wrote:
>> Could I get a larger, multiplicative differential profile?
>> i.e. ratios of the fractions of profile hits?
>> If you have trouble generating such I can do so myself from
>> fuller profile results.

On Wed, Feb 19, 2003 at 09:14:10PM -0800, Martin J. Bligh wrote:
> before:
> 187256 total
> after:
> 170196 total

Well, I was trying to get an idea of what got slower to compensate
for the improvement in page_(add|remove)_rmap() times.
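For instance, combining your totals with the original numbers:
page_remove_rmap's share of hits fell from 15525/187256 (about 8.3%)
to 1983/170196 (about 1.2%), a multiplicative ratio of roughly 7x,
and page_add_rmap went from about 3.4% to 1.2%, roughly 2.8x. What
I'm after is where the remaining hits moved.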


-- wli

2003-02-20 05:17:43

by Martin J. Bligh

Subject: Re: [Lse-tech] Re: Performance of partial object-based rmap

>>> Could I get a larger, multiplicative differential profile?
>>> i.e. ratios of the fractions of profile hits?
>>> If you have trouble generating such I can do so myself from
>>> fuller profile results.
>
> On Wed, Feb 19, 2003 at 09:14:10PM -0800, Martin J. Bligh wrote:
>> before:
>> 187256 total
>> after:
>> 170196 total
>
> Well, I was trying to get an idea of what got slower to compensate
> for the improvement in page_(add|remove)_rmap() times.

diffprofile:

1514 default_idle
921 .text.lock.file_table
145 vm_enough_memory
131 release_pages
93 file_move
68 __copy_to_user_ll
62 fd_install
60 buffered_rmqueue
58 vma_merge
52 __fput
...
-65 kmem_cache_alloc
-72 copy_page_range
-73 do_anonymous_page
-76 get_empty_filp
-98 filemap_nopage
-168 unmap_all_pages
-352 find_get_page
-402 kmem_cache_free
-594 __pte_chain_free
-4360 page_add_rmap
-13542 page_remove_rmap
-17060 total

I'll try out akpm's file table stuff later today.

M.

2003-02-21 01:43:12

by Martin J. Bligh

Subject: Re: Performance of partial object-based rmap

> The performance delta between 2.5.62-mjb1 and 2.5.62-mjb2 is caused
> by the partial object-based rmap patch (written by Dave McCracken).
> I expect this patch to have an increasing impact on workloads with
> more processes, and it should give a substantial space saving as
> well as a performance increase. Results from a 16x NUMA-Q system ...
>
> Profile comparison:
>
> before
> 15525 page_remove_rmap
> 6415 page_add_rmap
>
> after
> 2055 page_add_rmap
> 1983 page_remove_rmap

Did some space consumption comparisons on make -j 256:

before:
24116 pte_chain objects in slab cache
after:
716 pte_chain objects in slab cache

The vast majority of anonymous pages (for which we're using the
non-object-based method) are singletons, and hence use pte_direct ...
hence the massive space reduction.
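
For illustration, a minimal user-space sketch of the pte_direct idea -
the names and layout are illustrative, not the actual 2.5 code:

#include <stddef.h>

#define PG_DIRECT 0x1			/* page is mapped by exactly one pte */

struct pte_chain {
	unsigned long ptes[7];		/* several pte slots per chain node */
	struct pte_chain *next;
};

struct page {				/* illustrative subset of struct page */
	unsigned long flags;
	union {
		unsigned long direct;	/* singleton: the lone pte's address */
		struct pte_chain *chain; /* shared: chain of all mapping ptes */
	} pte;
};

/* The first mapping allocates nothing; only a second mapping of the
 * same page converts the direct pte into a real pte_chain node.
 * Assumes a freshly zeroed struct page. */
void page_add_rmap(struct page *page, unsigned long pte,
		   struct pte_chain *prealloc)
{
	if (!(page->flags & PG_DIRECT) && !page->pte.chain) {
		page->pte.direct = pte;	/* common case: no allocation */
		page->flags |= PG_DIRECT;
		return;
	}
	if (page->flags & PG_DIRECT) {
		/* second mapping: spill the direct pte into a chain */
		prealloc->ptes[0] = page->pte.direct;
		prealloc->ptes[1] = pte;
		prealloc->next = NULL;
		page->flags &= ~PG_DIRECT;
		page->pte.chain = prealloc;
		return;
	}
	/* third and later mappings append to the chain (omitted) */
}

Since nearly every anonymous page stays in the first case, the slab
cache only ever sees the rare multiply-mapped pages - which is why
24116 pte_chain objects collapse to 716 above.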

M.

2003-02-21 02:56:44

by Andrew Morton

Subject: Re: Performance of partial object-based rmap

"Martin J. Bligh" <[email protected]> wrote:
>
> ...
> Did some space consumption comparisons on make -j 256:
>
> before:
> 24116 pte_chain objects in slab cache
> after:
> 716 pte_chain objects in slab cache
>
> The vast majority of anonymous pages (for which we're using the
> non-object-based method) are singletons, and hence use pte_direct ...
> hence the massive space reduction.
>

OK. And that'll help the fork+exit CPU overhead too.

Let's talk about this a bit.

The problem remains that this design means that we need to walk all the vma's
which are attached to a page's address_space in an attempt to unmap that
page.

Whereas with the existing pte_chain walk, we only need to visit those pte's
which are still mapping the page.

So with 1000 processes, each holding 1000 vma's against the same file, we
need to visit 1,000,000 vma's to unmap one page. If that unmap attempt is
unsuccessful, we *still* need to walk 1,000,000 vma's next time we try to
unmap it.

Whereas with pte-based rmap, we do not need to walk over already-unmapped
pte's.
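
To make the two search shapes concrete, here is a sketch of the
object-based walk - types and names are illustrative, not Dave's
actual patch:

#define PAGE_SHIFT 12

struct vma {
	unsigned long vm_start, vm_end;	/* virtual address range */
	unsigned long vm_pgoff;		/* file offset of vm_start, in pages */
	struct vma *i_mmap_next;	/* next vma attached to the file */
};

struct address_space {
	struct vma *i_mmap;		/* every vma mapping the file */
};

struct page {
	struct address_space *mapping;
	unsigned long index;		/* this page's file offset, in pages */
};

/* Object-based unmap: visit every vma on the file, mapped or not. */
void try_to_unmap_obj(struct page *page)
{
	struct vma *vma;

	for (vma = page->mapping->i_mmap; vma; vma = vma->i_mmap_next) {
		unsigned long addr;

		if (page->index < vma->vm_pgoff)
			continue;	/* vma's window starts past this page */
		addr = vma->vm_start +
			((page->index - vma->vm_pgoff) << PAGE_SHIFT);
		if (addr >= vma->vm_end)
			continue;	/* ... or ends before it */
		/* Look up the pte at addr and clear it if present
		 * (omitted).  The pte may well be gone already, but we
		 * paid for the visit anyway: 1000 processes x 1000 vmas
		 * means this loop runs 1,000,000 times per page,
		 * successful or not. */
	}
}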

So it is easy to see how there is at least a theoretical collapse in search
complexity here. Which affects all platforms. And, of course there is a
demonstrated space consumption collapse with the existing code which affects
big ia32 machines.

It is not clear how realistic the search complexity problem really is - the
2.4 virtual scan has basically the same problem and there is not a lot of
evidence of people hurting from it. In fact we see more hurt from full rmap
than from the 1000x1000 thing.

The search complexity _may_ affect less exotic things than databases. Many
processes mapping libc, say. But in that case the number of VMA's is very
small - if this was a realistic problem then 2.4 would be in real strife.

Dave's patch is simple enough to merit serious consideration.

<scratches head>

I think the guiding principle here is that we should not optimise for the
uncommon case (as rmap is doing), and we should not allow the uncommon case
to be utterly terrible (as Dave's patch can do).

It would be overly optimistic to assume that hugetlb pages will save us here.

Any bright ideas?


2003-02-21 03:15:45

by Rik van Riel

Subject: Re: Performance of partial object-based rmap

On Thu, 20 Feb 2003, Andrew Morton wrote:

> I think the guiding principle here is that we should not optimise for the
> uncommon case (as rmap is doing), and we should not allow the uncommon case
> to be utterly terrible (as Dave's patch can do).

This "guiding principle" appears to be the primary reason why
we've taken over a year to stabilise linux 2.0 and linux 2.2
and linux 2.4 ... or at least, too much of a focus on the first
half of this guiding principle. ;)

We absolutely have to take care in avoiding the worst case
scenarios, since statistics pretty much guarantee that somebody
will run into nothing but that scenario ...

cheers,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-21 03:22:39

by Martin J. Bligh

Subject: Re: Performance of partial object-based rmap

>> ...
>> Did some space consumption comparisons on make -j 256:
>>
>> before:
>> 24116 pte_chain objects in slab cache
>> after:
>> 716 pte_chain objects in slab cache
>>
>> The vast majority of anonymous pages (for which we're using the
>> non-object-based method) are singletons, and hence use pte_direct ...
>> hence the massive space reduction.
>>
>
> OK. And that'll help the fork+exit CPU overhead too.
>
> Let's talk about this a bit.
>
> The problem remains that this design means that we need to walk all the vma's
> which are attached to a page's address_space in an attempt to unmap that
> page.
>
> Whereas with the existing pte_chain walk, we only need to visit those pte's
> which are still mapping the page.
>
> So with 1000 processes, each holding 1000 vma's against the same file, we
> need to visit 1,000,000 vma's to unmap one page. If that unmap attempt is
> unsuccessful, we *still* need to walk 1,000,000 vma's next time we try to
> unmap it.

That's completely true, but I don't accept 1000x1000 as a realistic
scenario. I do acknowledge that it will perform worse on pageout operations
in some cases ... there's certainly a tradeoff here, and we need to measure
both sides of it.

> Whereas with pte-based rmap, we do not need to walk over already-unmapped
> pte's.
>
> So it is easy to see how there is at least a theoretical collapse in search
> complexity here. Which affects all platforms. And, of course there is a
> demonstrated space consumption collapse with the existing code which affects
> big ia32 machines.

The benefits come in both space and performance ... the space issue is,
admittedly, most important for ia32, but will to some extent affect all
machines.

> It is not clear how realistic the search complexity problem really is - the
> 2.4 virtual scan has basically the same problem and there is not a lot of
> evidence of people hurting from it. In fact we see more hurt from full rmap
> than from the 1000x1000 thing.
>
> The search complexity _may_ affect less exotic things than databases. Many
> processes mapping libc, say. But in that case the number of VMA's is very
> small - if this was a realistic problem then 2.4 would be in real strife.
>
> Dave's patch is simple enough to merit serious consideration.

Indeed, it's really astoundingly small.

> <scratches head>
>
> I think the guiding principle here is that we should not optimise for the
> uncommon case (as rmap is doing), and we should not allow the uncommon case
> to be utterly terrible (as Dave's patch can do).

Absolutely - as long as the uncommon case is at least slightly realistic.

> It would be overly optimistic to assume that hugetlb pages will save us here.
>
> Any bright ideas?

I'd like to see a semi-realistic workload that demonstrates the downside
of this method ... both to prove that it can exceed the upside, and so that
we can use it to analyse any possible fixes. All I have at the moment is
theory - I understand the problem in the abstract, and see that it exists
... but I'd like at least a synthetic microbenchmark that simulates
something vaguely realistic. I'm perfectly prepared to write it myself if
someone can show some pseudo-code for a rough scenario, along with a
description of why they think it's realistic; personally, I've yet to see
it happen in practice.
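
The rough shape I have in mind is something like this (a sketch only -
the process and mapping counts and the file name are illustrative): many
processes each build many single-page mappings of one large file, so its
address_space accumulates NPROC*NVMA vmas, and an external memory hog
then makes pageout walk all of them for each page it tries to evict.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define NPROC 100			/* processes sharing one file */
#define NVMA  100			/* mappings per process */
#define PAGE  4096

int main(void)
{
	char *maps[NVMA];
	volatile char sink;
	int fd, i;

	fd = open("bigfile", O_RDONLY);	/* a file larger than RAM */
	if (fd < 0)
		return 1;
	for (i = 1; i < NPROC; i++)
		if (fork() == 0)
			break;		/* children run the loop below too */
	for (i = 0; i < NVMA; i++) {
		/* single-page mappings at distinct offsets: each one
		 * becomes a separate vma attached to the file */
		maps[i] = mmap(NULL, PAGE, PROT_READ, MAP_SHARED,
			       fd, (off_t)i * PAGE);
		if (maps[i] != MAP_FAILED)
			sink = maps[i][0];	/* instantiate the pte */
	}
	(void)sink;
	pause();	/* hold the vmas; now push the box into pageout */
	return 0;
}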

We definitely need to do some UP tests under reasonable memory pressure,
swapping to disk ... I'll try to get that underway soon.

M.

2003-02-21 03:36:24

by Andrew Morton

Subject: Re: Performance of partial object-based rmap

Rik van Riel <[email protected]> wrote:
>
> On Thu, 20 Feb 2003, Andrew Morton wrote:
>
> > I think the guiding principle here is that we should not optimise for the
> > uncommon case (as rmap is doing), and we should not allow the uncommon case
> > to be utterly terrible (as Dave's patch can do).
>
> This "guiding principle" appears to be the primary reason why
> we've taken over a year to stabilise linux 2.0 and linux 2.2
> and linux 2.4 ... or at least, too much of a focus on the first
> half of this guiding principle. ;)
>
> We absolutely have to take care in avoiding the worst case
> scenarios, since statistics pretty much guarantee that somebody
> will run into nothing but that scenario ...
>

We see things like the below report, which realistically shows
the problems with the reverse map.

I have yet to see _any_ report that the problems to which you refer
are causing difficulty in the field.

I think there's a middle ground. Hint: MAP_ORACLE.


Begin forwarded message:

Date: Tue, 17 Sep 2002 13:30:42 -0500
From: "Peter Wong" <[email protected]>
To: [email protected], [email protected]
Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], "Bill Hartner" <[email protected]>, "Troy C Wilson" <[email protected]>
Subject: Examining the Performance and Cost of Reversemaps on 2.5.26 Under Heavy DB Workload



I measured a decision support workload using two 2.5.26-based
kernels, one of which does NOT have the rmap rollup patch while the
other DOES. The database size is 100GB. The 2.5.26 rmap rollup patch
was created by Dave McCracken.

Based upon the throughput rate and CPU utilization, the two
kernels achieve similar performance.

Based upon /proc/meminfo, the maximum reversemap count (number of
reverse-map entries) is 22,817,885.

Based upon /proc/slabinfo, the maximum number of active pte_chain
objects is 3,411,018 with 32 bytes each. It consumes about 104 MB. The
maximum number of slabs allocated for pte_chains is 30,186 with 4KB
each, corresponding to about 118 MB. Similar memory consumption for
rmaps is observed while running the same workload and using Andrew
Morton's mm2 patch under 2.5.32. Andrew's patch can be found at
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.32/2.5.32-mm2/.

Note that since readv is not available on 2.5.26, the runs above
used "read" instead of "readv".

My previous note (Sept. 10, 2002) indicated that the memory
consumption for rmaps under 2.5.32 using "readv" is about 223 MB. And
"readv" is the preferred method for this workload. The difference of
memory consumption by using read (118 MB) and readv (223 MB) is likely
due to the different I/O algorithms used by the DBMS. When there are
multiple instances of this workload running concurrently, the amount
of memory allocated to rmaps is even more significant. More data will
be provided later.

----------------------------------------------------------------------
System Information:

- 8-way 700MHz Pentium III Xeon processors, 2MB L2 Cache
- 4GB RAM
- 6 SCSI controllers connected to 120 9.1 GB disks with 10K RPM
----------------------------------------------------------------------

Regards,
Peter

Peter Wai Yee Wong
IBM Linux Technology Center, Performance Analysis
email: [email protected]

2003-02-21 04:15:44

by Rik van Riel

Subject: Re: Performance of partial object-based rmap

On Thu, 20 Feb 2003, Andrew Morton wrote:

> We see things like the below report, which realistically shows
> the problems with the reverse map.

Indeed, these need to be fixed.

> I have yet to see _any_ report that the problems to which you refer
> are causing difficulty in the field.

Well, Dave's patch has only been out for a day ;))

> I think there's a middle ground. Hint: MAP_ORACLE.

I think Dave's current patch could be a good start. In many
cases where object-based mapping hurts, the system would be using
nonlinear VMAs anyway.

This means we can probably get away with using Dave's object
based rmaps in the areas where they are currently supported,
while using page based rmaps for anonymous memory and nonlinear
vmas.

For object-based reverse mapping, the worst case is large
objects which are sparsely mapped many times (nonlinear vmas)
and deeply inherited COW anonymous memory (apache w/ 300 children).

For pte chains, the worst case is heavily shared, mostly read-only
areas (shared libraries) or heavily shared and heavily used shared
memory segments.

The hybrid scheme in Dave's patch looks like it uses the right
code in the right situation, avoiding each scheme's worst case by
falling back on the other reverse-mapping scheme.
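
As a sketch, the per-page dispatch is essentially just this
(illustrative, not the patch's actual code):

enum rmap_method { RMAP_PTE_CHAIN, RMAP_OBJECT };

/* Each scheme covers the other's worst case: pte chains for anonymous
 * and nonlinear pages, an i_mmap vma walk for plain file pages. */
static enum rmap_method rmap_method_for(int page_is_anon, int nonlinear)
{
	if (page_is_anon || nonlinear)
		return RMAP_PTE_CHAIN;
	return RMAP_OBJECT;
}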

The more I think about it, the more I think there are reasons
we might want to stick to a hybrid scheme forever...

regards,

Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-02-21 09:51:07

by William Lee Irwin III

Subject: Re: Performance of partial object-based rmap

On Fri, Feb 21, 2003 at 01:25:25AM -0300, Rik van Riel wrote:
> For object-based reverse mapping, the worst case is with large
> objects which are sparsely mapped many times (nonlinear vmas)
> and deeply inherited COW anonymous memory (apache w/ 300 children).

Actually few users of this care about time, only lowmem space. It
was (re)invented, after all, to save lowmem used for vma's.

The technical issues with 32-bit aren't anywhere near as difficult
to deal with as the massive amount of backlash anything targeted
at addressing 32-bit issues gets. These extremely irritating attempts
to marginalize every line of code written to address 32-bit issues are
probably best dealt with by showing common benefits with the so-called
"desktop" machines and/or workloads.

So witness a 768MB, 600MHz Athlon, running xmms, xterms, and mozilla:

Mem: 767376k av, 761932k used, 5444k free, 0k shrd, 70044k buff
619264k active, 104160k inactive
Swap: 506008k av, 155860k used, 350148k free 197644k cached

cache                    used    total     util
pte_chain               5182K    5188K   99.88%
radix_tree_node         1338K    3088K   43.32%
dentry_cache            1591K    2970K   53.59%
reiser_inode_cache       480K    1226K   39.17%
buffer_head             1040K    1043K   99.79%
size-4096                828K     828K  100.00%
size-32                  796K     800K   99.54%
biovec-BIO_MAX_PAGES     768K     780K   98.46%
pgd                      704K     704K  100.00%

nr_page_table_pages 689

which amounts to 2756KB RAM used for PTE's, demonstrating large
amounts of internal fragmentation within the pte_chain slab.

Reducing kernel memory consumption would reduce swapping requirements,
which generally speeds things up on the precious "desktop". The vfs
should probably get its act together too, since normal usage regularly
triggers explosive dentry_cache and reiser_inode_cache space usage.

-- wli

2003-02-21 19:06:33

by William Lee Irwin III

Subject: Re: [Lse-tech] Re: Performance of partial object-based rmap

On Fri, Feb 21, 2003 at 02:00:10AM -0800, William Lee Irwin III wrote:
> nr_page_table_pages 689
> which amounts to 2756KB RAM used for PTE's, demonstrating large

Pagetables are also grossly fragmented. From the same sample:

nr_reverse_maps 156027

To preempt the FAQ, this is the number of reverse mappings performed,
where one is done for every instantiated non-swap PTE.

$ echo $(( (689*1024)/156027.0 ))
4.5218840328917427

So the space flushed down the toilet there is also relatively large.
Approximately 4.5 PTE's per pagetable page, or a utilization of:

$ echo $(( (100*689)/156027.0 ))
0.44159023758708427

percent.

-- wli

2003-02-21 19:19:02

by William Lee Irwin III

Subject: Re: [Lse-tech] Re: Performance of partial object-based rmap

On Fri, Feb 21, 2003 at 02:00:10AM -0800, William Lee Irwin III wrote:
>> nr_page_table_pages 689
>> which amounts to 2756KB RAM used for PTE's, demonstrating large
> Pagetables are also grossly fragmented. From the same sample:
> nr_reverse_maps 156027
> To preempt the FAQ, this is the number of reverse mappings performed,
> where one is done for every instantiated non-swap PTE.

ARGH too early in the morning.

$ echo $(( (100*156027.0)/(689*1024) ))
22.11467593432511
percent utilization

or

$ echo $(( 156027.0/689 ))
226.45428156748912

PTE's per pagetable page.


-- wli