This patchset modifies the Linux kernel so that block sizes larger than
the page size can be supported. Larger block sizes are handled by using
compound pages of an arbitrary order for the page cache instead of
single order-0 pages.
- Support is added in a way that limits the changes to existing code.
As a result filesystems can support larger page I/O with minimal changes.
- The page cache functions are mostly unchanged. Instead of a page struct
representing a single page, they take a head page struct (which looks
the same as a regular page struct apart from the compound flags) and
operate on that. Most page cache functions can stay as they are.
- No locking protocols are added or modified.
- The support is also fully transparent at the level of the OS. No
specialized heuristics are added to switch to larger pages. Large
page support is enabled by filesystems or device drivers when a device
or volume is mounted. Larger block sizes are usually set during volume
creation although the patchset supports setting these sizes per file.
The formatted partition will then always be accessed with the
configured blocksize.
- Large blocks also do not mean that the 4k mmap semantics need to be abandoned.
The included mmap support will happily map 4k chunks of large blocks so that
user space sees no changes.
Some of the changes are:
- Replace the use of PAGE_CACHE_XXX constants to calculate offsets into
pages with functions that do the same and allow the constants to
be parameterized (see the sketch after this list).
- Extend the capabilities of compound pages so that they can be
put onto the LRU and reclaimed.
- Allow setting a larger blocksize via set_blocksize()
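As an illustration of the parameterized page cache helpers mentioned in the
first item above, here is a minimal sketch. The helper names and the
mapping->order field are illustrative assumptions, not necessarily what the
patchset actually uses:

	/* Sketch only; assumes <linux/pagemap.h> context. */
	static inline unsigned int page_cache_shift(struct address_space *mapping)
	{
		/* per-mapping compound page order; 0 keeps today's 4k behaviour */
		return PAGE_SHIFT + mapping->order;	/* ->order is assumed here */
	}

	static inline unsigned long page_cache_size(struct address_space *mapping)
	{
		return 1UL << page_cache_shift(mapping);
	}

	static inline pgoff_t page_cache_index(struct address_space *mapping,
					       loff_t pos)
	{
		/* replaces pos >> PAGE_CACHE_SHIFT */
		return pos >> page_cache_shift(mapping);
	}

	static inline unsigned long page_cache_offset(struct address_space *mapping,
						      loff_t pos)
	{
		/* replaces pos & ~PAGE_CACHE_MASK */
		return pos & (page_cache_size(mapping) - 1);
	}

With CONFIG_LARGE_BLOCKSIZE off, such inlines can collapse back to the old
constants, which is what allows a gradual migration.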
Rationales:
-----------
1. The ability to handle memory of an arbitrarily large size using
a single page struct "handle" is essential for scaling memory handling
and reducing overhead in multiple kernel subsystems. This patchset
is a strategic move that allows performance gains throughout the
kernel.
2. Reduce fsck times. Larger block sizes mean faster file system checking.
Using 64k block size will reduce the number of blocks to be managed
by a factor of 16 and produce much denser and contiguous metadata.
3. Performance. If we look at IA64 vs. x86_64 then it seems that the
faster interrupt handling on x86_64 compensates for the speed loss due to
a smaller page size (4k vs 16k on IA64). Supporting larger block
sizes on all architectures allows a significant reduction in I/O overhead
and increases the size of I/O that can be performed by hardware in a
single request since the number of scatter gather entries is typically
limited for one request. This is going to become increasingly important
for supporting the ever growing memory sizes since we may have to handle
excessively large numbers of 4k requests for data sizes that may become
common soon. For example, to write a 1 terabyte file the kernel would
have to handle 256 million 4k chunks.
4. Cross arch compatibility: It is currently not possible to mount
a 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.
With this patch this becomes possible. Note that this also means that
some filesystems are already capable of working with blocksizes of
up to 64k (ext2, XFS), which is currently only available on a select
few arches. This patchset enables that functionality on all arches.
There are no special modifications needed to the filesystems. The
set_blocksize() function call will simply support a larger blocksize
(see the sketch following this list).
5. VM scalability
Large block sizes mean less state keeping for the information being
transferred. For a 1TB file one needs to handle 256 million page
structs in the VM if one uses 4k page size. A 64k page size reduces
that amount to 16 million. If the limitations in existing filesystems
are removed then even higher reductions become possible. For very
large files like that a page size of 2 MB may be beneficial, which
will reduce the number of page structs to handle to 512k. The variable
nature of the block size means that the size can be tuned at file
system creation time for the anticipated needs on a volume.
6. IO scalability
The IO layer will receive large blocks of contiguous memory with
this patchset. This means that fewer scatter gather elements are needed
and the memory used is guaranteed to be contiguous. Instead of having
to handle 4k chunks we can f.e. handle 64k chunks in one go.
7. Limited scatter gather support restricts I/O sizes.
A lot of I/O controllers are limited in the number of scatter gather
elements that they support. For example a controller that supports 128
entries in the scatter gather lists can only perform I/O of 128*4k =
512k in one go. If the blocksize is larger (f.e. 64k) then we can perform
larger I/O transfers. If we support 128 entries then 128*64k = 8M
can be transferred in one transaction.
Dave Chinner measured a performance increase of 50% when going to 64k
blocksize with XFS with an earlier version of this patchset.
8. We have problems supporting devices with a higher blocksize than
page size. This is for example important to support CDs and DVDs that
can only read and write 32k or 64k blocks. We currently have a shim
layer in there to deal with this situation which limits the speed
of I/O. The developers are currently looking for ways to completely
bypass the page cache because of this deficiency.
9. 32/64k blocksize is also used in flash devices. Same issues.
10. Future harddisks will support bigger block sizes that Linux cannot
support since we are limited to PAGE_SIZE. Ok the on board cache
may buffer this for us but what is the point of handling smaller
page sizes than what the drive supports?
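As a concrete illustration of point 4, here is a hedged sketch of how a
filesystem's fill_super might request a 64k blocksize. sb_set_blocksize()
already exists with this signature; the only new behaviour with this patchset
is that sizes above PAGE_SIZE are accepted. The filesystem name is made up:

	static int example_fill_super(struct super_block *sb, void *data, int silent)
	{
		int blocksize = 65536;	/* a real fs reads this from its on-disk superblock */

		/* sb_set_blocksize() returns 0 if the size cannot be set */
		if (!sb_set_blocksize(sb, blocksize))
			return -EINVAL;

		/* ... metadata can now be read with sb_bread() in 64k blocks ... */
		return 0;
	}

Before this patchset, set_blocksize() rejects any size larger than PAGE_SIZE,
which is why 16k/64k volumes created on IA64 cannot be mounted on x86_64 today.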
Fragmentation issues
--------------------
The Linux VM is gradually acquiring abilities to defragment memory. These
capabilities are partially present for 2.6.23. Later versions may merge
more of the defragmentation work. The use of large pages may cause
significant fragmentation to memory. Large buffers require pages of higher
order. Defragmentation support is necessary to ensure that pages of higher
order are available or reclaimable when necessary.
There have been a number of statements that defragmentation cannot ever
work. However, the failures with the early defragmentation code from the
spring no longer occur. I have seen no failures with 2.6.23 when
testing with 16k and 32k blocksize. The effect of the limited
defragmentation capabilities in 2.6.23 may already be sufficient for many
uses.
I would like to increase the supported blocksize to very large pages in the
future so that device drivers will be capable of providing large contiguous
mapping. For that purpose I think that we need a mechanism to reserve
pools of varying large sizes at boot time. Such a mechanism can also be used
to compensate in situations where one wants to use larger buffers but
defragmentation support is not (yet?) capable of reliably providing pages
of the desired sizes.
How to make this patchset work:
-------------------------------
1. Apply this patchset or do a
git pull
git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git
largeblock
(The git archive is used to keep the patchset up to date. Please send patches
against the git tree)
2. Enable LARGE_BLOCKSIZE Support
3. Compile kernel
In order to use a filesystem with a larger blocksize it needs to be formatted
for that larger blocksize. This is done using the mkfs.xxx tool for each
filesystem. Surprisingly the existing tools work without modification. These
formatting tools may warn you that the blocksize you specify is not supported
on your particular architecture. Ignore that warning since this is no longer
true after you have applied this patchset.
Tested file systems:
Filesystem   Max Blocksize   Changes
Reiserfs     8k              Page size functions
Ext2         64k             Page size functions
XFS          64k             Page size functions / Remove PAGE_SIZE check
Ramfs        MAX_ORDER       Parameter to specify order
Todo/Issues:
- There are certainly numerous issues with this patch. I have only tested
copying files back and forth, volume creation etc. Others have run
fsxlinux on the volumes. The missing mmap support limits what can be
done for now.
- ZONE_MOVABLE is available in 2.6.23. Using kernelcore=xxx as a kernel
parameter enables an area where defragmentation can work. This may be
necessary to avoid OOMs although I have seen no problems with up to 32k
blocksize even without that measure.
- The antifragmentation patches in Andrew's tree address more fragmentation
issues. However, large orders may still lead to fragmentation
of the movable sections. Memory compaction is still not merged and will
likely be needed to reliably support even larger orders of 256k or more.
How memory compaction impacts performance still has to be determined.
- Support for bouncing pages.
- Remove PAGE_CACHE_xxx constants after using page_cache_xxx functions
everywhere. But that will have to wait until merging becomes possible.
For now certain subsystems (shmem f.e.) are not using these functions.
They will only use order 0 pages.
- Support for non-harddisk-based filesystems. Remove the pktdvd etc
layers needed because the VM currently does not support sufficiently
large blocksizes for these devices. Look for other places in the kernel
where we have similar issues.
V6->V7
- Mmap support
- Further cleanups
- Against 2.6.23-rc5
- Drop provocative ext2 patch
- Add patches to enable 64k blocksize in ext2/3 (Thanks, Mingming)
V5->V6:
- Rediff against 2.6.23-rc4
- Fix breakage introduced by updates to reiserfs
- Readahead fixes by Fengguang Wu <[email protected]>
- Provide a git tree that is kept up to date
V4->V5:
- Diff against 2.6.22-rc6-mm1
- provide test tree on ftp.kernel.org:/pub/linux
V3->V4
- It is possible to transparently make filesystems support larger
blocksizes by simply allowing larger blocksizes in set_blocksize.
Remove all special modifications for mmap etc from the filesystems.
This now makes 3 disk based filesystems that can use larger blocks
(reiser, ext2, xfs). Are there any other useful ones to make work?
- Patch against 2.6.22-rc4-mm2 which allows the use of Mel's antifrag
logic to avoid fragmentation.
- More page cache cleanup by applying the functions to filesystems.
- Disable bouncing when the gfp mask is setup.
- Disable mmap directly in mm/filemap.c to avoid filesystem changes
while we have no mmap support for higher order pages.
RFC V2->V3
- More restructuring
- It actually works!
- Add XFS support
- Fix up UP support
- Work out the direct I/O issues
- Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert
back to constants. Disabled for 32bit and HIGHMEM configurations.
This also allows a gradual migration to the new page cache
inline functions. LARGE_BLOCKSIZE capabilities can be
added gradually and if there is a problem then we can disable
a subsystem.
RFC V1->V2
- Some ext2 support
- Some block layer, fs layer support etc.
- Better page cache macros
- Use macros to clean up code.
--
On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
> 5. VM scalability
> Large block sizes mean less state keeping for the information being
> transferred. For a 1TB file one needs to handle 256 million page
> structs in the VM if one uses 4k page size. A 64k page size reduces
> that amount to 16 million. If the limitation in existing filesystems
> are removed then even higher reductions become possible. For very
> large files like that a page size of 2 MB may be beneficial which
> will reduce the number of page struct to handle to 512k. The variable
> nature of the block size means that the size can be tuned at file
> system creation time for the anticipated needs on a volume.
There is a limitation in the VM. Fragmentation. You keep saying this
is a solved issue and just assuming you'll be able to fix any cases
that come up as they happen.
I still don't get the feeling you realise that there is a fundamental
fragmentation issue that is unsolvable with Mel's approach.
The idea that there even _is_ a bug to fail when higher order pages
cannot be allocated was also brushed aside by some people at the
vm/fs summit. I don't know if those people had gone through the
math about this, but it goes somewhat like this: if you use a 64K
page size, you can "run out of memory" with 93% of your pages free.
If you use a 2MB page size, you can fail with 99.8% of your pages
still free. That's 64GB of memory used on a 32TB Altix.
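Spelling out the arithmetic (the assumption being the worst case of exactly
one pinned 4k page per higher-order block):

\[
\frac{4\,\mathrm{KiB}}{64\,\mathrm{KiB}} = 6.25\%\ \text{used} \Rightarrow 93.75\%\ \text{free},\qquad
\frac{4\,\mathrm{KiB}}{2\,\mathrm{MiB}} \approx 0.2\%\ \text{used} \Rightarrow 99.8\%\ \text{free},\qquad
0.2\% \times 32\,\mathrm{TB} = 64\,\mathrm{GB}\ \text{pinned}.
\]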
If you don't consider that is a problem because you don't care about
theoretical issues or nobody has reported it from running -mm
kernels, then I simply can't argue against that on a technical basis.
But I'm totally against introducing known big fundamental problems to
the VM at this stage of the kernel. God knows how long it takes to ever
fix them in future after they have become pervasive throughout the
kernel.
IMO the only thing that higher order pagecache is good for is a quick
hack for filesystems to support larger block sizes. And after seeing it
is fairly ugly to support mmap, I'm not even really happy for it to do
that.
If VM scalability is a problem, then it needs to be addressed in other
areas anyway for order-0 pages, and if contiguous pages helps IO
scalability or crappy hardware, then there is nothing stopping us from
*attempting* to get contiguous memory in the current scheme.
Basically, if you're placing your hopes for VM and IO scalability on this,
then I think that's a totally broken thing to do and will end up making
the kernel worse in the years to come (except maybe on some poor
configurations of bad hardware).
On Tue, Sep 11, 2007 at 04:52:19AM +1000, Nick Piggin wrote:
> The idea that there even _is_ a bug to fail when higher order pages
> cannot be allocated was also brushed aside by some people at the
> vm/fs summit. I don't know if those people had gone through the
> math about this, but it goes somewhat like this: if you use a 64K
> page size, you can "run out of memory" with 93% of your pages free.
> If you use a 2MB page size, you can fail with 99.8% of your pages
> still free. That's 64GB of memory used on a 32TB Altix.
>
> If you don't consider that is a problem because you don't care about
> theoretical issues or nobody has reported it from running -mm
The -mm kernels also forbid mmap, so there's no chance the large pages are
mlocked etc... that's not the final thing that is being measured.
> kernels, then I simply can't argue against that on a technical basis.
> But I'm totally against introducing known big fundamental problems to
> the VM at this stage of the kernel. God knows how long it takes to ever
> fix them in future after they have become pervasive throughout the
> kernel.
Seconded.
> IMO the only thing that higher order pagecache is good for is a quick
> hack for filesystems to support larger block sizes. And after seeing it
> is fairly ugly to support mmap, I'm not even really happy for it to do
> that.
Additionally I feel the ones that will get the main advantage from the
quick hack are the crippled devices that are ~30% slower if the SG
tables are large.
> If VM scalability is a problem, then it needs to be addressed in other
> areas anyway for order-0 pages, and if contiguous pages helps IO
> scalability or crappy hardware, then there is nothing stopping us from
Yep.
> *attempting* to get contiguous memory in the current scheme.
>
> Basically, if you're placing your hopes for VM and IO scalability on this,
> then I think that's a totally broken thing to do and will end up making
> the kernel worse in the years to come (except maybe on some poor
> configurations of bad hardware).
Agreed. For my part I am really convinced the only sane way to
approach the VM scalability and larger physically contiguous pages
problem is the CONFIG_PAGE_SHIFT patch (aka large PAGE_SIZE from Hugh
for 2.4). I also have to say I always disliked the PAGE_CACHE_SIZE
definition too ;). I take it only as an attempt at documentation.
Furthermore all the issues with writeprotect faults over MAP_PRIVATE
regions will have to be addressed the same way with both approaches if
we want real 100% 4k-granular backwards compatibility.
On this topic I'm also going to suggest that the cpu vendors add a 64k
tlb using the reserved 62nd bitflag in the pte (right after the NX
bit). So if alignment allows we can map pagecache with a 64k large tlb
on x86 (with a PAGE_SIZE of 64k), mixing it with the 4k tlb in the
same address space if userland alignment forbids using the 64k tlb. If
we want to break backwards compatibility and force all alignments on
64k and get rid of any 4k tlb to simplify the page fault code we can
do it later anyway... No idea if this is feasible to achieve on the
hardware level though, it's not my problem anyway to judge this ;). As
constraints to the hardware interface it would be ok to require the
62nd 64k-tlb bitflag to be only available on the pte that would have
normally mapped a physical address 64k naturally aligned, and to
require all later overlapping 4k ptes to be set to 0. If you've better
ideas to achieve this than my interface please let me know.
And if I'm terribly wrong and the variable order pagecache is the way
to go for the long run, the 64k tlb feature will fit in that model
very nicely too.
The reason for the 64k magic number is that this is the minimum unit of
contiguous I/O required to reach platter speed on most devices out
there. And it incidentally also matches ppc64 ;).
On Tue, 11 September 2007 04:52:19 +1000, Nick Piggin wrote:
> On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
>
> > 5. VM scalability
> > Large block sizes mean less state keeping for the information being
> > transferred. For a 1TB file one needs to handle 256 million page
> > structs in the VM if one uses 4k page size. A 64k page size reduces
> > that amount to 16 million. If the limitation in existing filesystems
> > are removed then even higher reductions become possible. For very
> > large files like that a page size of 2 MB may be beneficial which
> > will reduce the number of page struct to handle to 512k. The variable
> > nature of the block size means that the size can be tuned at file
> > system creation time for the anticipated needs on a volume.
>
> The idea that there even _is_ a bug to fail when higher order pages
> cannot be allocated was also brushed aside by some people at the
> vm/fs summit. I don't know if those people had gone through the
> math about this, but it goes somewhat like this: if you use a 64K
> page size, you can "run out of memory" with 93% of your pages free.
> If you use a 2MB page size, you can fail with 99.8% of your pages
> still free. That's 64GB of memory used on a 32TB Altix.
While I agree with your concern, those numbers are quite silly. The
chances of 99.8% of pages being free and the remaining 0.2% being
perfectly spread across all 2MB large_pages are lower than those of SHA1
creating a collision. I don't see anyone abandoning git or rsync, so
your extreme example clearly is the wrong one.
Again, I agree with your concern, even though your example makes it look
silly.
Jörn
--
You can't tell where a program is going to spend its time. Bottlenecks
occur in surprising places, so don't try to second guess and put in a
speed hack until you've proven that's where the bottleneck is.
-- Rob Pike
On Tuesday 11 September 2007 22:12, Jörn Engel wrote:
> On Tue, 11 September 2007 04:52:19 +1000, Nick Piggin wrote:
> > On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
> > > 5. VM scalability
> > > Large block sizes mean less state keeping for the information being
> > > transferred. For a 1TB file one needs to handle 256 million page
> > > structs in the VM if one uses 4k page size. A 64k page size reduces
> > > that amount to 16 million. If the limitation in existing filesystems
> > > are removed then even higher reductions become possible. For very
> > > large files like that a page size of 2 MB may be beneficial which
> > > will reduce the number of page struct to handle to 512k. The
> > > variable nature of the block size means that the size can be tuned at
> > > file system creation time for the anticipated needs on a volume.
> >
> > The idea that there even _is_ a bug to fail when higher order pages
> > cannot be allocated was also brushed aside by some people at the
> > vm/fs summit. I don't know if those people had gone through the
> > math about this, but it goes somewhat like this: if you use a 64K
> > page size, you can "run out of memory" with 93% of your pages free.
> > If you use a 2MB page size, you can fail with 99.8% of your pages
> > still free. That's 64GB of memory used on a 32TB Altix.
>
> While I agree with your concern, those numbers are quite silly. The
They are the theoretical worst case. Obviously with a non-trivially
sized system and non-DoS workload, they will not be reached.
> chances of 99.8% of pages being free and the remaining 0.2% being
> perfectly spread across all 2MB large_pages are lower than those of SHA1
> creating a collision. I don't see anyone abandoning git or rsync, so
> your extreme example clearly is the wrong one.
>
> Again, I agree with your concern, even though your example makes it look
> silly.
It is not simply a question of once-off chance for an all-at-once layout
to fail in this way. Fragmentation slowly builds over time, and especially
if you do actually use higher-order pages for a significant number of
things (unlike we do today), then the problem will become worse. If you
have any part of your workload that is affected by fragmentation, then
it will cause unfragmented regions to eventually be used for fragmentation
inducing allocations (by definition -- if it did not, eg. then there would be
no fragmentation problem and no need for Mel's patches).
I don't know what happens as time tends towards infinity, but I don't think
it will be good.
At millions of allocations per second, how long does it take to produce
an unacceptable number of free pages before the ENOMEM condition?
Furthermore, what *is* an unacceptable number? I don't know. I am not
trying to push this feature in, so the burden is not mine to make sure it
is OK.
Yes, we already have some of these problems today. Introducing more
and worse problems and justifying them because of existing ones is much
more silly than my quoting of the numbers. IMO.
On Tue, 2007-09-11 at 04:52 +1000, Nick Piggin wrote:
> On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
>
> > 5. VM scalability
> > Large block sizes mean less state keeping for the information being
> > transferred. For a 1TB file one needs to handle 256 million page
> > structs in the VM if one uses 4k page size. A 64k page size reduces
> > that amount to 16 million. If the limitation in existing filesystems
> > are removed then even higher reductions become possible. For very
> > large files like that a page size of 2 MB may be beneficial which
> > will reduce the number of page struct to handle to 512k. The variable
> > nature of the block size means that the size can be tuned at file
> > system creation time for the anticipated needs on a volume.
>
> There is a limitation in the VM. Fragmentation. You keep saying this
> is a solved issue and just assuming you'll be able to fix any cases
> that come up as they happen.
>
> I still don't get the feeling you realise that there is a fundamental
> fragmentation issue that is unsolvable with Mel's approach.
>
I thought we had discussed this already at VM and reached something
resembling a conclusion. It was acknowledged that depending on
contiguous allocations to always succeed will get a caller into trouble
and they need to deal with fallback - whether the problem was
theoretical or not. It was also strongly pointed out that the large
block patches as presented would be vulnerable to that problem.
The alternatives were fs-block and increasing the size of order-0. It
was felt that fs-block was far away because it's complex and I thought
that increasing the pagesize like what Andrea suggested would lead to
internal fragmentation problems. Regrettably we didn't discuss Andrea's
approach in depth.
I *thought* that the end conclusion was that we would go with
Christoph's approach pending two things being resolved;
o mmap() support that we agreed on is good
o A clear statement, with logging maybe for users that mounted a large
block filesystem that it might blow up and they get to keep both parts
when it does. Basically, for now it's only suitable in specialised
environments.
I also thought there was an acknowledgement that long-term, fs-block was
the way to go - possibly using contiguous pages optimistically instead
of virtual mapping the pages. At that point, it would be a general
solution and we could remove the warnings.
Basically, to start out with, this was going to be an SGI-only thing so
they get to rattle out the issues we expect to encounter with large
blocks and help steer the direction of the
more-complex-but-safer-overall fs-block.
> The idea that there even _is_ a bug to fail when higher order pages
> cannot be allocated was also brushed aside by some people at the
> vm/fs summit.
When that brushing occurred, I thought I made it very clear what the
expectations were and that without fallback they would be taking a risk.
I am not sure if that message actually sank in or not.
That said, the filesystem people can experiment to some extent against
Christoph's approach as long as they don't think they are 100% safe.
Again, their experimenting will help steer the direction of fs-block.
>
> I don't know if those people had gone through the
> math about this, but it goes somewhat like this: if you use a 64K
> page size, you can "run out of memory" with 93% of your pages free.
> If you use a 2MB page size, you can fail with 99.8% of your pages
> still free. That's 64GB of memory used on a 32TB Altix.
>
That's the absolute worst case but yes, in theory this can occur and
it's safest to assume the situation will occur somewhere to someone. It
would be difficult to craft an attack to do it but conceivably a machine
running for a long enough time would trigger it particularly if the
large block allocations are GFP_NOIO or GFP_NOFS.
> If you don't consider that is a problem because you don't care about
> theoretical issues or nobody has reported it from running -mm
> kernels, then I simply can't argue against that on a technical basis.
The -mm kernels have patches related to watermarking that will not be
making it to mainline for reasons we don't need to revisit right now.
The lack of the watermarking patches may turn out to be a non-issue but
the point is that what's in mainline is not exactly the same as -mm and
mainline will be running for longer periods of time in a different
environment.
Where we expected to see the use of this patchset was in specialised
environments *only*. The SGI people can mitigate their mixed
fragmentation problems somewhat by setting slub_min_order ==
large_block_order so that blocks get allocated and freed at the same
size. This is a partial way towards Andrea's solution of raising the size
of an order-0 allocation. The point of printing out the warnings at
mount time was not so much for a general user who may miss the logs but
for distributions that consider turning large block use on by default to
discourage them until such time as we have proper fallback in place.
> But I'm totally against introducing known big fundamental problems to
> the VM at this stage of the kernel. God knows how long it takes to ever
> fix them in future after they have become pervasive throughout the
> kernel.
>
> IMO the only thing that higher order pagecache is good for is a quick
> hack for filesystems to support larger block sizes. And after seeing it
> is fairly ugly to support mmap, I'm not even really happy for it to do
> that.
>
If the mmap() support is poor and going to be an obstacle in the future,
then that is a reason to hold it up. I haven't actually read the mmap()
support patch yet so I have no worthwhile opinion yet.
If the mmap() mess can be agreed on, the large block patchset as it is
could give us important information from the users willing to deal with
this risk about what sort of behaviour to expect. If they find it fails
all the time, then fs-block having the complexity of optimistically
using large pages is not worthwhile either. That is useful data.
> If VM scalability is a problem, then it needs to be addressed in other
> areas anyway for order-0 pages, and if contiguous pages helps IO
> scalability or crappy hardware, then there is nothing stopping us from
> *attempting* to get contiguous memory in the current scheme.
>
This was also brought up at VM Summit but for the benefit of the people
that were not there;
It was emphasised that large block support is not the solution to all
scalability problems. There was a strong emphasis that fixing up the
order-0 uses should be encouraged. In particular, readahead should be
batched so that each page is not individually locked. There were also
other page-related operations that should be done in batch. On a similar
note, it was pointed out that dcache lookup is something that should be
scaled better - possibly before spending too much time on things like
page cache or radix locks.
For scalability, it was also pointed out at some point that heavy users
of large blocks may now find themselves contending on the zone->lock and
they might well find that order-0 pages were what they wanted to use
anyway.
> Basically, if you're placing your hopes for VM and IO scalability on this,
> then I think that's a totally broken thing to do and will end up making
> the kernel worse in the years to come (except maybe on some poor
> configurations of bad hardware).
My magic 8-ball is in the garage.
I thought the following plan was sane but I could be la-la
1. Go with large block + explosions to start with
- Second class feature at this point, not fully supported
- Experiment in different places to see what it gains (if anything)
2. Get fs-block in slowly over time with the fallback options replacing
Christophs patches bit by bit
3. Kick away warnings
- First class feature at this point, fully supported
Independently of that, we would work on order-0 scalability,
particularly readahead and batching operations on ranges of pages as
much as possible.
--
Mel "la-la" Gorman
Nick Piggin <[email protected]> writes:
> On Tuesday 11 September 2007 22:12, Jörn Engel wrote:
>> On Tue, 11 September 2007 04:52:19 +1000, Nick Piggin wrote:
>> > On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
>> > > 5. VM scalability
>> > > Large block sizes mean less state keeping for the information being
>> > > transferred. For a 1TB file one needs to handle 256 million page
>> > > structs in the VM if one uses 4k page size. A 64k page size reduces
>> > > that amount to 16 million. If the limitation in existing filesystems
>> > > are removed then even higher reductions become possible. For very
>> > > large files like that a page size of 2 MB may be beneficial which
>> > > will reduce the number of page struct to handle to 512k. The
>> > > variable nature of the block size means that the size can be tuned at
>> > > file system creation time for the anticipated needs on a volume.
>> >
>> > The idea that there even _is_ a bug to fail when higher order pages
>> > cannot be allocated was also brushed aside by some people at the
>> > vm/fs summit. I don't know if those people had gone through the
>> > math about this, but it goes somewhat like this: if you use a 64K
>> > page size, you can "run out of memory" with 93% of your pages free.
>> > If you use a 2MB page size, you can fail with 99.8% of your pages
>> > still free. That's 64GB of memory used on a 32TB Altix.
>>
>> While I agree with your concern, those numbers are quite silly. The
>
> They are the theoretical worst case. Obviously with a non trivially
> sized system and non-DoS workload, they will not be reached.
I would think it should be pretty hard to have only one page out of
each 2MB chunk allocated and non-evictable (writeable, swappable or
movable). Wouldn't that require some kernel driver to allocate all
pages and then selectively free them in such a pattern as to keep one
page per 2MB chunk?
Assuming nothing tries to allocate a large chunk of ram while holding
too many locks for the kernel to free it.
>> chances of 99.8% of pages being free and the remaining 0.2% being
>> perfectly spread across all 2MB large_pages are lower than those of SHA1
>> creating a collision. I don't see anyone abandoning git or rsync, so
>> your extreme example clearly is the wrong one.
>>
>> Again, I agree with your concern, even though your example makes it look
>> silly.
>
> It is not simply a question of once-off chance for an all-at-once layout
> to fail in this way. Fragmentation slowly builds over time, and especially
> if you do actually use higher-order pages for a significant number of
> things (unlike we do today), then the problem will become worse. If you
> have any part of your workload that is affected by fragmentation, then
> it will cause unfragmented regions to eventually be used for fragmentation
> inducing allocations (by definition -- if it did not, eg. then there would be
> no fragmentation problem and no need for Mel's patches).
It might be naive (stop me as soon as I go into dream world) but I
would think there are two kinds of fragmentation:
Hard fragments - physical pages the kernel can't move around
Soft fragments - virtual pages/cache that happen to cause a fragment
I would further assume most ram is used on soft fragments and that the
kernel will free them up by flushing or swapping the data when there
is sufficient need. With defragmentation support the kernel could
prevent some flushings or swapping by moving the data from one
physical page to another. But that would just reduce unnecessary work
and not change the availability of larger pages.
Further I would assume that there are two kinds of hard fragments:
Fragments allocated once at start time and temporary fragments.
At boot time (or when a module is loaded or something) you get a tiny
amount of ram allocated that will remain busy for basically ever. You
get some fragmentation right there that you can never get rid of.
At runtime a lot of pages are allocated and quickly freed again. They
preferably get positioned in regions where there already is
fragmentation, in regions where there are suitably sized holes
already. They would only break a free 2MB chunk into smaller chunks if
there is no small hole to be found.
Now a trick I would use is to put kernel allocated pages at one end of
the ram and virtual/cache pages at the other end. Small kernel allocs
would find holes at the start of the ram while big allocs would have
to move more to the middle or end of the ram to find a large enough
hole. And virtual/cache pages could always be cleared out to free
large contiguous chunks.
Splitting the two types would prevent fragmentation of freeable and
not freeable regions, always giving us a large pool to pull compound
pages from.
One could also split the ram into regions of different page sizes,
meaning that some large compound pages may not be split below a
certain limit. E.g. some amount of ram would be reserved for chunks
>=64k only. This should be configurable via sysfs.
> I don't know what happens as time tends towards infinity, but I don't think
> it will be good.
It depends on the lifetime of the allocations. If the lifetime is
uniform enough then larger chunks of memory allocated for small
objects will always be freed after a short time. If the lifetime
varies widely then it can happen that one page of a larger chunk
remains busy far longer than the rest causing fragmentation.
I'm hoping that we don't have such wide variance in lifetime that we
run into an ENOMEM. I'm hoping allocation and freeing are not random
events that would result in an expectancy of an infinite number of
allocations being alive as time tends towards infinity. I'm hoping
there is enough dependence between the two to impose an upper limit on
the fragmentation.
> At millions of allocations per second, how long does it take to produce
> an unacceptable number of free pages before the ENOMEM condition?
> Furthermore, what *is* an unacceptable number? I don't know. I am not
> trying to push this feature in, so the burden is not mine to make sure it
> is OK.
I think the only acceptable solution is to have the code cope with
large pages being unavailable and use multiple smaller chunks instead
in a tight spot. By all means try to use a large contiguous chunk but
never fail just because we are too fragmented. I'm sure modern systems
with 4+GB ram will not run into the wall but I'm equally sure older
systems with as little as 64MB quickly will. Handling the fragmented
case is the only way to make sure we keep running.
> Yes, we already have some of these problems today. Introducing more
> and worse problems and justifying them because of existing ones is much
> more silly than my quoting of the numbers. IMO.
MfG
Goswin
Hi Mel,
On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> that increasing the pagesize like what Andrea suggested would lead to
> internal fragmentation problems. Regrettably we didn't discuss Andrea's
The config_page_shift approach guarantees that kernel stacks and whatever
other non-defragmentable allocations there are go into the same 64k "not
defragmentable" page. Not like with the SGI design, where an 8k kernel stack
could be allocated in the first 64k page, and then another 8k stack
could be allocated in the next 64k page, effectively pinning all 64k
pages until Nick's worst case scenario triggers.
What I said at the VM summit is that your reclaim-defrag patch in the
slub isn't necessarily entirely useless with config_page_shift,
because the larger the software page_size, the more partial pages we
could find in the slab, so to save some memory if there are tons of
pages very partially used, we could free some of them.
But the whole point is that with the config_page_shift, Nick's worst
case scenario can't happen by design regardless of defrag or not
defrag. While it can _definitely_ happen with SGI design (regardless
of any defrag thing). We can still try to save some memory by
defragging the slab a bit, but it's by far *not* required with
config_page_shift. No defrag at all is required in fact.
Plus there's a cost in defragging and freeing cache... the more you
need defrag, the slower the kernel will be.
> approach in depth.
Well it wasn't my fault if we didn't discuss it in depth though. I
tried to discuss it in all possible occasions where I was suggested to
talk about it and where it was somewhat on topic. Given I wasn't even
invited to the KS, I felt it would not be appropriate for me to try to
monopolize the VM summit according to my agenda. So I happily listened
to what the top kernel developers are planning ;), while giving
some hints on what I think the right direction is instead.
> I *thought* that the end conclusion was that we would go with
Frankly I don't care what the end conclusion was.
> Christoph's approach pending two things being resolved;
>
> o mmap() support that we agreed on is good
Let's see how good the mmap support for variable order page size will
work after the 2 weeks...
> o A clear statement, with logging maybe for users that mounted a large
> block filesystem that it might blow up and they get to keep both parts
> when it does. Basically, for now it's only suitable in specialised
> environments.
Yes, but perhaps you missed that such printk is needed exactly to
provide proof that SGI design is the wrong way and it needs to be
dumped. If that printk ever triggers it means you were totally wrong.
> I also thought there was an acknowledgement that long-term, fs-block was
> the way to go - possibly using contiguous pages optimistically instead
> of virtual mapping the pages. At that point, it would be a general
> solution and we could remove the warnings.
fsblock should stack on top of config_page_shift simply. Both are
needed. You don't want to use 64k pages on a laptop but you may want a
larger blocksize for the btrees etc... if you've a large harddisk and
not much ram.
> That's the absolute worst case but yes, in theory this can occur and
> it's safest to assume the situation will occur somewhere to someone. It
Do you agree this worst case can't happen with config_page_shift?
> Where we expected to see the the use of this patchset was in specialised
> environments *only*. The SGI people can mitigate their mixed
> fragmentation problems somewhat by setting slub_min_order ==
> large_block_order so that blocks get allocated and freed at the same
> size. This is partial way towards Andrea's solution of raising the size
> of an order-0 allocation. The point of printing out the warnings at
Except you don't get the full benefits of it...
Even if I could end up mapping 4k kmalloced entries in userland for
the tail packing, that IMHO would still be a preferable solution to
keeping the base page small and making a hard effort to create large
pages out of small pages. The approach I advocate keeps the base page
big and the fast path fast, and it rather does some work to split the
base pages outside the buddy for the small files.
All your defrag work is still good to have, like I said at the VM
summit if you remember, to grow the hugetlbfs at runtime etc... I'd just
rather avoid depending on it to avoid I/O failure in the presence of
mlocked pagecache, for example.
> Independently of that, we would work on order-0 scalability,
> particularly readahead and batching operations on ranges of pages as
> much as possible.
That's pretty much unnecessary logic if the order-0 pages become
larger.
On Wednesday 12 September 2007 01:36, Mel Gorman wrote:
> On Tue, 2007-09-11 at 04:52 +1000, Nick Piggin wrote:
> > On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
> > > 5. VM scalability
> > > Large block sizes mean less state keeping for the information being
> > > transferred. For a 1TB file one needs to handle 256 million page
> > > structs in the VM if one uses 4k page size. A 64k page size reduces
> > > that amount to 16 million. If the limitation in existing filesystems
> > > are removed then even higher reductions become possible. For very
> > > large files like that a page size of 2 MB may be beneficial which
> > > will reduce the number of page struct to handle to 512k. The
> > > variable nature of the block size means that the size can be tuned at
> > > file system creation time for the anticipated needs on a volume.
> >
> > There is a limitation in the VM. Fragmentation. You keep saying this
> > is a solved issue and just assuming you'll be able to fix any cases
> > that come up as they happen.
> >
> > I still don't get the feeling you realise that there is a fundamental
> > fragmentation issue that is unsolvable with Mel's approach.
>
> I thought we had discussed this already at VM and reached something
> resembling a conclusion. It was acknowledged that depending on
> contiguous allocations to always succeed will get a caller into trouble
> and they need to deal with fallback - whether the problem was
> theoritical or not. It was also strongly pointed out that the large
> block patches as presented would be vunerable to that problem.
Well Christoph seems to still be spinning them as a solution for VM
scalability and first class support for making contiguous IOs, large
filesystem block sizes etc.
At the VM summit I think the conclusion was that grouping by
mobility could be merged. I'm still not thrilled by that, but I was
going to get steamrolled[*] anyway... and seeing as the userspace
hugepages is a relatively demanded workload and can be
implemented in this way with basically no other changes to the
kernel and already must have fallbacks.... then that's actually a
reasonable case for it.
The higher order pagecache, again I'm just going to get steamrolled
on, and it actually isn't so intrusive minus the mmap changes, so I
didn't have much to reasonably say there.
And I would have kept quiet this time too, except for the worrying idea
to use higher order pages to fix the SLUB vs SLAB regression, and if
the rationale for this patchset was more realistic.
[*] And I don't say steamrolled because I'm bitter and twisted :) I
personally want the kernel to be perfect. But I realise it already isn't
and for practical purposes people want these things, so I accept
being overruled, no problem. The fact simply is -- I would have been
steamrolled I think :P
> The alternatives were fs-block and increasing the size of order-0. It
> was felt that fs-block was far away because it's complex and I thought
> that increasing the pagesize like what Andrea suggested would lead to
> internal fragmentation problems. Regrettably we didn't discuss Andrea's
> approach in depth.
Sure. And some people run workloads where fragmentation is likely never
going to be a problem, they are shipping this poorly configured hardware
now or soon, so they don't have too much interest in doing it right at this
point, rather than doing it *now*. OK, that's a valid reason which is why I
don't use the argument that we should do it correctly or never at all.
> I *thought* that the end conclusion was that we would go with
> Christoph's approach pending two things being resolved;
>
> o mmap() support that we agreed on is good
In theory (and again for the filesystem guys who don't have to worry about
it). In practice after seeing the patch it's not a nice thing for the VM to
have to do.
> I also thought there was an acknowledgement that long-term, fs-block was
> the way to go - possibly using contiguous pages optimistically instead
> of virtual mapping the pages. At that point, it would be a general
> solution and we could remove the warnings.
I guess it is still in the air. I personally think a vmapping approach and/or
teaching filesystems to do some nonlinear block metadata access is the
way to go (strangely, this happens to be one of the fsblock paradigms!).
OTOH, I'm not sure how much buy-in there was from the filesystems guys.
Particularly Christoph H and XFS (which is strange because they already do
vmapping in places).
That's understandable though. It is a lot of work for filesystems. But the
reason I think it is the correct approach for larger block than soft-page
size is that it doesn't have fundamental issues (assuming that virtually
mapping the entire kernel is off the table).
> Basically, to start out with, this was going to be an SGI-only thing so
> they get to rattle out the issues we expect to encounter with large
> blocks and help steer the direction of the
> more-complex-but-safer-overall fs-block.
That's what I expected, but it seems from the descriptions in the patches
that it is also supposed to cure cancer :)
> > The idea that there even _is_ a bug to fail when higher order pages
> > cannot be allocated was also brushed aside by some people at the
> > vm/fs summit.
>
> When that brushing occured, I thought I made it very clear what the
> expectations were and that without fallback they would be taking a risk.
> I am not sure if that message actually sank in or not.
No, you have been good about that aspect. I wasn't trying to point to you
at all here.
> > I don't know if those people had gone through the
> > math about this, but it goes somewhat like this: if you use a 64K
> > page size, you can "run out of memory" with 93% of your pages free.
> > If you use a 2MB page size, you can fail with 99.8% of your pages
> > still free. That's 64GB of memory used on a 32TB Altix.
>
> That's the absolute worst case but yes, in theory this can occur and
> it's safest to assume the situation will occur somewhere to someone. It
> would be difficult to craft an attack to do it but conceivably a machine
> running for a long enough time would trigger it particularly if the
> large block allocations are GFP_NOIO or GFP_NOFS.
It would be interesting to craft an attack. If you knew roughly the layout
and size of your dentry slab for example... maybe you could stat a whole
lot of files, then open one and keep it open (maybe post the fd to a unix
socket or something crazy!) when you think you have filled up a couple
of MB worth of them. Repeat the process until your movable zone is
gone. Or do the same things with pagetables, or task structs, or radix
tree nodes, etc.. these are the kinds of things I worry about (as well as
just the gradual natural degradation).
Yeah, it might be reasonably possible to make an attack that would
deplete most of higher order allocations while pinning somewhat close
to just the theoretical minimum required.
[snip]
Thanks Mel. Fairly good summary I think.
> > Basically, if you're placing your hopes for VM and IO scalability on
> > this, then I think that's a totally broken thing to do and will end up
> > making the kernel worse in the years to come (except maybe on some poor
> > configurations of bad hardware).
>
> My magic 8-ball is in the garage.
>
> I thought the following plan was sane but I could be la-la
>
> 1. Go with large block + explosions to start with
> - Second class feature at this point, not fully supported
> - Experiment in different places to see what it gains (if anything)
> 2. Get fs-block in slowly over time with the fallback options replacing
> Christophs patches bit by bit
> 3. Kick away warnings
> - First class feature at this point, fully supported
I guess that was my hope. The only problem I have with a 2nd class
higher order pagecache on a *practical* technical issue is introducing
more complexity in the VM for mmap. Andrea and Hugh are probably
more guardians of that area of code than I, so if they're happy with the
mmap stuff then again I can accept being overruled on this ;)
Then I would love to say #2 will go ahead (and I hope it would), but I
can't force it down the throat of the filesystem maintainers just like I
feel they can't force vm devs (me) to do a virtually mapped and
defrag-able kernel :) Basically I'm trying to practice what I preach and
I don't want to force fsblock onto anyone.
Maybe when ext2 is converted and if I can show it isn't a performance
problem / too much complexity then I'll have another leg to stand on
here... I don't know.
> Independently of that, we would work on order-0 scalability,
> particularly readahead and batching operations on ranges of pages as
> much as possible.
Definitely. Also, aops capable of spanning multiple pages, batching of
large write(2) pagecache insertion, etc all are things we must go after,
regardless of the large page and/or block size work.
On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote:
> Hi Mel,
>
Hi,
> On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> > that increasing the pagesize like what Andrea suggested would lead to
> > internal fragmentation problems. Regrettably we didn't discuss Andrea's
>
> The config_page_shift guarantees the kernel stacks or whatever not
> defragmentable allocation other allocation goes into the same 64k "not
> defragmentable" page. Not like with SGI design that a 8k kernel stack
> could be allocated in the first 64k page, and then another 8k stack
> could be allocated in the next 64k page, effectively pinning all 64k
> pages until Nick worst case scenario triggers.
>
In practice, it's pretty difficult to trigger. Buddy allocators always
try and use the smallest possible sized buddy to split. Once a 64K block is
split for a 4K or 8K allocation, the remainder of that block will be
used for other 4K, 8K, 16K, 32K allocations. The situation where
multiple 64K blocks get split does not occur.
Now, the worst case scenario for your patch is that a hostile process
allocates a large amount of memory and mlocks() one 4K page per 64K chunk
(this is unlikely in practice I know). The end result is you have many
64KB regions that are now unusable because 4K is pinned in each of them.
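For illustration, here is a minimal userspace sketch of that pattern (hedged:
it only shows the allocation pattern being described, and it would need a
sufficient RLIMIT_MEMLOCK or root privileges to actually pin this much):

	#include <sys/mman.h>
	#include <unistd.h>

	#define CHUNK	(64 * 1024)
	#define PAGE	(4 * 1024)

	int main(void)
	{
		size_t len = (size_t)1024 * 1024 * 1024;	/* 1GB region */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		size_t off;

		if (p == MAP_FAILED)
			return 1;

		/* pin one 4k page in every 64k chunk of the region */
		for (off = 0; off < len; off += CHUNK)
			mlock(p + off, PAGE);

		pause();	/* keep the pins alive */
		return 0;
	}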
Your approach is not immune from problems either. To me, only Nick's
approach is bullet-proof in the long run.
> What I said at the VM summit is that your reclaim-defrag patch in the
> slub isn't necessarily entirely useless with config_page_shift,
> because the larger the software page_size, the more partial pages we
> could find in the slab, so to save some memory if there are tons of
> pages very partially used, we could free some of them.
>
This is true. Slub targeted reclaim (Christoph's work) is useful
independent of this current problem.
> But the whole point is that with the config_page_shift, Nick's worst
> case scenario can't happen by design regardless of defrag or not
> defrag.
I agree with this. It's why I thought Nick's approach was where we were
going to finish up ultimately.
> While it can _definitely_ happen with SGI design (regardless
> of any defrag thing).
I have never stated that the SGI design is immune from this problem.
> We can still try to save some memory by
> defragging the slab a bit, but it's by far *not* required with
> config_page_shift. No defrag at all is required infact.
>
You will need some sort of defragmentation to deal with internal
fragmentation. It's a very similar problem to blasting away at slab
pages and still not being able to free them because objects are in use.
Replace "slab" with "large page" and "object" with "4k page" and the
issues are similar.
> Plus there's a cost in defragging and freeing cache... the more you
> need defrag, the slower the kernel will be.
>
> > approach in depth.
>
> Well it wasn't my fault if we didn't discuss it in depth though.
If it's my fault, sorry about that. It wasn't my intention.
> I
> tried to discuss it in all possible occasions where I was suggested to
> talk about it and where it was somewhat on topic.
Who said it was off-topic? Again, if this was me, sorry - you should
have chucked something at my head to shut me up.
> Given I wasn't even
> invited at the KS, I felt it would not be appropriate for me to try to
> monopolize the VM summit according to my agenda. So I happily listened
> to what the top kernel developers are planning ;), while giving
> some hints on what I think the right direction is instead.
>
Right, clearly we failed or at least had sub-optimal results discussing
this one at VM Summit. Good job we have mail to pick up the stick with.
> > I *thought* that the end conclusion was that we would go with
>
> Frankly I don't care what the end conclusion was.
>
heh. Well we need to come to some sort of conclusion here or this will
go around the merry-go-round till we're all bald.
> > Christoph's approach pending two things being resolved;
> >
> > o mmap() support that we agreed on is good
>
> Let's see how good the mmap support for variable order page size will
> work after the 2 weeks...
>
Ok, I'm ok with that.
> > o A clear statement, with logging maybe for users that mounted a large
> > block filesystem that it might blow up and they get to keep both parts
> > when it does. Basically, for now it's only suitable in specialised
> > environments.
>
> Yes, but perhaps you missed that such printk is needed exactly to
> provide proof that SGI design is the wrong way and it needs to be
> dumped. If that printk ever triggers it means you were totally wrong.
>
heh, I suggested printing the warning because I knew it had this
problem. The purpose in my mind was to see how far the design could be
brought before fs-block had to fill in the holes.
> > I also thought there was an acknowledgement that long-term, fs-block was
> > the way to go - possibly using contiguous pages optimistically instead
> > of virtual mapping the pages. At that point, it would be a general
> > solution and we could remove the warnings.
>
> fsblock should stack on top of config_page_shift simply.
It should be able to stack on top of either approach and arguably
setting slub_min_order=large_block_order with large block filesystems is
90% of your approach anyway.
> Both are
> needed. You don't want to use 64k pages on a laptop but you may want a
> larger blocksize for the btrees etc... if you've a large harddisk and
> not much ram.
I am still failing to see what happens when there are pagetable pages,
slab objects or mlocked 4k pages pinning the 64K pages and you need to
allocate another 64K page for the filesystem. I *think* you deadlock in
a similar fashion to Christoph's approach but the shape of the problem
is different because we are dealing with internal instead of external
fragmentation. Am I wrong?
> > That's the absolute worst case but yes, in theory this can occur and
> > it's safest to assume the situation will occur somewhere to someone. It
>
> Do you agree this worst case can't happen with config_page_shift?
>
Yes. I just think you have a different worst case that is just as bad.
> > Where we expected to see the the use of this patchset was in specialised
> > environments *only*. The SGI people can mitigate their mixed
> > fragmentation problems somewhat by setting slub_min_order ==
> > large_block_order so that blocks get allocated and freed at the same
> > size. This is a partial way towards Andrea's solution of raising the size
> > of an order-0 allocation. The point of printing out the warnings at
>
> Except you don't get all the full benefits of it...
>
> Even if I could end up mapping 4k kmalloced entries in userland for
> the tail packing, that IMHO would still be a preferable solution to
> keeping the base-page small and making a hard effort to create large
> pages out of small pages. The approach I advocate keeps the base page
> big and the fast path fast, and it rather does some work to split the
> base pages outside the buddy for the small files.
>
small files (you need something like Shaggy's page tail packing),
pagetable pages, pte pages all have to be dealt with. These are the
things I think will cause us internal fragmentation problems.
> All your defrag work is still good to have, like I said at the VM
> summit if you remember, to grow the hugetlbfs at runtime etc... I'd just
> rather avoid depending on it to avoid I/O failure in the presence of
> mlocked pagecache for example.
>
I'd rather avoid depending on it for the system to work 100% of the
time. Hence I've been saying that we need fsblock ultimately for this to
be a 100% supported feature.
> > Independently of that, we would work on order-0 scalability,
> > particularly readahead and batching operations on ranges of pages as
> > much as possible.
>
> That's pretty much unnecessary logic, if the order0 pages become
> larger.
>
Quite possibly.
--
Mel Gorman
On Wednesday 12 September 2007 04:31, Mel Gorman wrote:
> On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote:
> > Hi Mel,
>
> Hi,
>
> > On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> > > that increasing the pagesize like what Andrea suggested would lead to
> > > internal fragmentation problems. Regrettably we didn't discuss Andrea's
> >
> > The config_page_shift guarantees that kernel stacks or whatever other
> > non-defragmentable allocations go into the same 64k "not
> > defragmentable" page. Not like with the SGI design, where an 8k kernel stack
> > could be allocated in the first 64k page, and then another 8k stack
> > could be allocated in the next 64k page, effectively pinning all 64k
> > pages until Nick's worst case scenario triggers.
>
> In practice, it's pretty difficult to trigger. Buddy allocators always
> try and use the smallest possible sized buddy to split. Once a 64K is
> split for a 4K or 8K allocation, the remainder of that block will be
> used for other 4K, 8K, 16K, 32K allocations. The situation where
> multiple 64K blocks gets split does not occur.
>
> Now, the worst case scenario for your patch is that a hostile process
> allocates large amount of memory and mlocks() one 4K page per 64K chunk
> (this is unlikely in practice I know). The end result is you have many
> 64KB regions that are now unusable because 4K is pinned in each of them.
> Your approach is not immune from problems either. To me, only Nicks
> approach is bullet-proof in the long run.
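The split behaviour Mel describes above can be sketched as a toy model
(made-up orders and counts, nothing like the real allocator, just the
"use the smallest free buddy" rule):

/* Toy model of the split rule quoted above: a request takes the
 * smallest free block that fits, so one 64k block is split and its
 * remainders serve later small requests before another 64k is touched. */
#include <stdio.h>

#define ORDERS 5                        /* order 4 == 64k with 4k pages */
static int free_count[ORDERS] = { 0, 0, 0, 0, 4 };  /* four free 64k blocks */

static int toy_alloc(int order)
{
	int o;

	for (o = order; o < ORDERS && !free_count[o]; o++)
		;                       /* smallest free block >= request */
	if (o == ORDERS)
		return -1;
	free_count[o]--;
	while (o > order)
		free_count[--o]++;      /* give back the unused halves */
	return 0;
}

int main(void)
{
	toy_alloc(1);                   /* one 8k allocation */
	toy_alloc(0);                   /* then three 4k allocations */
	toy_alloc(0);
	toy_alloc(0);
	printf("untouched 64k blocks: %d of 4\n", free_count[4]);  /* prints 3 */
	return 0;
}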
One important thing I think in Andrea's case, the memory will be accounted
for (eg. we can limit mlock, or work within various memory accounting things).
With fragmentation, I suspect it will be much more difficult to do this. It
would be another layer of heuristics that will also inevitably go wrong
at times if you try to limit how much "fragmentation" a process can do.
Quite likely it is hard to make something even work reasonably well in
most cases.
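To illustrate the mlock accounting point, a minimal userspace sketch
(the 64k limit and the 1M mapping are arbitrary numbers):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
	/* Cap how much this (unprivileged) process may pin with mlock(). */
	struct rlimit rl = { .rlim_cur = 64 * 1024, .rlim_max = 64 * 1024 };
	size_t len = 1024 * 1024;       /* try to pin 1M anyway */
	void *buf;

	if (setrlimit(RLIMIT_MEMLOCK, &rl))
		perror("setrlimit");

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Exceeds RLIMIT_MEMLOCK, so the kernel refuses to pin it. */
	if (mlock(buf, len))
		printf("mlock: %s\n", strerror(errno));

	munmap(buf, len);
	return 0;
}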
> > We can still try to save some memory by
> > defragging the slab a bit, but it's by far *not* required with
> > config_page_shift. No defrag at all is required in fact.
>
> You will need to take some sort of defragmentation to deal with internal
> fragmentation. It's a very similar problem to blasting away at slab
> pages and still not being able to free them because objects are in use.
> Replace "slab" with "large page" and "object" with "4k page" and the
> issues are similar.
Well yes, and slab has issues today too with internal fragmentation,
targeted reclaim and some (small) higher order allocations too.
But at least with config_page_shift, you don't introduce _new_ sources
of problems (eg. coming from pagecache or other allocs).
Sure, there are some other things -- like pagecache can actually use
up more memory instead -- but there are a number of other positives
that Andrea's has as well. It is using order-0 pages, which are first class
throughout the VM; they have per-cpu queues, and do not require any
special reclaim code. They also *actually do* reduce the page
management overhead in the general case, unlike higher order pcache.
So combined with the accounting issues, I think it is unfair to say that
Andrea's is just moving the fragmentation to internal. It has a number
of upsides. I have no idea how it will actually behave and perform, mind
you ;)
> > Plus there's a cost in defragging and freeing cache... the more you
> > need defrag, the slower the kernel will be.
> >
> > > approach in depth.
> >
> > Well it wasn't my fault if we didn't discuss it in depth though.
>
> If it's my fault, sorry about that. It wasn't my intention.
I think it did get brushed aside a little quickly too (not blaming anyone).
Maybe because Linus was hostile. But *if* the idea is that page
management overhead has or will become a problem that needs fixing,
then neither higher order pagecache, nor (obviously) fsblock, fixes this
properly. Andrea's most definitely has the potential to.
On Tuesday 11 September 2007 05:26:05 Nick Piggin wrote:
> On Wednesday 12 September 2007 04:31, Mel Gorman wrote:
> > On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote:
> > > Hi Mel,
> >
> > Hi,
> >
> > > On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> > > > that increasing the pagesize like what Andrea suggested would lead to
> > > > internal fragmentation problems. Regrettably we didn't discuss Andrea's
> > >
> > > The config_page_shift guarantees that kernel stacks or whatever other
> > > non-defragmentable allocations go into the same 64k "not
> > > defragmentable" page. Not like with the SGI design, where an 8k kernel stack
> > > could be allocated in the first 64k page, and then another 8k stack
> > > could be allocated in the next 64k page, effectively pinning all 64k
> > > pages until Nick's worst case scenario triggers.
> >
> > In practice, it's pretty difficult to trigger. Buddy allocators always
> > try and use the smallest possible sized buddy to split. Once a 64K is
> > split for a 4K or 8K allocation, the remainder of that block will be
> > used for other 4K, 8K, 16K, 32K allocations. The situation where
> > multiple 64K blocks gets split does not occur.
> >
> > Now, the worst case scenario for your patch is that a hostile process
> > allocates large amount of memory and mlocks() one 4K page per 64K chunk
> > (this is unlikely in practice I know). The end result is you have many
> > 64KB regions that are now unusable because 4K is pinned in each of them.
> > Your approach is not immune from problems either. To me, only Nicks
> > approach is bullet-proof in the long run.
>
> One important thing I think in Andrea's case, the memory will be accounted
> for (eg. we can limit mlock, or work within various memory accounting things).
>
> With fragmentation, I suspect it will be much more difficult to do this. It
> would be another layer of heuristics that will also inevitably go wrong
> at times if you try to limit how much "fragmentation" a process can do.
> Quite likely it is hard to make something even work reasonably well in
> most cases.
>
>
> > > We can still try to save some memory by
> > > defragging the slab a bit, but it's by far *not* required with
> > > config_page_shift. No defrag at all is required in fact.
> >
> > You will need to take some sort of defragmentation to deal with internal
> > fragmentation. It's a very similar problem to blasting away at slab
> > pages and still not being able to free them because objects are in use.
> > Replace "slab" with "large page" and "object" with "4k page" and the
> > issues are similar.
>
> Well yes, and slab has issues today too with internal fragmentation,
> targeted reclaim and some (small) higher order allocations too.
> But at least with config_page_shift, you don't introduce _new_ sources
> of problems (eg. coming from pagecache or other allocs).
>
> Sure, there are some other things -- like pagecache can actually use
> up more memory instead -- but there are a number of other positives
> that Andrea's has as well. It is using order-0 pages, which are first class
> throughout the VM; they have per-cpu queues, and do not require any
> special reclaim code. They also *actually do* reduce the page
> management overhead in the general case, unlike higher order pcache.
>
> So combined with the accounting issues, I think it is unfair to say that
> Andrea's is just moving the fragmentation to internal. It has a number
> of upsides. I have no idea how it will actually behave and perform, mind
> you ;)
>
>
> > > Plus there's a cost in defragging and freeing cache... the more you
> > > need defrag, the slower the kernel will be.
> > >
> > > > approach in depth.
> > >
> > > Well it wasn't my fault if we didn't discuss it in depth though.
> >
> > If it's my fault, sorry about that. It wasn't my intention.
>
> I think it did get brushed aside a little quickly too (not blaming anyone).
> Maybe because Linus was hostile. But *if* the idea is that page
> management overhead has or will become a problem that needs fixing,
> then neither higher order pagecache, nor (obviously) fsblock, fixes this
> properly. Andrea's most definitely has the potential to.
>
Hi,
I think that the fundamental problem is not fragmentation/large pages/...
The problem is the VM itself.
The VM doesn't use virtual memory, that's all; that's the problem.
Although this will probably be Linux 3.0, I think that the right way to solve all those problems
is to make all kernel memory vmalloced (except a few areas like the kernel .text).
It will do away with the buddy allocator, remove the need for highmem, and allow allocating any amount of memory
(for example, 4k stacks will be obsolete).
It will even allow kernel memory to be swapped to disk.
This is the solution, but it is very very hard.
Best regards,
Maxim Levitsky
On Wednesday 12 September 2007 04:25, Maxim Levitsky wrote:
> Hi,
>
> I think that the fundamental problem is not fragmentation/large pages/...
>
> The problem is the VM itself.
> The VM doesn't use virtual memory, that's all; that's the problem.
> Although this will probably be Linux 3.0, I think that the right way to
> solve all those problems is to make all kernel memory vmalloced (except a
> few areas like the kernel .text).
>
> It will do away with the buddy allocator, remove the need for highmem,
> and allow allocating any amount of memory (for example, 4k
> stacks will be obsolete)
> It will even allow kernel memory to be swapped to disk.
>
> This is the solution, but it is very very hard.
I'm not sure that it is too hard. OK it is far from trivial...
This is not a new idea though, it has been floated around for a long
time (since before Linux I'm sure, although I have no references).
There are lots of reasons why such an approach has fundamental
performance problems too, however. Your kernel can't use huge tlbs
for a lot of memory, you can't find the physical address of a page
without walking page tables, defragmenting still has a significant
cost in terms of moving pages and flushing TLBs etc.
So the train of thought up to now has been that a virtually mapped
kernel would be "the problem with the VM itself" ;)
We're actually at a point now where higher order allocations are
pretty rare and not a big problem (except with very special cases
like hugepages and memory hotplug which can mostly get away
with compromises, so we don't want to turn over the kernel just
for these).
So in my opinion, any increase of the dependence on higher order
allocations is simply a bad move until a killer use-case can be found.
They move us further away from good behaviour on our assumed
ideal of an identity mapped kernel.
(I don't actually dislike the idea of virtually mapped kernel. Maybe
hardware trends will favour that model and there are some potential
simple instructions a CPU can implement to help with some of the
performance hits. I'm sure it will be looked at again for Linux one day)
Hi,
On Tue, Sep 11, 2007 at 07:31:01PM +0100, Mel Gorman wrote:
> Now, the worst case scenario for your patch is that a hostile process
> allocates large amount of memory and mlocks() one 4K page per 64K chunk
> (this is unlikely in practice I know). The end result is you have many
> 64KB regions that are now unusable because 4K is pinned in each of them.
Initially 4k kmalloced tails aren't going to be mapped in
userland. But let's take the kernel stack that would generate the same
problem and that is clearly going to pin the whole 64k slab/slub
page.
What I think you're missing is that for Nick's worst case to trigger
with the config_page_shift design, you would need the _whole_ ram to
be _at_least_once_ allocated completely in kernel stacks. If the whole
100% of ram wouldn't go allocated in slub as a pure kernel stack, such
a scenario could never materialize.
With the SGI design + defrag, Nick's scenario can instead happen with
only total_ram/64k kernel stacks allocated.
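To put rough, purely illustrative numbers on that difference (8k kernel
stacks, 64k pages and 16G of ram assumed):

/* Back-of-envelope for the two worst cases above; figures are
 * illustrative only. */
#include <stdio.h>

int main(void)
{
	unsigned long long ram   = 16ULL << 30;  /* 16G of ram */
	unsigned long long stack = 8 << 10;      /* 8k kernel stack */
	unsigned long long lpage = 64 << 10;     /* 64k page */

	/* config_page_shift: the whole of ram must have been allocated as
	 * kernel stacks at least once before the worst case can build up. */
	printf("config_page_shift: %llu stacks (all of ram)\n", ram / stack);

	/* SGI design: one 8k stack pinning each 64k page is enough,
	 * i.e. only total_ram/64k stacks (1/8 of ram here). */
	printf("SGI design:        %llu stacks (1/8 of ram)\n", ram / lpage);
	return 0;
}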
The problem with slub fragmentation isn't a new problem; it
happens in today's kernels as well, and at least the slab by design is
meant to _defrag_ internally. So it's practically already solved and
it provides some guarantee, unlike the buddy allocator.
> If it's my fault, sorry about that. It wasn't my intention.
It's not the fault of anyone, I simply didn't push too hard towards my
agenda for the reasons I just said, but I used any given opportunity
to discuss it.
With on-topic I meant not talking about it during the other topics,
like mmap_sem or RCU with radix tree lock ;)
> heh. Well we need to come to some sort of conclusion here or this will
> go around the merry-go-round till we're all bald.
Well, I only meant I'm still free to disagree if I think there's a
better way. All SGI has provided so far is data to show that their I/O
subsystem is much faster if the data is physically contiguous in ram
(ask Linus if you want more details, or better don't ask). That's not
very interesting data for my usages and with my hardware, and I guess
it's more likely that config_page_shift will produce interesting
numbers than their patch on my possible usage cases, but we'll never
know until both are finished.
> heh, I suggested printing the warning because I knew it had this
> problem. The purpose in my mind was to see how far the design could be
> brought before fs-block had to fill in the holes.
Indeed!
> I am still failing to see what happens when there are pagetable pages,
> slab objects or mlocked 4k pages pinning the 64K pages and you need to
> allocate another 64K page for the filesystem. I *think* you deadlock in
> a similar fashion to Christoph's approach but the shape of the problem
> is different because we are dealing with internal instead of external
> fragmentation. Am I wrong?
pagetables aren't the issue. They should still be pre-allocated in
page_size chunks. The 4k entries with a 64k page-size are surely not worse
than a 32-byte kmalloc today; the slab by design defragments the
stuff. There's probably room for improvement in that area even without
freeing any object, by just ordering the list with an rbtree (or better
a heap, like CFS should also use!!) so as to always allocate new slabs
from the most full partial slab. That alone would probably help a lot
(not sure if slub does anything like that, I'm not fond of slub yet).
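Roughly this sort of ordering heuristic, as a toy sketch (a linear pick
over an array; nothing to do with the real slab/slub code, which would
keep the list sorted instead):

/* "Allocate from the fullest partial slab first" sketch. */
#include <stdio.h>

struct partial_slab {
	unsigned int inuse;      /* objects currently allocated */
	unsigned int capacity;   /* objects the slab can hold */
};

/* Pick the partial slab with the fewest free slots, so nearly-full slabs
 * fill up first and emptier ones get a chance to drain and be freed. */
static struct partial_slab *pick_fullest(struct partial_slab *s, int n)
{
	struct partial_slab *best = NULL;
	int i;

	for (i = 0; i < n; i++) {
		if (s[i].inuse == s[i].capacity)
			continue;               /* completely full, skip */
		if (!best || s[i].capacity - s[i].inuse <
			     best->capacity - best->inuse)
			best = &s[i];
	}
	return best;
}

int main(void)
{
	struct partial_slab slabs[] = { {3, 8}, {7, 8}, {1, 8} };
	struct partial_slab *s = pick_fullest(slabs, 3);

	if (s)
		printf("allocate from the slab with %u/%u objects in use\n",
		       s->inuse, s->capacity);  /* picks the 7/8 one */
	return 0;
}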
> Yes. I just think you have a different worst case that is just as bad.
Disagree here...
> small files (you need something like Shaggy's page tail packing),
> pagetable pages, pte pages all have to be dealt with. These are the
> things I think will cause us internal fragmentation problems.
Also note that not all users will need to turn on the tail
packing. We're here talking about features that not all users will
need anyway.. And we're in the same boat as ppc64, no difference.
Thanks!
On Tue, 11 Sep 2007, Nick Piggin wrote:
> There is a limitation in the VM. Fragmentation. You keep saying this
> is a solved issue and just assuming you'll be able to fix any cases
> that come up as they happen.
>
> I still don't get the feeling you realise that there is a fundamental
> fragmentation issue that is unsolvable with Mel's approach.
Well my problem first of all is that you did not read the full message. It
discusses that later and provides page pools to address the issue.
Secondly you keep FUDding people with lots of theoretical concerns
assuming Mel's approaches must fail. If there is an issue (I guess there
must be right?) then please give us a concrete case of a failure that we
can work against.
> The idea that there even _is_ a bug to fail when higher order pages
> cannot be allocated was also brushed aside by some people at the
> vm/fs summit. I don't know if those people had gone through the
> math about this, but it goes somewhat like this: if you use a 64K
> page size, you can "run out of memory" with 93% of your pages free.
> If you use a 2MB page size, you can fail with 99.8% of your pages
> still free. That's 64GB of memory used on a 32TB Altix.
Allocations can currently fail and all code has the requirement to handle
failure cases in one form or another.
Currently we can only handle up to order 3 allocs it seems. 2M pages (and
in particular pagesizes > MAX_ORDER) will have to be handled by a separate
large page pool facility discussed in the earlier message.
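For reference, the arithmetic behind the quoted 93%/99.8%/64GB figures,
assuming one pinned 4k page per large block:

/* Pin one 4k page in every large block and no large block can be
 * allocated any more, even though almost all memory is free. */
#include <stdio.h>

int main(void)
{
	double pin    = 4.0 * 1024;                         /* 4k pinned per block */
	double blk64k = 64.0 * 1024;
	double blk2m  = 2.0 * 1024 * 1024;
	double altix  = 32.0 * 1024 * 1024 * 1024 * 1024;   /* 32TB machine */

	printf("64k blocks: fail with %.1f%% of memory free\n",
	       100.0 * (1.0 - pin / blk64k));               /* ~93.8 */
	printf("2M blocks:  fail with %.1f%% of memory free\n",
	       100.0 * (1.0 - pin / blk2m));                /* ~99.8 */
	printf("32TB Altix: %.0f GB pinned\n",
	       altix / blk2m * pin / (1024 * 1024 * 1024)); /* 64 */
	return 0;
}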
On Tue, 11 Sep 2007, Andrea Arcangeli wrote:
> Furthermore all the issues with writeprotect faults over MAP_PRIVATE
> regions will have to be addressed the same way with both approaches if
> we want real 100% 4k-granular backwards compatibility.
Could you be more specific as to why my patch does not address that issue?
> And if I'm terribly wrong and the variable order pagecache is the way
> to go for the long run, the 64k tlb feature will fit in that model
> very nicely too.
Right.
On Tue, 11 Sep 2007, Jörn Engel wrote:
> While I agree with your concern, those numbers are quite silly. The
> chances of 99.8% of pages being free and the remaining 0.2% being
> perfectly spread across all 2MB large_pages are lower than those of SHA1
> creating a collision. I don't see anyone abandoning git or rsync, so
> your extreme example clearly is the wrong one.
>
> Again, I agree with your concern, even though your example makes it look
> silly.
You may want to consider Mel's antifrag approaches, which certainly
decrease the chance of this occurring. Reclaim can open up the needed
linear memory hole in an intentional way. The memory compaction approach
can even move pages to open up these 2M holes. The more pages we make
movable (see f.e. the targeted slab reclaim patchset that makes slab
pages movable) the more reliable higher order allocations become.
On Tue, 11 Sep 2007, Nick Piggin wrote:
> It would be interesting to craft an attack. If you knew roughly the layout
> and size of your dentry slab for example... maybe you could stat a whole
> lot of files, then open one and keep it open (maybe post the fd to a unix
> socket or something crazy!) when you think you have filled up a couple
> of MB worth of them. Repeat the process until your movable zone is
> gone. Or do the same things with pagetables, or task structs, or radix
> tree nodes, etc.. these are the kinds of things I worry about (as well as
> just the gradual natural degradation).
I guess you would have to run that without my targeted slab reclaim
patchset? Otherwise the slabs that are in the way could be reclaimed and
you could not produce your test case.
On Tue, 11 Sep 2007, Andrea Arcangeli wrote:
> But the whole point is that with the config_page_shift, Nick's worst
> case scenario can't happen by design regardless of defrag or not
> defrag. While it can _definitely_ happen with SGI design (regardless
> of any defrag thing). We can still try to save some memory by
> defragging the slab a bit, but it's by far *not* required with
> config_page_shift. No defrag at all is required in fact.
Which worst case scenario? So far this is all a bit foggy.
> Let's see how good the mmap support for variable order page size will
> work after the 2 weeks...
Yeah. Give us some failure scenarios please!
Odd. I keep arguing against the solution I prefer.
On Tue, 11 September 2007 21:20:52 +0200, Andrea Arcangeli wrote:
>
> The problem with slub fragmentation isn't a new problem; it
> happens in today's kernels as well, and at least the slab by design is
> meant to _defrag_ internally. So it's practically already solved and
> it provides some guarantee, unlike the buddy allocator.
Slab defrag doesn't look like a solved problem. Basically, slab
allocator was designed to group similar objects together. Main reason
in this context is that similar objects have similar lifetimes. And it
is true that one dentry's lifetime is more likely to match another one's
than, say, a struct bio's.
But different dentries still have vastly different lifetimes. And with
that, fragmentation will continue to occur. So the problem is not
solved. It is a hell of a lot better than pre-slab days, just not
perfect.
> What I think you're missing is that for Nick's worst case to trigger
> with the config_page_shift design, you would need the _whole_ ram to
> be _at_least_once_ allocated completely in kernel stacks. If the whole
> 100% of ram wouldn't go allocated in slub as a pure kernel stack, such
> a scenario could never materialize.
Things get somewhat worse with multiple attack vectors (whether
malicious or accidental). Spending 20% of ram on each of {kernel
stacks, dentries, inodes, mlocked pages, size-XXX} would be sufficient.
The system can spend 20% on kernel stacks with 80% free, then spend 20%
on dentries with 60% free and 20% wasted in almost-free kernel stack
slabs, etc.
To argue in favor, for a change, the exact same scenario would be
possible with Christoph's solution as well. It would even be more
likely. Where in your case 20% of all memory has to go to each slab
cache at one time, only one page per largepage of that would be
necessary in Christoph's case. The rest could be allocated for other
purposes.
So overall I prefer your approach, for whatever my two cents of armchair
opinion are worth.
Jörn
--
I've never met a human being who would want to read 17,000 pages of
documentation, and if there was, I'd kill him to get him out of the
gene pool.
-- Joseph Costello
On Wednesday 12 September 2007 06:01, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > There is a limitation in the VM. Fragmentation. You keep saying this
> > is a solved issue and just assuming you'll be able to fix any cases
> > that come up as they happen.
> >
> > I still don't get the feeling you realise that there is a fundamental
> > fragmentation issue that is unsolvable with Mel's approach.
>
> Well my problem first of all is that you did not read the full message. It
> discusses that later and provides page pools to address the issue.
>
> Secondly you keep FUDding people with lots of theoretical concerns
> assuming Mel's approaches must fail. If there is an issue (I guess there
> must be right?) then please give us a concrete case of a failure that we
> can work against.
On the other hand, you ignore the potential failure cases, and ignore
the alternatives that do not have such cases.
On Tue, 11 September 2007 13:07:06 -0700, Christoph Lameter wrote:
>
> You may want to consider Mel's antifrag approaches, which certainly
> decrease the chance of this occurring. Reclaim can open up the needed
> linear memory hole in an intentional way. The memory compaction approach
> can even move pages to open up these 2M holes. The more pages we make
> movable (see f.e. the targeted slab reclaim patchset that makes slab
> pages movable) the more reliable higher order allocations become.
I absolutely agree with your slab reclaim patchset. No argument here.
What I'm starting to wonder about is where your approach has advantages
over Andrea's. The chances of triggering something vaguely similar to
Nick's worst case scenario are certainly higher for your solution. So
unless there are other upsides it is just the second-best solution.
Jörn
--
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein
On Wednesday 12 September 2007 06:11, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > It would be interesting to craft an attack. If you knew roughly the
> > layout and size of your dentry slab for example... maybe you could stat a
> > whole lot of files, then open one and keep it open (maybe post the fd to
> > a unix socket or something crazy!) when you think you have filled up a
> > couple of MB worth of them. Repeat the process until your movable zone is
> > gone. Or do the same things with pagetables, or task structs, or radix
> > tree nodes, etc.. these are the kinds of things I worry about (as well as
> > just the gradual natural degradation).
>
> I guess you would have to run that without my targeted slab reclaim
> patchset? Otherwise the slabs that are in the way could be reclaimed and
> you could not produce your test case.
I didn't realise you had patches to move pinned dentries, radix tree nodes,
task structs, page tables etc. Did I miss them in your last patchset?
On Tue, 11 Sep 2007, Jörn Engel wrote:
> What I'm starting to wonder about is where your approach has advantages
> over Andrea's. The chances of triggering something vaguely similar to
> Nick's worst case scenario are certainly higher for your solution. So
> unless there are other upsides it is just the second-best solution.
Nick's worst case scenario is already at least partially addressed by the
Lumpy Reclaim code in 2.6.23. His examples assume 2.6.22 or earlier
code.
The advantage of this approach over Andrea's is basically that 4k
filesystems can still be used as is. 4k is useful for binaries and for
text processing like used for compiles. Large Page sizes are useful for
file systems that contain large datasets (scientific data, multimedia
stuff, databases) for applications that depend on high I/O throughput.
On Tue, 11 Sep 2007, Nick Piggin wrote:
> > I guess you would have to run that without my targeted slab reclaim
> > patchset? Otherwise the slabs that are in the way could be reclaimed and
> > you could not produce your test case.
>
> I didn't realise you had patches to move pinned dentries, radix tree nodes,
> task structs, page tables etc. Did I miss them in your last patchset?
You did not mention that in your earlier text. If these are issues then we
certainly can work on that. Could you first provide us some real failure
conditions so that we know that these are real problems?
On (11/09/07 11:44), Nick Piggin didst pronounce:
> On Wednesday 12 September 2007 01:36, Mel Gorman wrote:
> > On Tue, 2007-09-11 at 04:52 +1000, Nick Piggin wrote:
> > > On Tuesday 11 September 2007 16:03, Christoph Lameter wrote:
> > > > 5. VM scalability
> > > > Large block sizes mean less state keeping for the information being
> > > > transferred. For a 1TB file one needs to handle 256 million page
> > > > structs in the VM if one uses 4k page size. A 64k page size reduces
> > > > that amount to 16 million. If the limitation in existing filesystems
> > > > are removed then even higher reductions become possible. For very
> > > > large files like that a page size of 2 MB may be beneficial which
> > > > will reduce the number of page struct to handle to 512k. The
> > > > variable nature of the block size means that the size can be tuned at
> > > > file system creation time for the anticipated needs on a volume.
> > >
> > > There is a limitation in the VM. Fragmentation. You keep saying this
> > > is a solved issue and just assuming you'll be able to fix any cases
> > > that come up as they happen.
> > >
> > > I still don't get the feeling you realise that there is a fundamental
> > > fragmentation issue that is unsolvable with Mel's approach.
> >
> > I thought we had discussed this already at VM and reached something
> > resembling a conclusion. It was acknowledged that depending on
> > contiguous allocations to always succeed will get a caller into trouble
> > and they need to deal with fallback - whether the problem was
> > theoretical or not. It was also strongly pointed out that the large
> > block patches as presented would be vulnerable to that problem.
>
> Well Christoph seems to still be spinning them as a solution for VM
> scalability and first class support for making contiguous IOs, large
> filesystem block sizes etc.
>
Yeah, I can't argue with you there. I was under the impression that we
would be dealing with this strictly as a second class solution to see
what it bought to help steer the direction of fsblock.
> At the VM summit I think the conclusion was that grouping by
> mobility could be merged. I'm still not thrilled by that, but I was
> going to get steamrolled[*] anyway... and seeing as the userspace
> hugepages is a relatively demanded workload and can be
> implemented in this way with basically no other changes to the
> kernel and already must have fallbacks.... then that's actually a
> reasonable case for it.
>
As you say, a difference is if we fail to allocate a hugepage, the world
does not end. It's been a well known problem for years and grouping pages
by mobility is aimed at relaxing some of the more painful points. It has
other uses as well, but each of them is expected to deal with failures with
contiguous range allocation.
> The higher order pagecache, again I'm just going to get steamrolled
> on, and it actually isn't so intrusive minus the mmap changes, so I
> didn't have much to reasonably say there.
>
If the mmap() change is bad, then it gets held up.
> And I would have kept quiet this time too, except for the worrying idea
> to use higher order pages to fix the SLUB vs SLAB regression, and if
> the rationale for this patchset was more realistic.
>
I don't agree with using higher order pages to fix SLUB vs SLAB performance
issues either. SLUB has to be able to compete with SLAB on its own terms. If
SLUB gains x% over SLAB in specialised cases with high orders, then fair
enough but minimally, SLUB has to perform the same as SLAB at order-0. Like
you, I think if we depend on SLUB using high orders to match SLAB, we are
going to get kicked further down the line.
However, this discussion belongs more with the non-existent-remove-slab patch.
Based on what we've seen since the summits, we need a thorough analysis
with benchmarks before making a final decision (kernbench, ebizzy, tbench
(netpipe if someone has the time/resources), hackbench and maybe sysbench
as well as something the filesystem people recommend to get good coverage
of the subsystems).
> [*] And I don't say steamrolled because I'm bitter and twisted :) I
> personally want the kernel to be perfect. But I realise it already isn't
> and for practical purposes people want these things, so I accept
> being overruled, no problem. The fact simply is -- I would have been
> steamrolled I think :P
>
I'd rather not get side-tracked here. I regret you feel steamrolled but I
think grouping pages by mobility is the right thing to do for better usage
of the TLB by the kernel and for improving hugepage support in userspace
minimally. We never really did see eye-to-eye but this way, if I'm wrong
you get to chuck eggs down the line.
> > The alternatives were fs-block and increasing the size of order-0. It
> > was felt that fs-block was far away because it's complex and I thought
> > that increasing the pagesize like what Andrea suggested would lead to
> > internal fragmentation problems. Regrettably we didn't discuss Andrea's
> > approach in depth.
>
> Sure. And some people run workloads where fragmentation is likely never
> going to be a problem, they are shipping this poorly configured hardware
> now or soon, so they don't have too much interest in doing it right at this
> point, rather than doing it *now*. OK, that's a valid reason which is why I
> don't use the argument that we should do it correctly or never at all.
>
So are we saying the right thing to do is go with fs-block from day 1 once we
get it to optimistically use high-order pages? I think your concern might be
that if this goes in then it'll be harder to justify fsblock in the future
because it'll be solving a theoretical problem that takes months to trigger
if at all. i.e. The filesystem people will push because apparently large
block support as it is solves world peace. Is that accurate?
>
> > I *thought* that the end conclusion was that we would go with
> > Christoph's approach pending two things being resolved;
> >
> > o mmap() support that we agreed on is good
>
> In theory (and again for the filesystem guys who don't have to worry about
> it). In practice after seeing the patch it's not a nice thing for the VM to
> have to do.
>
That may be a good enough reason on its own to delay this. It's a
technically provable point.
>
> > I also thought there was an acknowledgement that long-term, fs-block was
> > the way to go - possibly using contiguous pages optimistically instead
> > of virtual mapping the pages. At that point, it would be a general
> > solution and we could remove the warnings.
>
> I guess it is still in the air. I personally think a vmapping approach and/or
> teaching filesystems to do some nonlinear block metadata access is the
> way to go (strangely, this happens to be one of the fsblock paradigms!).
What a co-incidence :)
> OTOH, I'm not sure how much buy-in there was from the filesystems guys.
> Particularly Christoph H and XFS (which is strange because they already do
> vmapping in places).
>
I think they use vmapping because they have to, not because they want
to. They might be a lot happier with fsblock if it used contiguous pages
for large blocks whenever possible - I don't know for sure. The metadata
accessors they might be unhappy with because it's inconvenient but as
Christoph Hellwig pointed out at VM/FS, the filesystems who really care
will convert.
> That's understandable though. It is a lot of work for filesystems. But the
> reason I think it is the correct approach for larger block than soft-page
> size is that it doesn't have fundamental issues (assuming that virtually
> mapping the entire kernel is off the table).
>
Virtually mapping the entire kernel is still off the table. We don't have
recent figures but the last measured slowdown I'm aware of was in the 5-10%
range for kernbench when we break 1:1 virt:phys mapping although that
may be because we also lose hugepage backing of the kernel portion of the
address space. I didn't look deeply at the time because it was bad
whatever the root cause.
> > Basically, to start out with, this was going to be an SGI-only thing so
> > they get to rattle out the issues we expect to encounter with large
> > blocks and help steer the direction of the
> > more-complex-but-safer-overall fs-block.
>
> That's what I expected, but it seems from the descriptions in the patches
> that it is also supposed to cure cancer :)
>
heh, fair enough. I guess that minimally, the leaders need warnings all
over the place as well or else we're going back to the drawing board or all
getting behind fs-block (or Andrea's if it comes to that) and pushing.
>
> > > The idea that there even _is_ a bug to fail when higher order pages
> > > cannot be allocated was also brushed aside by some people at the
> > > vm/fs summit.
> >
> > When that brushing occurred, I thought I made it very clear what the
> > expectations were and that without fallback they would be taking a risk.
> > I am not sure if that message actually sank in or not.
>
> No, you have been good about that aspect. I wasn't trying to point to you
> at all here.
>
>
> > > I don't know if those people had gone through the
> > > math about this, but it goes somewhat like this: if you use a 64K
> > > page size, you can "run out of memory" with 93% of your pages free.
> > > If you use a 2MB page size, you can fail with 99.8% of your pages
> > > still free. That's 64GB of memory used on a 32TB Altix.
> >
> > That's the absolute worst case but yes, in theory this can occur and
> > it's safest to assume the situation will occur somewhere to someone. It
> > would be difficult to craft an attack to do it but conceivably a machine
> > running for a long enough time would trigger it particularly if the
> > large block allocations are GFP_NOIO or GFP_NOFS.
>
> It would be interesting to craft an attack. If you knew roughly the layout
> and size of your dentry slab for example... maybe you could stat a whole
> lot of files, then open one and keep it open (maybe post the fd to a unix
> socket or something crazy!) when you think you have filled up a couple
> of MB worth of them.
I might regret saying this, but it would be easier to craft an attack
using pagetable pages. It's woefully difficult to do but it's probably
doable. I say pagetables because while slub targeted reclaim is on the
cards and memory compaction exists for page cache pages, pagetables are
currently pinned with no prototype patch existing to deal with them.
> Repeat the process until your movable zone is
> gone. Or do the same things with pagetables, or task structs, or radix
> tree nodes, etc.. these are the kinds of things I worry about (as well as
> just the gradual natural degradation).
>
If we hit this problem at all, it'll be due to gradual natural degradation.
It used to be the case that jumbo ethernet reported problems after running
for weeks and we might encounter something similar with large blocks while it
lacks a fallback. We no longer see jumbo ethernet reports but the fact is we
don't know if it's because we fixed it or people gave up. Chances are people
will be more persistent with large blocks than they were with jumbo ethernet.
> Yeah, it might be reasonably possible to make an attack that would
> deplete most of higher order allocations while pinning somewhat close
> to just the theoretical minimum required.
>
> [snip]
>
> Thanks Mel. Fairly good summary I think.
>
>
> > > Basically, if you're placing your hopes for VM and IO scalability on
> > > this, then I think that's a totally broken thing to do and will end up
> > > making the kernel worse in the years to come (except maybe on some poor
> > > configurations of bad hardware).
> >
> > My magic 8-ball is in the garage.
> >
> > I thought the following plan was sane but I could be la-la
> >
> > 1. Go with large block + explosions to start with
> > - Second class feature at this point, not fully supported
> > - Experiment in different places to see what it gains (if anything)
> > 2. Get fs-block in slowly over time with the fallback options replacing
> > Christophs patches bit by bit
> > 3. Kick away warnings
> > - First class feature at this point, fully supported
>
> I guess that was my hope. The only problem I have with a 2nd class
> higher order pagecache on a *practical* technical issue is introducing
> more complexity in the VM for mmap. Andrea and Hugh are probably
> more guardians of that area of code than I, so if they're happy with the
> mmap stuff then again I can accept being overruled on this ;)
>
I'm happy to go with their decision on this one as well unless Andrea says
he hates it simply on the grounds he wants the PAGE_SIZE_SHIFT solution :)
> Then I would love to say #2 will go ahead (and I hope it would), but I
> can't force it down the throat of the filesystem maintainers just like I
> feel they can't force vm devs (me) to do a virtually mapped and
> defrag-able kernel :) Basically I'm trying to practice what I preach and
> I don't want to force fsblock onto anyone.
>
If the FS people really want it and they insist that this has to be a
#1 citizen then it's fsblock or make something new up. I'm still not 100%
convinced that Andrea's solution is immune from fragmentation problems. Also,
I don't think a virtually mapped and 100% defraggable kernel is going to
perform very well or I'd have gone down that road already.
> Maybe when ext2 is converted and if I can show it isn't a performance
> problem / too much complexity then I'll have another leg to stand on
> here... I don't know.
>
Even if the conversion is hard, if it's only a few days' work per filesystem
it's difficult to argue against on any grounds other than name calling.
> > Independently of that, we would work on order-0 scalability,
> > particularly readahead and batching operations on ranges of pages as
> > much as possible.
>
> Definitely. Also, aops capable of spanning multiple pages, batching of
> large write(2) pagecache insertion, etc all are things we must go after,
> regardless of the large page and/or block size work.
>
Agreed.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wednesday 12 September 2007 06:01, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > There is a limitation in the VM. Fragmentation. You keep saying this
> > is a solved issue and just assuming you'll be able to fix any cases
> > that come up as they happen.
> >
> > I still don't get the feeling you realise that there is a fundamental
> > fragmentation issue that is unsolvable with Mel's approach.
>
> Well my problem first of all is that you did not read the full message. It
> discusses that later and provides page pools to address the issue.
>
> Secondly you keep FUDding people with lots of theoretical concerns
> assuming Mel's approaches must fail. If there is an issue (I guess there
> must be right?) then please give us a concrete case of a failure that we
> can work against.
And BTW, before you accuse me of FUD, I'm actually talking about the
fragmentation issues on which Mel I think mostly agrees with me at this
point.
Also have you really a rational reason why we should just up and accept
all these big changes happening just because that, while there are lots
of theoretical issues, the person pointing them out to you hasn't happened
to give you a concrete failure case. Oh, and the actual performance
benefit is actually not really even quantified yet, crappy hardware
notwithstanding, nor has there been a proper evaluation of the alternatives.
So... would you drive over a bridge if the engineer had this mindset?
On (11/09/07 12:26), Nick Piggin didst pronounce:
> On Wednesday 12 September 2007 04:31, Mel Gorman wrote:
> > On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote:
> > > On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> > > > that increasing the pagesize like what Andrea suggested would lead to
> > > > internal fragmentation problems. Regrettably we didn't discuss Andrea's
> > >
> > > The config_page_shift guarantees that kernel stacks or whatever other
> > > non-defragmentable allocations go into the same 64k "not
> > > defragmentable" page. Not like with the SGI design, where an 8k kernel stack
> > > could be allocated in the first 64k page, and then another 8k stack
> > > could be allocated in the next 64k page, effectively pinning all 64k
> > > pages until Nick's worst case scenario triggers.
> >
> > In practice, it's pretty difficult to trigger. Buddy allocators always
> > try and use the smallest possible sized buddy to split. Once a 64K is
> > split for a 4K or 8K allocation, the remainder of that block will be
> > used for other 4K, 8K, 16K, 32K allocations. The situation where
> > multiple 64K blocks gets split does not occur.
> >
> > Now, the worst case scenario for your patch is that a hostile process
> > allocates large amount of memory and mlocks() one 4K page per 64K chunk
> > (this is unlikely in practice I know). The end result is you have many
> > 64KB regions that are now unusable because 4K is pinned in each of them.
> > Your approach is not immune from problems either. To me, only Nicks
> > approach is bullet-proof in the long run.
>
> One important thing I think in Andrea's case, the memory will be accounted
> for (eg. we can limit mlock, or work within various memory accounting things).
>
For mlock()ed pages, sure. Not for pagetables though, kmalloc slabs, etc. It
might be a non-issue as well. Like the large block patches, there are
aspects of Andrea's case that we simply do not know.
> With fragmentation, I suspect it will be much more difficult to do this. It
> would be another layer of heuristics that will also inevitably go wrong
> at times if you try to limit how much "fragmentation" a process can do.
> Quite likely it is hard to make something even work reasonably well in
> most cases.
Regrettably, this is also woefully difficult to prove. For
fragmentation, I can look into having a more expensive version of
/proc/pagetypeinfo to give a detailed account of the current
fragmentation state but it's a side-issue.
> > > We can still try to save some memory by
> > > defragging the slab a bit, but it's by far *not* required with
> > > config_page_shift. No defrag at all is required in fact.
> >
> > You will need to take some sort of defragmentation to deal with internal
> > fragmentation. It's a very similar problem to blasting away at slab
> > pages and still not being able to free them because objects are in use.
> > Replace "slab" with "large page" and "object" with "4k page" and the
> > issues are similar.
>
> Well yes, and slab has issues today too with internal fragmentation,
> targeted reclaim and some (small) higher order allocations too.
> But at least with config_page_shift, you don't introduce _new_ sources
> of problems (eg. coming from pagecache or other allocs).
>
Well, we do extend the internal fragmentation problem. Previously, it was
inode, dcache and friends. Now we have to deal with internal
fragmentation related to page tables, per-cpu pages etc. Maybe they can
be solved too, but they are of similar difficulty to what Christoph
faces.
> Sure, there are some other things -- like pagecache can actually use
> up more memory instead -- but there are a number of other positives
> that Andrea's has as well. It is using order-0 pages, which are first class
> throughout the VM; they have per-cpu queues, and do not require any
> special reclaim code.
Being able to use the per-cpu queues is a big plus.
> They also *actually do* reduce the page
> management overhead in the general case, unlike higher order pcache.
>
> So combined with the accounting issues, I think it is unfair to say that
> Andrea's is just moving the fragmentation to internal. It has a number
> of upsides. I have no idea how it will actually behave and perform, mind
> you ;)
>
Neither do I. Andrea's suggestion definitely has upsides. I'm just saying
it's not going to cure cancer any better than the large block patchset ;)
>
> > > Plus there's a cost in defragging and freeing cache... the more you
> > > need defrag, the slower the kernel will be.
> > >
> > > > approach in depth.
> > >
> > > Well it wasn't my fault if we didn't discuss it in depth though.
> >
> > If it's my fault, sorry about that. It wasn't my intention.
>
> I think it did get brushed aside a little quickly too (not blaming anyone).
> Maybe because Linus was hostile. But *if* the idea is that page
> management overhead has or will become a problem that needs fixing,
> then neither higher order pagecache, nor (obviously) fsblock, fixes this
> properly. Andrea's most definitely has the potential to.
Fair point.
What way to jump now is the question.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wednesday 12 September 2007 06:42, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > > I guess you would have to run that without my targeted slab reclaim
> > > patchset? Otherwise the slabs that are in the way could be reclaimed and
> > > you could not produce your test case.
> >
> > I didn't realise you had patches to move pinned dentries, radix tree
> > nodes, task structs, page tables etc. Did I miss them in your last
> > patchset?
>
> You did not mention that in your earlier text.
Actually, I am pretty sure everything I mentioned was explicitly
things that your patches do not handle. This was not a coincidence.
> If these are issues then we
> certainly can work on that. Could you first provide us some real failure
> conditions so that we know that these are real problems?
I think I would have as good a shot as any to write a fragmentation
exploit, yes. I think I've given you enough info to do the same, so I'd
like to hear a reason why it is not a problem.
On (11/09/07 15:17), Nick Piggin didst pronounce:
> On Wednesday 12 September 2007 06:01, Christoph Lameter wrote:
> > On Tue, 11 Sep 2007, Nick Piggin wrote:
> > > There is a limitation in the VM. Fragmentation. You keep saying this
> > > is a solved issue and just assuming you'll be able to fix any cases
> > > that come up as they happen.
> > >
> > > I still don't get the feeling you realise that there is a fundamental
> > > fragmentation issue that is unsolvable with Mel's approach.
> >
> > Well my problem first of all is that you did not read the full message. It
> > discusses that later and provides page pools to address the issue.
> >
> > Secondly you keep FUDding people with lots of theoretical concerns
> > assuming Mel's approaches must fail. If there is an issue (I guess there
> > must be right?) then please give us a concrete case of a failure that we
> > can work against.
>
> And BTW, before you accuse me of FUD, I'm actually talking about the
> fragmentation issues on which Mel I think mostly agrees with me at this
> point.
>
I'm half way between you two on this one. I agree with Christoph in that
it's currently very difficult to trigger a failure scenario and today we
don't have a way of dealing with it. I agree with Nick in that conceivably a
failure scenario does exist somewhere and the careful person (or paranoid if
you prefer) would deal with it pre-emptively. The fact is that no one knows
what a large block workload is going to look like to the allocator so we're
all hand-waving.
Right now, I can't trigger the worst fragmentation failure scenarios, the ones
that cannot be dealt with, but that might change with large blocks. The
worst situation I can think of is a process that continuously dirties large
amounts of data on a large block filesystem while another set of processes
works with large amounts of anonymous data without any swap space configured,
with slub_min_order set somewhere between order-0 and the large block size.
Fragmentation wise, that's just a kick in the pants and might produce
the failure scenario being looked for.
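Roughly something like this pair of processes (the mount point, file size
and memory size are all made up; it assumes the box has no swap):

/* One process keeps dirtying a file on the large-block filesystem
 * while another chews through anonymous memory. */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	if (fork() == 0) {
		/* dirtier: rewrite a ~1G file in 64k chunks forever */
		char *buf = malloc(64 * 1024);
		int fd = open("/mnt/largeblk/dirty", O_CREAT | O_WRONLY, 0600);
		int i;

		if (fd < 0 || !buf)
			_exit(1);
		memset(buf, 0xaa, 64 * 1024);
		for (;;) {
			lseek(fd, 0, SEEK_SET);
			for (i = 0; i < 16384; i++)
				write(fd, buf, 64 * 1024);
		}
	}

	/* anon hog: keep touching 2G of anonymous memory */
	{
		size_t anon = 2UL << 30, off;
		char *mem = malloc(anon);

		if (!mem)
			return 1;
		for (;;)
			for (off = 0; off < anon; off += 4096)
				mem[off] = 1;
	}
	return 0;
}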
If it does fail, I don't think it should be used to beat Christoph with as
such because it was meant to be a #2 solution. What hits it is if the mmap()
change is unacceptable.
> Also have you really a rational reason why we should just up and accept
> all these big changes happening just because that, while there are lots
> of theoretical issues, the person pointing them out to you hasn't happened
> to give you a concrete failure case. Oh, and the actual performance
> benefit is actually not really even quantified yet, crappy hardware not
> withstanding, and neither has a proper evaluation of the alternatives.
>
Performance figures would be nice. dbench is flaky as hell but can
comparison figures be generated on one filesystem with 4K blocks and one
with 64K? I guess we can do it ourselves too because this should work on
normal machines.
> So... would you drive over a bridge if the engineer had this mindset?
>
If I had this bus that couldn't go below 50MPH, right...... never mind.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Tue, 11 Sep 2007, Mel Gorman wrote:
> > Well Christoph seems to still be spinning them as a solution for VM
> > scalability and first class support for making contiguous IOs, large
> > filesystem block sizes etc.
> >
>
> Yeah, I can't argue with you there. I was under the impression that we
> would be dealing with this strictly as a second class solution to see
> what it bought to help steer the direction of fsblock.
I think we all have the same impression. But should second class not be
okay for IO and FS in special situations?
> As you say, a difference is if we fail to allocate a hugepage, the world
> does not end. It's been a well known problem for years and grouping pages
> by mobility is aimed at relaxing some of the more painful points. It has
> other uses as well, but each of them is expected to deal with failures with
> contiguous range allocation.
Note that this patchset only needs higher order pages up to 64k, not
2M.
> > And I would have kept quiet this time too, except for the worrying idea
> > to use higher order pages to fix the SLUB vs SLAB regression, and if
> > the rationale for this patchset was more realistic.
> >
>
> I don't agree with using higher order pages to fix SLUB vs SLAB performance
> issues either. SLUB has to be able to compete with SLAB on its own terms. If
> SLUB gains x% over SLAB in specialised cases with high orders, then fair
> enough but minimally, SLUB has to perform the same as SLAB at order-0. Like
> you, I think if we depend on SLUB using high orders to match SLAB, we are
> going to get kicked further down the line.
That issue is discussed elsewhere and we have a patch in mm to address
it.
> > In theory (and again for the filesystem guys who don't have to worry about
> > it). In practice after seeing the patch it's not a nice thing for the VM to
> > have to do.
> >
>
> That may be a good enough reason on its own to delay this. It's a
> technically provable point.
It would be good to know what is wrong with the patch. I was surprised at how
easy it was to implement mmap.
> I might regret saying this, but it would be easier to craft an attack
> using pagetable pages. It's woefully difficult to do but it's probably
> doable. I say pagetables because while slub targeted reclaim is on the
> cards and memory compaction exists for page cache pages, pagetables are
> currently pinned with no prototype patch existing to deal with them.
Hmmm... I thought Peter had a patch to move page table pages?
> If we hit this problem at all, it'll be due to gradual natural degradation.
> It used to be the case that jumbo ethernet reported problems after running
> for weeks and we might encounter something similar with large blocks while it
> lacks a fallback. We no longer see jumbo ethernet reports but the fact is we
> don't know if it's because we fixed it or people gave up. Chances are people
> will be more persistent with large blocks than they were with jumbo ethernet.
I have seen a failure recently with jumbo frames and order 2 allocs on
2.6.22. But then .22 has no lumpy reclaim.
On Tue, 11 Sep 2007, Nick Piggin wrote:
> I think I would have as good a shot as any to write a fragmentation
> exploit, yes. I think I've given you enough info to do the same, so I'd
> like to hear a reason why it is not a problem.
No, you have not explained why the theoretical issues continue to exist
even just considering Lumpy Reclaim in .23, nor what effect the
antifrag patchset would have.
irrelevant to this patchset that deals with blocksizes up to 64k. In my
experience the use of blocksize < PAGE_COSTLY_ORDER (32k) is reasonably
safe.
On Wednesday 12 September 2007 06:53, Mel Gorman wrote:
> On (11/09/07 11:44), Nick Piggin didst pronounce:
> However, this discussion belongs more with the non-existent-remove-slab
> patch. Based on what we've seen since the summits, we need a thorough
> analysis with benchmarks before making a final decision (kernbench, ebizzy,
> tbench (netpipe if someone has the time/resources), hackbench and maybe
> sysbench as well as something the filesystem people recommend to get good
> coverage of the subsystems).
True. Aside, it seems I might have been mistaken in saying Christoph
is proposing to use higher order allocations to fix the SLUB regression.
Anyway, I agree let's not get sidetracked about this here.
> I'd rather not get side-tracked here. I regret you feel steamrolled but I
> think grouping pages by mobility is the right thing to do for better usage
> of the TLB by the kernel and for improving hugepage support in userspace
> minimally. We never really did see eye-to-eye but this way, if I'm wrong
> you get to chuck eggs down the line.
No it's a fair point, and even the hugepage allocations alone are a fair
point. From the discussions I think it seems like quite probably the right
thing to do pragmatically, which is what Linux is about and I hope will
result in a better kernel in the end. So I don't have complaints except
from my little ivory tower ;)
> > Sure. And some people run workloads where fragmentation is likely never
> > going to be a problem, they are shipping this poorly configured hardware
> > now or soon, so they don't have too much interest in doing it right at
> > this point, rather than doing it *now*. OK, that's a valid reason which
> > is why I don't use the argument that we should do it correctly or never
> > at all.
>
> So are we saying the right thing to do is go with fs-block from day 1 once
> we get it to optimistically use high-order pages? I think your concern
> might be that if this goes in then it'll be harder to justify fsblock in
> the future because it'll be solving a theoretical problem that takes months
> to trigger if at all. i.e. The filesystem people will push because
> apparently large block support as it is solves world peace. Is that
> accurate?
Heh. It's hard to say. I think fsblock could take a while to implement,
regardless of high order pages or not. I actually would like to be able
to pass down a mandate to say higher order pagecache will never
get merged, simply so that these talented people would work on
fsblock ;)
But that's not my place to say, and I'm actually not arguing that high
order pagecache does not have uses (especially as a practical,
shorter-term solution which is unintrusive to filesystems).
So no, I don't think I'm really going against the basics of what we agreed
in Cambridge. But it sounds like it's still being billed as first-order
support right off the bat here.
> > OTOH, I'm not sure how much buy-in there was from the filesystems guys.
> > Particularly Christoph H and XFS (which is strange because they already
> > do vmapping in places).
>
> I think they use vmapping because they have to, not because they want
> to. They might be a lot happier with fsblock if it used contiguous pages
> for large blocks whenever possible - I don't know for sure. The metadata
> accessors they might be unhappy with because it's inconvenient but as
> Christoph Hellwig pointed out at VM/FS, the filesystems who really care
> will convert.
Sure, they would rather not to. But there are also a lot of ways you can
improve vmap more than what XFS does (or probably what darwin does)
(more persistence for cached objects, and batched invalidates for example).
There are also a lot of trivial things you can do to make a lot of those
accesses not require vmaps (and less trivial things, but even such things
as binary searches over multiple pages should be quite possible with a bit
of logic).
> > It would be interesting to craft an attack. If you knew roughly the
> > layout and size of your dentry slab for example... maybe you could stat a
> > whole lot of files, then open one and keep it open (maybe post the fd to
> > a unix socket or something crazy!) when you think you have filled up a
> > couple of MB worth of them.
>
> I might regret saying this, but it would be easier to craft an attack
> using pagetable pages. It's woefully difficult to do but it's probably
> doable. I say pagetables because while slub targetted reclaim is on the
> cards and memory compaction exists for page cache pages, pagetables are
> currently pinned with no prototype patch existing to deal with them.
But even so, you can just hold an open fd in order to pin the dentry you
want. My attack would go like this: get the page size and allocation group
size for the machine, then get the number of dentries required to fill a
slab. Then read in that many dentries and pin one of them. Repeat the
process. Even if there is other activity on the system, it seems possible
that such a thing will cause some headaches after not too long a time.
Some sources of pinned memory are going to be better than others for
this of course, so yeah maybe pagetables will be a bit easier (I don't know).
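As a minimal user-space sketch of that attack, assuming made-up numbers for
the batch size and for how many dentries fill a slab page (real slab layout
will differ, so this only illustrates the shape of the idea):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define BATCHES   512   /* assumed: number of slab pages to target   */
#define PER_BATCH 128   /* assumed: dentries needed to fill one slab */

int main(void)
{
        struct stat st;
        char name[64];
        int batch, i;

        for (batch = 0; batch < BATCHES; batch++) {
                for (i = 0; i < PER_BATCH; i++) {
                        snprintf(name, sizeof(name), "f-%d-%d", batch, i);
                        /* creating and stat()ing the file instantiates a dentry */
                        close(open(name, O_CREAT | O_RDONLY, 0644));
                        stat(name, &st);
                }
                /* pin one dentry per batch by keeping its fd open */
                snprintf(name, sizeof(name), "f-%d-0", batch);
                if (open(name, O_RDONLY) < 0)
                        perror("open");
        }
        pause();        /* hold the pinned fds (and dentries) forever */
        return 0;
}

Run it from a scratch directory; whether each pinned dentry really lands in
its own slab page depends on the actual slab layout, so treat it as an
illustration rather than a working exploit.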
> > Then I would love to say #2 will go ahead (and I hope it would), but I
> > can't force it down the throat of the filesystem maintainers just like I
> > feel they can't force vm devs (me) to do a virtually mapped and
> > defrag-able kernel :) Basically I'm trying to practice what I preach and
> > I don't want to force fsblock onto anyone.
>
> If the FS people really want it and they insist that this has to be a
> #1 citizen then it's fsblock or make something new up.
Well I'm glad you agree :) I think not all do, but as you say maybe the
only thing is just to leave it up to the individual filesystems...
On Wednesday 12 September 2007 07:41, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > I think I would have as good a shot as any to write a fragmentation
> > exploit, yes. I think I've given you enough info to do the same, so I'd
> > like to hear a reason why it is not a problem.
>
> No you have not explained why the theoretical issues continue to exist
> given even just considering Lumpy Reclaim in .23 nor what effect the
> antifrag patchset would have.
So how does lumpy reclaim, your slab patches, or anti-frag have
much effect on the worst case situation? Or help much against a
targetted fragmentation attack?
> And you have used a 2M pagesize which is
> irrelevant to this patchset that deals with blocksizes up to 64k. In my
> experience the use of blocksize < PAGE_COSTLY_ORDER (32k) is reasonably
> safe.
I used EXACTLY the page sizes that you brought up in your patch
description (ie. 64K and 2MB).
On Tue, 11 Sep 2007, Nick Piggin wrote:
> But that's not my place to say, and I'm actually not arguing that high
> order pagecache does not have uses (especially as a practical,
> shorter-term solution which is unintrusive to filesystems).
>
> So no, I don't think I'm really going against the basics of what we agreed
> in Cambridge. But it sounds like it's still being billed as first-order
> support right off the bat here.
Well it seems that we have different interpretations of what was agreed
on. My understanding was that the large blocksize patchset was okay
provided that I supply an acceptable mmap implementation and put a
warning in.
> But even so, you can just hold an open fd in order to pin the dentry you
> want. My attack would go like this: get the page size and allocation group
> size for the machine, then get the number of dentries required to fill a
> slab. Then read in that many dentries and pin one of them. Repeat the
> process. Even if there is other activity on the system, it seems possible
> that such a thing will cause some headaches after not too long a time.
> Some sources of pinned memory are going to be better than others for
> this of course, so yeah maybe pagetables will be a bit easier (I don't know).
Well even without slab targeted reclaim: Mel's antifrag will sort the
dentries into separate blocks of memory and so isolate the issue.
On Tue, 11 Sep 2007, Nick Piggin wrote:
> > No you have not explained why the theoretical issues continue to exist
> > given even just considering Lumpy Reclaim in .23 nor what effect the
> > antifrag patchset would have.
>
> So how does lumpy reclaim, your slab patches, or anti-frag have
> much effect on the worst case situation? Or help much against a
> targetted fragmentation attack?
F.e. lumpy reclaim reclaims neighboring pages and thus works against
fragmentation. So your formula no longer works.
> > And you have used a 2M pagesize which is
> > irrelevant to this patchset that deals with blocksizes up to 64k. In my
> > experience the use of blocksize < PAGE_COSTLY_ORDER (32k) is reasonably
> > safe.
>
> I used EXACTLY the page sizes that you brought up in your patch
> description (ie. 64K and 2MB).
The patch currently only supports 64k. There is hope that it will support
2M at some point and as mentioned also a special large page pool facility
may be required.
Quoting from the post:
I would like to increase the supported blocksize to very large pages in
the future so that device drivers will be capable of providing large
contiguous mapping. For that purpose I think that we need a mechanism to
reserve pools of varying large sizes at boot time. Such a mechanism can
also be used to compensate in situations where one wants to use larger
buffers but defragmentation support is not (yet?) capable of reliably
providing pages of the desired sizes.
On (11/09/07 14:48), Christoph Lameter didst pronounce:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
>
> > But that's not my place to say, and I'm actually not arguing that high
> > order pagecache does not have uses (especially as a practical,
> > shorter-term solution which is unintrusive to filesystems).
> >
> > So no, I don't think I'm really going against the basics of what we agreed
> > in Cambridge. But it sounds like it's still being billed as first-order
> > support right off the bat here.
>
> Well it seems that we have different interpretations of what was agreed
> on. My understanding was that the large blocksize patchset was okay
> provided that I supply an acceptable mmap implementation and put a
> warning in.
>
Warnings == #2 citizen in my mind, with known potential failure cases. That
was the point, I thought.
> > But even so, you can just hold an open fd in order to pin the dentry you
> > want. My attack would go like this: get the page size and allocation group
> > size for the machine, then get the number of dentries required to fill a
> > slab. Then read in that many dentries and pin one of them. Repeat the
> > process. Even if there is other activity on the system, it seems possible
> > that such a thing will cause some headaches after not too long a time.
> > Some sources of pinned memory are going to be better than others for
> > this of course, so yeah maybe pagetables will be a bit easier (I don't know).
>
> Well even without slab targeted reclaim: Mel's antifrag will sort the
> dentries into separate blocks of memory and so isolate the issue.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wednesday 12 September 2007 07:48, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > But that's not my place to say, and I'm actually not arguing that high
> > order pagecache does not have uses (especially as a practical,
> > shorter-term solution which is unintrusive to filesystems).
> >
> > So no, I don't think I'm really going against the basics of what we
> > agreed in Cambridge. But it sounds like it's still being billed as
> > first-order support right off the bat here.
>
> Well it seems that we have different interpretations of what was agreed
> on. My understanding was that the large blocksize patchset was okay
> provided that I supply an acceptable mmap implementation and put a
> warning in.
Yes. I think we differ on our interpretations of "okay". In my interpretation,
it is not OK to use this patch as a way to solve VM or FS or IO scalability
issues, especially not while the alternative approaches that do _not_ have
these problems have not been adequately compared or argued against.
> > But even so, you can just hold an open fd in order to pin the dentry you
> > want. My attack would go like this: get the page size and allocation
> > group size for the machine, then get the number of dentries required to
> > fill a slab. Then read in that many dentries and pin one of them. Repeat
> > the process. Even if there is other activity on the system, it seems
> > possible that such a thing will cause some headaches after not too long a
> > time. Some sources of pinned memory are going to be better than others
> > for this of course, so yeah maybe pagetables will be a bit easier (I
> > don't know).
>
> Well even without slab targeted reclaim: Mel's antifrag will sort the
> dentries into separate blocks of memory and so isolate the issue.
So even after all this time you do not understand what the fundamental
problem is with anti-frag and yet you are happy to waste both our time
in endless flamewars telling me how wrong I am about it.
Forgive me if I'm starting to be rude, Christoph. This is really irritating.
On Tue, Sep 11, 2007 at 01:41:08PM -0700, Christoph Lameter wrote:
> The advantages of this approach over Andrea's are basically that the 4k
> filesystems still can be used as is. 4k is useful for binaries and for
If you mean that with my approach you can't use a 4k filesystem as is,
that's not correct. I even run the (admittedly premature but
promising) benchmarks on my patch on a 4k blocksized
filesystem... Guess what, you can even still mount a 1k fs on a 2.6
kernel.
The main advantage I can see in your patch is that distributions won't
need to ship a 64k PAGE_SIZE kernel rpm (but your single rpm will be
slower).
On Tue, 11 Sep 2007, Nick Piggin wrote:
> > Well it seems that we have different interpretations of what was agreed
> > on. My understanding was that the large blocksize patchset was okay
> > provided that I supply an acceptable mmap implementation and put a
> > warning in.
>
> Yes. I think we differ on our interpretations of "okay". In my interpretation,
> it is not OK to use this patch as a way to solve VM or FS or IO scalability
> issues, especially not while the alternative approaches that do _not_ have
> these problems have not been adequately compared or argued against.
We never talked about not being able to solve scalability issues with this
patchset. The alternate approaches were discussed at the VM MiniSummit and
at the VM/FS meeting. You organized the VM/FS summit. I know you were
there and were arguing for your approach. That was not sufficient?
> > Well even without slab targeted reclaim: Mel's antifrag will sort the
> > dentries into separate blocks of memory and so isolate the issue.
>
> So even after all this time you do not understand what the fundamental
> problem is with anti-frag and yet you are happy to waste both our time
> in endless flamewars telling me how wrong I am about it.
We obviously have discussed this before and the only point of you asking this
question seems to be to have me repeat the whole line of argument
again?
> Forgive me if I'm starting to be rude, Christoph. This is really irritating.
Sorry but I have had too much exposure to philosophy. Talk about absolutes
like guarantees (that do not exist even for order 0 allocs) and unlikely memory
fragmentation scenarios to show that something does not work seems to
be getting into some metaphysical realm where there is no data anymore
to draw any firm conclusions.
Software reliability is inherently probabilistic, otherwise we would not have
things like CRC sums and SHA1 algorithms. It's just a matter of reducing
the failure rate sufficiently. The failure rate for lower order
allocations (0-3) seems to have been significantly reduced in 2.6.23
through lumpy reclaim.
If antifrag measures are not successful (likely for 2M allocs) then other
methods (like the large page pools that you skipped when reading my post)
will need to be used.
On Wed, 12 Sep 2007, Andrea Arcangeli wrote:
> On Tue, Sep 11, 2007 at 01:41:08PM -0700, Christoph Lameter wrote:
> > The advantages of this approach over Andrea's are basically that the 4k
> > filesystems still can be used as is. 4k is useful for binaries and for
>
> If you mean that with my approach you can't use a 4k filesystem as is,
> that's not correct. I even run the (admittedly premature but
> promising) benchmarks on my patch on a 4k blocksized
> filesystem... Guess what, you can even still mount a 1k fs on a 2.6
> kernel.
Right, you can use a 4k filesystem. The 4k blocks are then buffers in a
larger page.
> The main advantage I can see in your patch is that distributions won't
> need to ship a 64k PAGE_SIZE kernel rpm (but your single rpm will be
> slower).
I would think that your approach would be slower since you always have to
populate 1 << N ptes when mmapping a file? Plus there is a lot of wastage
of memory because even a file with one character needs an order N page? So
there are fewer pages available for the same workload.
Then you are breaking mmap assumptions of applications because the order
N kernel will no longer be able to map 4k pages. You likely need a new
binary format that has pages correctly aligned. I know that we would need
one on IA64 if we go beyond the established page sizes.
On Tue, Sep 11, 2007 at 04:00:17PM +1000, Nick Piggin wrote:
> > > OTOH, I'm not sure how much buy-in there was from the filesystems guys.
> > > Particularly Christoph H and XFS (which is strange because they already
> > > do vmapping in places).
> >
> > I think they use vmapping because they have to, not because they want
> > to. They might be a lot happier with fsblock if it used contiguous pages
> > for large blocks whenever possible - I don't know for sure. The metadata
> > accessors they might be unhappy with because it's inconvenient but as
> > Christoph Hellwig pointed out at VM/FS, the filesystems who really care
> > will convert.
>
> Sure, they would rather not to. But there are also a lot of ways you can
> improve vmap more than what XFS does (or probably what darwin does)
> (more persistence for cached objects, and batched invalidates for example).
XFS already has persistence across the object life time (which can be many
tens of seconds for a frequently used buffer) and it also does batched
unmapping of objects as well.
> There are also a lot of trivial things you can do to make a lot of those
> accesses not require vmaps (and less trivial things, but even such things
> as binary searches over multiple pages should be quite possible with a bit
> of logic).
Yes, we already do many of these things (via xfs_buf_offset()), but
that is not good enough for something like a memcpy that spans multiple
pages in a large block (think btree block compaction, splits and recombines).
IOWs, we already play these vmap harm-minimisation games in the places
where we can, but still the overhead is high and something we'd prefer
to be able to avoid.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Wednesday 12 September 2007 11:49, David Chinner wrote:
> On Tue, Sep 11, 2007 at 04:00:17PM +1000, Nick Piggin wrote:
> > > > OTOH, I'm not sure how much buy-in there was from the filesystems
> > > > guys. Particularly Christoph H and XFS (which is strange because they
> > > > already do vmapping in places).
> > >
> > > I think they use vmapping because they have to, not because they want
> > > to. They might be a lot happier with fsblock if it used contiguous
> > > pages for large blocks whenever possible - I don't know for sure. The
> > > metadata accessors they might be unhappy with because it's inconvenient
> > > but as Christoph Hellwig pointed out at VM/FS, the filesystems who
> > > really care will convert.
> >
> > Sure, they would rather not to. But there are also a lot of ways you can
> > improve vmap more than what XFS does (or probably what darwin does)
> > (more persistence for cached objects, and batched invalidates for
> > example).
>
> XFS already has persistence across the object life time (which can be many
> tens of seconds for a frequently used buffer)
But you don't do a very good job. When you go above 64 cached mappings,
you purge _all_ of them. fsblock's vmap cache can have a much higher number
(if you want), and purging can just unmap a batch which is decided by a simple
LRU (thus important metadata gets saved).
> and it also does batched
> unmapping of objects as well.
It also could do a lot better at unmapping. Currently you're just calling
vunmap a lot of times in sequence. That still requires global IPIs and TLB
flushing every time.
This simple patch should easily be able to reduce that number by 2 or 3
orders of magnitude (maybe more on big systems).
http://www.mail-archive.com/[email protected]/msg03956.html
vmap area locking and data structures could also be made a lot better
quite easily, I suspect.
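To make the batching point concrete, here is a toy user-space model (not the
patch linked above; the event count and batch size are arbitrary) that
accumulates pending ranges and issues one flush per batch instead of one per
unmap:

#include <stdio.h>

#define NR_UNMAPS 100000UL      /* arbitrary number of vunmap-like events */
#define BATCH     256           /* arbitrary batch threshold              */

static unsigned long flushes;

/* stands in for the cross-CPU IPIs + TLB flush over [start, end) */
static void flush_range(unsigned long start, unsigned long end)
{
        (void)start;
        (void)end;
        flushes++;
}

int main(void)
{
        unsigned long i, lo = ~0UL, hi = 0;
        int pending = 0;

        for (i = 0; i < NR_UNMAPS; i++) {
                unsigned long start = i * 4096, end = start + 4096;

                /* an unbatched implementation would flush right here */
                if (start < lo)
                        lo = start;
                if (end > hi)
                        hi = end;
                if (++pending == BATCH) {
                        flush_range(lo, hi);    /* one flush per batch */
                        pending = 0;
                        lo = ~0UL;
                        hi = 0;
                }
        }
        if (pending)
                flush_range(lo, hi);

        printf("%lu unmaps, %lu flushes (vs %lu unbatched)\n",
               NR_UNMAPS, flushes, NR_UNMAPS);
        return 0;
}

With a batch of 256 that is already well over two orders of magnitude fewer
flushes; how much of that is achievable in practice depends on how large the
batches can be made.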
> > There are also a lot of trivial things you can do to make a lot of those
> > accesses not require vmaps (and less trivial things, but even such things
> > as binary searches over multiple pages should be quite possible with a
> > bit of logic).
>
> Yes, we already do many of these things (via xfs_buf_offset()), but
> that is not good enough for something like a memcpy that spans multiple
> pages in a large block (think btree block compaction, splits and
> recombines).
fsblock_memcpy(fsblock *src, int soff, fsblock *dst, int doff, int size); ?
> IOWs, we already play these vmap harm-minimisation games in the places
> where we can, but still the overhead is high and something we'd prefer
> to be able to avoid.
I don't think you've looked nearly far enough with all this low hanging
fruit.
I just gave 4 things which combined might easily reduce xfs vmap overhead
by several orders of magnitude, all without changing much code at all.
On Tue, Sep 11, 2007 at 05:04:41PM -0700, Christoph Lameter wrote:
> I would think that your approach would be slower since you always have to
> populate 1 << N ptes when mmapping a file? Plus there is a lot of wastage
I don't have to populate them, I could just map one at a time. The only
reason I want to populate every possible pte that could map that page
(by checking vma ranges) is to _improve_ performance by decreasing the
number of page faults by an order of magnitude. Then with the 62nd bit
after NX giving me a 64k tlb, I could decrease the frequency of the
tlb misses too.
> of memory because even a file with one character needs an order N page? So
> there are fewer pages available for the same workload.
This is a known issue. The same is true for ppc64 64k. If that really
is an issue, that may need some generic solution with tail packing.
> Then you are breaking mmap assumptions of applications because the order
> N kernel will no longer be able to map 4k pages. You likely need a new
> binary format that has pages correctly aligned. I know that we would need
> one on IA64 if we go beyond the established page sizes.
No you misunderstood the whole design. My patch will be 100% backwards
compatible in all respects. If I could break backwards compatibility
70% of the complexity would go away...
Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
>
>> But that's not my place to say, and I'm actually not arguing that high
>> order pagecache does not have uses (especially as a practical,
>> shorter-term solution which is unintrusive to filesystems).
>>
>> So no, I don't think I'm really going against the basics of what we agreed
>> in Cambridge. But it sounds like it's still being billed as first-order
>> support right off the bat here.
>
> Well it seems that we have different interpretations of what was agreed
> on. My understanding was that the large blocksize patchset was okay
> provided that I supply an acceptable mmap implementation and put a
> warning in.
I think all we agreed on was that both patches needed significant work
and would need to be judged after they were completed ;-)
There was talk of putting Christoph's approach in more-or-less as-is
as a very specialized and limited application ... but I don't think
we concluded anything for the more general and long-term case apart
from "this is hard" ;-)
M.
On Wednesday 12 September 2007 07:52, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > > No you have not explained why the theoretical issues continue to exist
> > > given even just considering Lumpy Reclaim in .23 nor what effect the
> > > antifrag patchset would have.
> >
> > So how does lumpy reclaim, your slab patches, or anti-frag have
> > much effect on the worst case situation? Or help much against a
> > targetted fragmentation attack?
>
> F.e. lumpy reclaim reclaims neighboring pages and thus works against
> fragmentation. So your formula no longer works.
OK, I'll describe how it works and what the actual problem with it is. I
haven't looked at the patches for a fair while so you can forgive my
inaccuracies in terminology or exact details.
So anti-frag groups memory into (say) 2MB chunks. Top priority heuristic
is that allocations which are movable all go into groups with other movable
memory and allocations which are not movable do not go into these
groups. This is flexible though, so if a workload wants to use more non
movable memory, it is allowed to eat into first free, then movable
groups after filling all non-movable groups. This is important because
it is what makes anti-frag flexible (otherwise it would just be a memory
reserve in another form).
In my attack, I cause the kernel to allocate lots of unmovable allocations
and deplete movable groups. I theoretically then only need to keep a
small number (1/2^N) of these allocations around in order to DoS a
page allocation of order N.
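To put rough numbers on that figure (assuming 4k base pages and worst-case
placement, purely as an illustration): blocking every order-4 (64k) allocation
needs one pinned page in each 16-page block, i.e. about 1/16th of memory in
well-placed unmovable allocations; for order-9 (2MB) the fraction drops to
1/512, well under one percent.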
And it doesn't even have to be a DoS. The natural fragmentation
that occurs in a kernel today has the possibility to slowly push out
the movable groups and give you the same situation.
Now there are lots of other little heuristics, *including lumpy reclaim
and various slab reclaim improvements*, that improve the effectiveness
or speed of this thing, but at the end of the day, it has the same basic
issues. Unless you can move practically any currently unmovable
allocation (which will either be a lot of intrusive code or require a
vmapped kernel), then you can't get around the fundamental problem.
And if you do get around the fundamental problem, you don't really
need to group pages by mobility any more because they are all
movable[*].
So lumpy reclaim does not change my formula nor significantly help
against a fragmentation attack. AFAIKS.
[*] ok, this isn't quite true because if you can actually put a hard limit on
unmovable allocations then anti-frag will fundamentally help -- get back to
me on that when you get patches to move most of the obvious ones.
Like pinned dentries, inodes, buffer heads, page tables, task structs, mm
structs, vmas, anon_vmas, radix-tree nodes, etc.
> > > And you have used a 2M pagesize which is
> > > irrelevant to this patchset that deals with blocksizes up to 64k. In my
> > > experience the use of blocksize < PAGE_COSTLY_ORDER (32k) is reasonably
> > > safe.
> >
> > I used EXACTLY the page sizes that you brought up in your patch
> > description (ie. 64K and 2MB).
>
> The patch currently only supports 64k.
Sure, and I pointed out the theoretical figure for 64K pages as well. Is that
figure not problematic to you? Where do you draw the limit for what is
acceptable? Why? What happens with tiny memory machines where a reserve
or even the anti-frag patches may not be acceptable and/or work very well?
When do you require reserve pools? Why are reserve pools acceptable for
first-class support of filesystems when it has very loudly been made a
known policy decision by Linus in the past (and for some valid reasons) that
we should not put limits on the sizes of caches in the kernel.
On Wednesday 12 September 2007 11:49, David Chinner wrote:
> On Tue, Sep 11, 2007 at 04:00:17PM +1000, Nick Piggin wrote:
> > > > OTOH, I'm not sure how much buy-in there was from the filesystems
> > > > guys. Particularly Christoph H and XFS (which is strange because they
> > > > already do vmapping in places).
> > >
> > > I think they use vmapping because they have to, not because they want
> > > to. They might be a lot happier with fsblock if it used contiguous
> > > pages for large blocks whenever possible - I don't know for sure. The
> > > metadata accessors they might be unhappy with because it's inconvenient
> > > but as Christoph Hellwig pointed out at VM/FS, the filesystems who
> > > really care will convert.
> >
> > Sure, they would rather not to. But there are also a lot of ways you can
> > improve vmap more than what XFS does (or probably what darwin does)
> > (more persistence for cached objects, and batched invalidates for
> > example).
>
> XFS already has persistence across the object life time (which can be many
> tens of seconds for a frequently used buffer)
But you don't do a very good job. When you go above 64 vmaps cached, you
purge _all_ of them. fsblock's vmap cache can have a much higher number
(if you want), and purging will only unmap a smaller batch, decided by a
simple LRU.
> and it also does batched
> unmapping of objects as well.
It also could do a lot better at unmapping. Currently you're just calling
vunmap a lot of times in sequence. That still requires global IPIs and TLB
flushing every time.
This simple patch should easily be able to reduce that number by 2 or 3
orders of magnitude on 64-bit systems. Maybe more if you increase the
batch size.
http://www.mail-archive.com/[email protected]/msg03956.html
vmap area manipulation scalability and search complexity could also be
improved quite easily, I suspect.
> > There are also a lot of trivial things you can do to make a lot of those
> > accesses not require vmaps (and less trivial things, but even such things
> > as binary searches over multiple pages should be quite possible with a
> > bit of logic).
>
> Yes, we already do many of these things (via xfs_buf_offset()), but
> that is not good enough for something like a memcpy that spans multiple
> pages in a large block (think btree block compaction, splits and
> recombines).
fsblock_memcpy(fsblock *src, int soff, fsblock *dst, int doff, int size); ?
> IOWs, we already play these vmap harm-minimisation games in the places
> where we can, but still the overhead is high and something we'd prefer
> to be able to avoid.
I don't think you've looked very far with all this low hanging fruit.
The several ways I suggested combined might easily reduce xfs vmap
overhead by several orders of magnitude, all without changing much
code at all.
Can you provide a formula to reproduce these workloads where vmap
overhead in XFS is a problem? (huge IO capacity need not be an issue,
because I could try reproducing it on a 64-way on ramdisks for example).
On Wednesday 12 September 2007 10:00, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > Yes. I think we differ on our interpretations of "okay". In my
> > interpretation, it is not OK to use this patch as a way to solve VM or FS
> > or IO scalability issues, especially not while the alternative approaches
> > that do _not_ have these problems have not been adequately compared or
> > argued against.
>
> We never talked about not being able to solve scalability issues with this
> patchset. The alternate approaches were discussed at the VM MiniSummit and
> at the VM/FS meeting. You organized the VM/FS summit. I know you were
> there and were arguing for your approach. That was not sufficient?
I will still argue that my approach is the better technical solution for large
block support than yours, I don't think we made progress on that. And I'm
quite sure we agreed at the VM summit not to rely on your patches for
VM or IO scalability.
> > So even after all this time you do not understand what the fundamental
> > problem is with anti-frag and yet you are happy to waste both our time
> > in endless flamewars telling me how wrong I am about it.
>
> We obviously have discussed this before and the only point of you asking this
> question seems to be to have me repeat the whole line of argument
> again?
But you just showed in two emails that you don't understand what the
problem is. To reiterate: lumpy reclaim does *not* invalidate my formulae;
and antifrag does *not* isolate the issue.
> > Forgive me if I'm starting to be rude, Christoph. This is really
> > irritating.
>
> Sorry but I have had too much exposure to philosophy. Talk about absolutes
> like guarantees (that do not exist even for order 0 allocs) and unlikely
> memory fragmentation scenarios to show that something does not work seems
> to be getting into some metaphysical realm where there is no data anymore
> to draw any firm conclusions.
Some problems can not be solved easily or at all within reasonable
constraints. I have no problems with that. And it is a valid stance to
take on the fragmentation issue, and the hash collision problem, etc.
But what do you say about viable alternatives that do not have to
worry about these "unlikely scenarios", full stop? So, why should we
not use fsblock for higher order page support?
Last time this came up, you dismissed the fsblock approach because it
adds another layer of complexity (even though it is simpler than the
buffer layer it replaces); and also had some other strange objection like
you thought it provides higher order pages or something.
And as far as the problems I bring up with your approach, you say they
shouldn't be likely, don't really matter, can be fixed as we go along, or
want me to show they are a problem in a real world situation!! :P
> Software reliability is inherently probabilistic, otherwise we would not have
> things like CRC sums and SHA1 algorithms. It's just a matter of reducing
> the failure rate sufficiently. The failure rate for lower order
> allocations (0-3) seems to have been significantly reduced in 2.6.23
> through lumpy reclaim.
>
> If antifrag measures are not successful (likely for 2M allocs) then other
> methods (like the large page pools that you skipped when reading my post)
> will need to be used.
I didn't skip that. We have large page pools today. How does that give
first-class support to those allocations if you have to have memory
reserves?
On Wed, 12 Sep 2007, Nick Piggin wrote:
> In my attack, I cause the kernel to allocate lots of unmovable allocations
> and deplete movable groups. I theoretically then only need to keep a
> small number (1/2^N) of these allocations around in order to DoS a
> page allocation of order N.
True. That is why we want to limit the number of unmovable allocations and
that is why ZONE_MOVABLE exists to limit those. However, unmovable
allocations are already rare today. The overwhelming majority of
allocations are movable and reclaimable. You can see that f.e. by looking
at /proc/meminfo and seeing how high SUnreclaim: is (it does not catch
everything but it's a good indicator).
> Now there are lots of other little heuristics, *including lumpy reclaim
> and various slab reclaim improvements*, that improve the effectiveness
> or speed of this thing, but at the end of the day, it has the same basic
All of these methods also have their own purpose aside from the mobility
patches.
> issues. Unless you can move practically any currently unmovable
> allocation (which will either be a lot of intrusive code or require a
> vmapped kernel), then you can't get around the fundamental problem.
> And if you do get around the fundamental problem, you don't really
> need to group pages by mobility any more because they are all
> movable[*].
>
> So lumpy reclaim does not change my formula nor significantly help
> against a fragmentation attack. AFAIKS.
Lumpy reclaim improves the situation significantly because the
overwhelming majority of allocations during the lifetime of a system are
movable and thus it is able to opportunistically restore the availability
of higher order pages by reclaiming neighboring pages.
> [*] ok, this isn't quite true because if you can actually put a hard limit on
> unmovable allocations then anti-frag will fundamentally help -- get back to
> me on that when you get patches to move most of the obvious ones.
We have this hard limit using ZONE_MOVABLE in 2.6.23.
> > The patch currently only supports 64k.
>
> Sure, and I pointed out the theoretical figure for 64K pages as well. Is that
> figure not problematic to you? Where do you draw the limit for what is
> acceptable? Why? What happens with tiny memory machines where a reserve
> or even the anti-frag patches may not be acceptable and/or work very well?
> When do you require reserve pools? Why are reserve pools acceptable for
> first-class support of filesystems when it has very loudly been made a
> known policy decision by Linus in the past (and for some valid reasons) that
> we should not put limits on the sizes of caches in the kernel.
64K pages may be problematic because they are above PAGE_ORDER_COSTLY in
2.6.23. 32K is currently much safer because lumpy reclaim can restore
these and does so on my systems. I expect the situation for 64K pages to
improve when more of Mel's patches go in. We have long term experience
with 32k sized allocation through Andrew's tree.
Reserve pools as handled by the (not yet available) large page pool
patches (which again have altogether another purpose) are not a limit. The
reserve pools are used to provide a minimum of higher order pages that is
not broken down in order to ensure that a minimum number of the desired
order of pages is even available in your worst case scenario. Mainly I
think that is needed during the period when memory defragmentation is
still under development.
On Wed, 12 Sep 2007, Nick Piggin wrote:
> I will still argue that my approach is the better technical solution for large
> block support than yours, I don't think we made progress on that. And I'm
> quite sure we agreed at the VM summit not to rely on your patches for
> VM or IO scalability.
The approach has already been tried (see the XFS layer) and found lacking.
Having a fake linear block through vmalloc means that a special software
layer must be introduced and we may face special casing in the block / fs
layer to check if we have one of these strange vmalloc blocks.
> But you just showed in two emails that you don't understand what the
> problem is. To reiterate: lumpy reclaim does *not* invalidate my formulae;
> and antifrag does *not* isolate the issue.
I do understand what the problem is. I just do not get what your problem
with this is and why you have this drive to demand perfection. We are
working a variety of approaches on the (potential) issue but you
categorically state that it cannot be solved.
> But what do you say about viable alternatives that do not have to
> worry about these "unlikely scenarios", full stop? So, why should we
> not use fsblock for higher order page support?
Because it has already been rejected in another form and adds more
layering to the filesystem and more checking for special cases in which
we only have virtual linearity? It does not reduce the number of page
structs that have to be handled by the lower layers etc.
Maybe we could get to something like a hybrid that avoids some of these
issues? Add support so something like a virtual compound page can be
handled transparently in the filesystem layer with special casing if
such a beast reaches the block layer?
> I didn't skip that. We have large page pools today. How does that give
> first-class support to those allocations if you have to have memory
> reserves?
See my other mail. That portion is not complete yet. Sorry.
On Wed, Sep 12, 2007 at 01:27:33AM +1000, Nick Piggin wrote:
> > IOWs, we already play these vmap harm-minimisation games in the places
> > where we can, but still the overhead is high and something we'd prefer
> > to be able to avoid.
>
> I don't think you've looked nearly far enough with all this low hanging
> fruit.
Ok, so we need to hack the vm to optimise it further. When it comes to
TLB flush code and optimising that sort of stuff, I'm out of my depth.
> I just gave 4 things which combined might easily reduce xfs vmap overhead
> by several orders of magnitude, all without changing much code at all.
Patches would be greatly appreciated. You obviously understand this
vm code much better than I do, so if it's easy to fix by adding some
generic vmap cache thingy, please do.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Thursday 13 September 2007 11:49, David Chinner wrote:
> On Wed, Sep 12, 2007 at 01:27:33AM +1000, Nick Piggin wrote:
> > I just gave 4 things which combined might easily reduce xfs vmap overhead
> > by several orders of magnitude, all without changing much code at all.
>
> Patches would be greatly appreciated. You obviously understand this
> vm code much better than I do, so if it's easy to fix by adding some
> generic vmap cache thingy, please do.
Well, it may not be easy to _fix_, but it's easy to try a few improvements ;)
How do I make an image and run a workload that will coerce XFS into
doing a significant number of vmaps?
On (12/09/07 16:17), Christoph Lameter didst pronounce:
> On Wed, 12 Sep 2007, Nick Piggin wrote:
>
> > I will still argue that my approach is the better technical solution for large
> > block support than yours, I don't think we made progress on that. And I'm
> > quite sure we agreed at the VM summit not to rely on your patches for
> > VM or IO scalability.
>
> The approach has already been tried (see the XFS layer) and found lacking.
>
> Having a fake linear block through vmalloc means that a special software
> layer must be introduced and we may face special casing in the block / fs
> layer to check if we have one of these strange vmalloc blocks.
>
One of Nick's points is that to have a 100% reliable solution, that is
what is required. We already have a layering between the VM and the FS
but my understanding is that fsblock replaces rather than adds to it.
Surely, we'll be able to detect the situation where the memory is really
contiguous as a fast path and have a slower path where fragmentation was
a problem.
> > But you just showed in two emails that you don't understand what the
> > problem is. To reiterate: lumpy reclaim does *not* invalidate my formulae;
> > and antifrag does *not* isolate the issue.
>
> I do understand what the problem is. I just do not get what your problem
> with this is and why you have this drive to demand perfection. We are
> working a variety of approaches on the (potential) issue but you
> categorically state that it cannot be solved.
>
This is going in circles.
His point is that we also cannot prove it is 100% correct in all
situations. Without taking additional (expensive) steps, there will be a
workload that fragments physical memory. He doesn't know what it is and neither
do we, but that does not mean that someone else won't find it. He also has a
point about the slow degradation of fragmentation that is woefully difficult
to reproduce. We've had this provability of correctness problem before.
His initial problem was not with the patches as such but the fact that they
seemed to be presented as a 1st class feature that we fully support and as
a potential solution for some VM and IO scalability problems. This is
not the case; we have to treat it as a 2nd class feature until we *know* no
situation exists where it breaks down. These patches on their own would have
to run for months if not a year or so before we could be really sure about it.
The only implementation question about these patches that hasn't been addressed
is the mmap() support. What's wrong with it in its current form? Can it be
fixed, or is it fundamentally screwed, etc.? That has fallen by the
wayside.
> > But what do you say about viable alternatives that do not have to
> > worry about these "unlikely scenarios", full stop? So, why should we
> > not use fsblock for higher order page support?
>
> Because it has already been rejected in another form and adds more
> layering to the filesystem and more checking for special cases in which
> we only have virtual linearity? It does not reduce the number of page
> structs that have to be handled by the lower layers etc.
>
Unless callers always use an iterator for blocks that is optimised in the
physically linear case to be a simple array offset and when not physically
linear it either walks chains (complex) or uses vmap (must deal with TLB
flushes among other things). If it optimistically uses physically contiguous
memory, we may find a way to use only one page struct as well.
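A rough sketch of such an iterator (the names and layout are hypothetical,
not fsblock's actual interface; page_address() assumes lowmem pages, so a
real version would need kmap for highmem):

#include <linux/kernel.h>
#include <linux/mm.h>

struct blk_iter {
        void            *vaddr;         /* non-NULL when the block is linear */
        struct page     **pages;        /* otherwise the discontiguous pages */
        size_t          off;            /* current offset into the block     */
        size_t          size;           /* total block size                  */
};

/* return the next chunk and its length, or NULL at the end of the block */
static void *blk_iter_next(struct blk_iter *it, size_t *len)
{
        void *p;

        if (it->off >= it->size)
                return NULL;
        if (it->vaddr) {
                /* fast path: linear in kernel address space, plain offset */
                *len = it->size - it->off;
                p = it->vaddr + it->off;
        } else {
                /* slow path: walk one page at a time */
                *len = min_t(size_t, PAGE_SIZE - (it->off & ~PAGE_MASK),
                             it->size - it->off);
                p = page_address(it->pages[it->off >> PAGE_SHIFT]) +
                    (it->off & ~PAGE_MASK);
        }
        it->off += *len;
        return p;
}

A memcpy over a large block then becomes a loop over blk_iter_next() on both
sides, and degenerates to a single plain memcpy whenever the block happens to
be linear.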
> Maybe we could get to something like a hybrid that avoids some of these
> issues?
Or gee whiz, I don't know. Start with your patches as a strictly 2nd class
citizen and build fsblock in while trying to keep use of physically contiguous
memory where possible and it makes sense.
> Add support so something like a virtual compound page can be
> handled transparently in the filesystem layer with special casing if
> such a beast reaches the block layer?
>
> > I didn't skip that. We have large page pools today. How does that give
> > first-class support to those allocations if you have to have memory
> > reserves?
>
> See my other mail. That portion is not complete yet. Sorry.
>
I am *very* wary of using reserve pools for anything other than
emergency situations. If nothing else pools == wasted memory + a sizing
problem. But hey, it is one option.
Are we going to agree on some sort of plan or are we just going to
handwave ourselves to death?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, Sep 13, 2007 at 03:23:21AM +1000, Nick Piggin wrote:
> On Thursday 13 September 2007 11:49, David Chinner wrote:
> > On Wed, Sep 12, 2007 at 01:27:33AM +1000, Nick Piggin wrote:
>
> > > I just gave 4 things which combined might easily reduce xfs vmap overhead
> > > by several orders of magnitude, all without changing much code at all.
> >
> > Patches would be greatly appreciated. You obviously understand this
> > vm code much better than I do, so if it's easy to fix by adding some
> > generic vmap cache thingy, please do.
>
> Well, it may not be easy to _fix_, but it's easy to try a few improvements ;)
>
> How do I make an image and run a workload that will coerce XFS into
> doing a significant number of vmaps?
# mkfs.xfs -n size=16384 <dev>
to create a filesystem with a 16k directory block size on a 4k page
machine.
Then just do operations on directories with lots of files in them
(tens of thousands). Every directory operation will require at
least one vmap in this situation - e.g. a traversal will result in
lots and lots of blocks being read that will require vmap() for every
directory block read from disk and an unmap almost immediately
afterwards when the reference is dropped....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
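For reference, a throwaway populator along the lines Dave describes (the
default path and file count below are made up; the target directory must
already exist on the 16k-dir-block filesystem):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *dir = argc > 1 ? argv[1] : "/mnt/test/bigdir";
        int nfiles = argc > 2 ? atoi(argv[2]) : 50000;
        char path[256];
        int i, fd;

        for (i = 0; i < nfiles; i++) {
                snprintf(path, sizeof(path), "%s/file-%06d", dir, i);
                fd = open(path, O_CREAT | O_WRONLY, 0644);
                if (fd < 0) {
                        perror(path);
                        exit(1);
                }
                close(fd);
        }
        return 0;
}

Dropping caches and then running something like find over that directory
forces every directory block back in from disk, each read taking a vmap and
an unmap shortly afterwards.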
On Thursday 13 September 2007 23:03, David Chinner wrote:
> On Thu, Sep 13, 2007 at 03:23:21AM +1000, Nick Piggin wrote:
> > Well, it may not be easy to _fix_, but it's easy to try a few
> > improvements ;)
> >
> > How do I make an image and run a workload that will coerce XFS into
> > doing a significant number of vmaps?
>
> # mkfs.xfs -n size=16384 <dev>
>
> to create a filesystem with a 16k directory block size on a 4k page
> machine.
>
> Then just do operations on directories with lots of files in them
> (tens of thousands). Every directory operation will require at
> least one vmap in this situation - e.g. a traversal will result in
> lots and lots of blocks being read that will require vmap() for every
> directory block read from disk and an unmap almost immediately
> afterwards when the reference is dropped....
Ah, wow, thanks: I can reproduce it.
On Thu, 13 Sep 2007, Mel Gorman wrote:
> Surely, we'll be able to detect the situation where the memory is really
> contiguous as a fast path and have a slower path where fragmentation was
> a problem.
Yes, I have a draft here now of a virtual compound page solution that I am
testing with SLUB. The page allocator first tries to allocate a physically
contiguous page. If that is not possible then we string together a virtually
contiguous page through order-0 allocs and vmalloc.
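A rough sketch of that fallback idea (a hypothetical helper, not the draft
actually being tested, with minimal error handling and assuming a gfp mask
without highmem so that page_address() is valid): try the physically
contiguous allocation first and only stitch order-0 pages together with
vmap() when it fails.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *vcompound_alloc(gfp_t gfp, unsigned int order)
{
        unsigned int i, nr = 1 << order;
        struct page *page, **pages;
        void *addr;

        /* fast path: one physically contiguous higher-order allocation */
        page = alloc_pages(gfp | __GFP_NOWARN, order);
        if (page)
                return page_address(page);

        /* fallback: order-0 pages made virtually contiguous via vmap() */
        pages = kmalloc(nr * sizeof(*pages), gfp);
        if (!pages)
                return NULL;
        for (i = 0; i < nr; i++) {
                pages[i] = alloc_page(gfp);
                if (!pages[i])
                        goto fail;
        }
        addr = vmap(pages, nr, VM_MAP, PAGE_KERNEL);
        if (addr)
                return addr;    /* a real version must keep 'pages' for freeing */
fail:
        while (i--)
                __free_page(pages[i]);
        kfree(pages);
        return NULL;
}

The freeing side is where the special casing discussed earlier shows up: the
caller (or a freeing helper) has to know, or detect from the address range,
whether the memory came from the contiguous path or from vmap().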
> The only implementation question about these patches that hasn't been addressed
> is the mmap() support. What's wrong with it in its current form? Can it be
> fixed, or is it fundamentally screwed, etc.? That has fallen by the
> wayside.
Yes, I have not heard any feedback about that. I have some more ideas on how
to clean it up even more in the meantime. No feedback usually means no
objections.....
> I am *very* wary of using reserve pools for anything other than
> emergency situations. If nothing else pools == wasted memory + a sizing
> problem. But hey, it is one option.
Well, we need the page pools anyway for large page sizes > MAX_ORDER. But
> Are we going to agree on some sort of plan or are we just going to
> handwave ourselves to death?
I'll try to do the virtually mapped compound pages. Hopefully that will
allow fallback for compound pages in a generic way.
On Thursday 13 September 2007 12:01, Nick Piggin wrote:
> On Thursday 13 September 2007 23:03, David Chinner wrote:
> > Then just do operations on directories with lots of files in them
> > (tens of thousands). Every directory operation will require at
> > least one vmap in this situation - e.g. a traversal will result in
> > lots and lots of blocks being read that will require vmap() for every
> > directory block read from disk and an unmap almost immediately
> > afterwards when the reference is dropped....
>
> Ah, wow, thanks: I can reproduce it.
OK, the vunmap batching code wipes your TLB flushing and IPIs off
the table. Diffstat below, but the TLB portions are here (besides that
_everything_ is probably lower due to fewer TLB misses caused by the
TLB flushing):
-170 -99.4% sn2_send_IPI
-343 -100.0% sn_send_IPI_phys
-17911 -99.9% smp_call_function
Total performance went up by 30% on a 64-way system (248 seconds to
172 seconds to run parallel finds over different huge directories).
23012 54790.5% _read_lock
9427 329.0% __get_vm_area_node
5792 0.0% __find_vm_area
1590 53000.0% __vunmap
107 26.0% _spin_lock
74 119.4% _xfs_buf_find
58 0.0% __unmap_kernel_range
53 36.6% kmem_zone_alloc
-129 -100.0% pio_phys_write_mmr
-144 -100.0% unmap_kernel_range
-170 -99.4% sn2_send_IPI
-233 -59.1% kfree
-266 -100.0% find_next_bit
-343 -100.0% sn_send_IPI_phys
-564 -19.9% xfs_iget_core
-1946 -100.0% remove_vm_area
-17911 -99.9% smp_call_function
-62726 -7.2% _write_lock
-438360 -64.2% default_idle
-482631 -30.4% total
Next I have some patches to scale the vmap locks and data
structures better, but they're not quite ready yet. This looks like it
should result in a further speedup of several times when combined
with the TLB flushing reductions here...
On Thursday 13 September 2007 09:06, Christoph Lameter wrote:
> On Wed, 12 Sep 2007, Nick Piggin wrote:
> > So lumpy reclaim does not change my formula nor significantly help
> > against a fragmentation attack. AFAIKS.
>
> Lumpy reclaim improves the situation significantly because the
> overwhelming majority of allocations during the lifetime of a system are
> movable and thus it is able to opportunistically restore the availability
> of higher order pages by reclaiming neighboring pages.
I'm talking about non movable allocations.
> > [*] ok, this isn't quite true because if you can actually put a hard
> > limit on unmovable allocations then anti-frag will fundamentally help --
> > get back to me on that when you get patches to move most of the obvious
> > ones.
>
> We have this hard limit using ZONE_MOVABLE in 2.6.23.
So we're back to 2nd class support.
> > Sure, and I pointed out the theoretical figure for 64K pages as well. Is
> > that figure not problematic to you? Where do you draw the limit for what
> > is acceptable? Why? What happens with tiny memory machines where a
> > reserve or even the anti-frag patches may not be acceptable and/or work
> > very well? When do you require reserve pools? Why are reserve pools
> > acceptable for first-class support of filesystems when it has very
> > loudly been made a known policy decision by Linus in the past (and for
> > some valid reasons) that we should not put limits on the sizes of caches
> > in the kernel.
>
> 64K pages may be problematic because they are above PAGE_ORDER_COSTLY in
> 2.6.23. 32K is currently much safer because lumpy reclaim can restore
> these and does so on my systems. I expect the situation for 64K pages to
> improve when more of Mel's patches go in. We have long term experience
> with 32k sized allocation through Andrew's tree.
>
> Reserve pools as handled by the (not yet available) large page pool
> patches (which again have altogether another purpose) are not a limit. The
> reserve pools are used to provide a minimum of higher order pages that is
> not broken down in order to ensure that a minimum number of the desired
> order of pages is even available in your worst case scenario. Mainly I
> think that is needed during the period when memory defragmentation is
> still under development.
fsblock doesn't need any of those hacks, of course.
On Thursday 13 September 2007 09:17, Christoph Lameter wrote:
> On Wed, 12 Sep 2007, Nick Piggin wrote:
> > I will still argue that my approach is the better technical solution for
> > large block support than yours, I don't think we made progress on that.
> > And I'm quite sure we agreed at the VM summit not to rely on your patches
> > for VM or IO scalability.
>
> The approach has already been tried (see the XFS layer) and found lacking.
It is lacking because our vmap algorithms are simplistic to the point
of being utterly inadequate for the new requirements. There has not
been any fundamental problem revealed (like the fragmentation
approach has).
However fsblock can do everything that higher order pagecache can
do in terms of avoiding vmap and giving contiguous memory to block
devices by opportunistically allocating higher orders of pages, and falling
back to vmap if they cannot be satisfied.
So if you argue that vmap is a downside, then please tell me how you
consider the -ENOMEM of your approach to be better?
> Having a fake linear block through vmalloc means that a special software
> layer must be introduced and we may face special casing in the block / fs
> layer to check if we have one of these strange vmalloc blocks.
I guess you're a little confused. There is nothing fake about the linear
address. Filesystems of course need changes (the block layer needs
none -- why would it?). But changing APIs to accommodate a better
solution is what Linux is about.
If, by special software layer, you mean the vmap/vunmap support in
fsblock, let's see... that's probably all of a hundred or two lines.
Contrast that with anti-fragmentation, lumpy reclaim, higher order
pagecache and its new special mmap layer... Hmm, seems like a no
brainer to me. You really still want to pursue the "extra layer"
argument as a point against fsblock here?
> > But you just showed in two emails that you don't understand what the
> > problem is. To reiterate: lumpy reclaim does *not* invalidate my
> > formulae; and antifrag does *not* isolate the issue.
>
> I do understand what the problem is. I just do not get what your problem
> with this is and why you have this drive to demand perfection. We are
Oh. I don't think I could explain that if you still don't understand by now.
But that's not the main issue: all that I ask is you consider fsblock on
technical grounds.
> working a variety of approaches on the (potential) issue but you
> categorically state that it cannot be solved.
Of course I wouldn't state that. On the contrary, I categorically state that
I have already solved it :)
> > But what do you say about viable alternatives that do not have to
> > worry about these "unlikely scenarios", full stop? So, why should we
> > not use fsblock for higher order page support?
>
> Because it has already been rejected in another form and adds more
You have rejected it. But they are bogus reasons, as I showed above.
You also describe some other real (if lesser) issues like number of page
structs to manage in the pagecache. But this is hardly enough to reject
my patch now... for every downside you can point out in my approach, I
can point out one in yours.
- fsblock doesn't require changes to virtual memory layer
- fsblock can retain cache of just 4K in a > 4K block size file
How about those? I know very well how Linus feels about both of them.
> Maybe we could get to something like a hybrid that avoids some of these
> issues? Add support so something like a virtual compound page can be
> handled transparently in the filesystem layer with special casing if
> such a beast reaches the block layer?
That's conceptually much worse, IMO.
And practically worse as well, for two reasons: vmap space is limited on
32-bit, and the fsblock approach can avoid vmap completely in many cases.
The fsblock data accessor APIs aren't _that_ bad a change. They change
zero conceptually in the filesystem, are arguably cleaner, and can be
essentially nooped if we wanted to stay with a b_data type approach
(but they give you that flexibility to replace it with any implementation).
Hi,
Nick Piggin <[email protected]> writes:
> In my attack, I cause the kernel to allocate lots of unmovable allocations
> and deplete movable groups. I theoretically then only need to keep a
> small number (1/2^N) of these allocations around in order to DoS a
> page allocation of order N.
I'm assuming that when an unmovable allocation hijacks a movable group,
any further unmovable allocation will evict movable objects out of that
group before hijacking another one. Right?
> And it doesn't even have to be a DoS. The natural fragmentation
> that occurs today in a kernel today has the possibility to slowly push out
> the movable groups and give you the same situation.
How would you cause that? Say you want to purposefully place one
unmovable 4k page into every 64k compound page. So you allocate
4K. First 64k page locked. But now, to get 4K into the second 64K page
you have to first use up all the rest of the first 64k page, meaning
one 4k chunk, one 8k chunk, one 16k chunk, one 32k chunk. Only then
will a new 64k chunk be broken and become locked.
So to get the last 64k chunk used, all previous 32k chunks need to be
blocked and you need to allocate 32k (or less if more is blocked). For
all previous 32k chunks to be blocked, every second 16k needs to be
blocked. To block the last of those 16k chunks, all previous 8k chunks
need to be blocked and you need to allocate 8k. For all previous 8k
chunks to be blocked, every second 4k page needs to be used. To alloc
the last of those 4k pages, all previous 4k pages need to be used.
So to construct a situation where no contiguous 64k chunk is free you
have to allocate <total mem> - 64k - 32k - 16k - 8k - 4k (or
thereabouts) of memory first. Only then could you free memory again while
still keeping every 64k page blocked. Does that occur naturally given
enough ram to start with?
To see how bad fragmentation could be, I wrote a little program to
simulate allocations with the following simplified algorithm:
Memory management:
- Free pages are kept in buckets, one per order, and sorted by address.
- alloc() takes the front page (smallest address) out of the bucket of the
right order, or recursively splits the next higher bucket.
- free() recursively tries to merge a page with its neighbour and puts
the result back into the proper bucket (sorted by address).
Allocation and lifetime:
- Every tick a new page is allocated with random order.
- The order is a triangle distribution with its maximum at 0 (throw 2 dice,
add the eyes, subtract 7, abs() the number).
- The page is scheduled to be freed after X ticks, where X is roughly
a Gaussian curve with its maximum at <total num pages> * 1.5.
(What I actually do is throw 8 dice, sum them up and shift the
result; a rough sketch of both distributions follows below.)
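For illustration, here is a minimal C sketch of the two random distributions
just described. The function names, the fixed 256MB figure and the exact
lifetime scaling are my own choices for the sketch, not taken from the
original program:

#include <stdio.h>
#include <stdlib.h>

/* Order distribution as described above: throw two dice, add the eyes,
 * subtract 7 and take the absolute value.  Result is in 0..5, skewed
 * toward the small orders. */
static int random_order(void)
{
        int d1 = rand() % 6 + 1;
        int d2 = rand() % 6 + 1;

        return abs(d1 + d2 - 7);
}

/* Roughly bell-shaped lifetime: sum of eight dice (8..48, mean 28),
 * scaled so that the mean lands near 1.5 * total_pages ticks.  The
 * exact shift/scale of the original program is not specified, so this
 * is just one plausible reading. */
static long random_lifetime(long total_pages)
{
        long sum = 0;
        int i;

        for (i = 0; i < 8; i++)
                sum += rand() % 6 + 1;
        return sum * total_pages * 3 / 56;
}

int main(void)
{
        long total_pages = 256L * 1024 * 1024 / 4096;   /* 256MB of 4k pages */

        printf("order=%d lifetime=%ld ticks\n",
               random_order(), random_lifetime(total_pages));
        return 0;
}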
Display:
I start with a white window. Every page allocation draws a black box
starting at the address of the page and as wide as the page is big (minus
1 pixel to give a separation to the next page). Every page free draws a
yellow box in place of the black one. Yellow shows where a page was in use
at one point, while white means the page was never used.
As time ticks on the memory fills up, quickly at first, and then comes
to a stop at around 80% filled. And then something interesting
happens. The yellow regions (previously used but now free) start
drifting up. Small pages tend to end up at the lower addresses and big
pages at the higher addresses. The memory defragments itself to some
degree.
http://mrvn.homeip.net/fragment/
Simulating 256MB ram and after 1472943 ticks and 530095 4k, 411841 8k,
295296 16k, 176647 32k and 59064 64k allocations you get this:
http://mrvn.homeip.net/fragment/256mb.png
Simulating 1GB ram and after 5881185 ticks and 2116671 4k, 1645957
8k, 1176994 16k, 705873 32k and 235690 64k allocations you get this:
http://mrvn.homeip.net/fragment/1gb.png
MfG
Goswin
On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
> Nick Piggin <[email protected]> writes:
>
> > In my attack, I cause the kernel to allocate lots of unmovable allocations
> > and deplete movable groups. I theoretically then only need to keep a
> > small number (1/2^N) of these allocations around in order to DoS a
> > page allocation of order N.
>
> I'm assuming that when an unmovable allocation hijacks a movable group
> any further unmovable alloc will evict movable objects out of that
> group before hijacking another one. right?
>
No eviction takes place. If an unmovable allocation gets placed in a
movable group, then steps are taken to ensure that future unmovable
allocations will take place in the same range (these decisions take
place in __rmqueue_fallback()). When choosing a movable block to
pollute, it will also choose the lowest possible block in PFN terms to
steal so that fragmentation pollution will be as confined as possible.
Evicting the unmovable pages would be one of those expensive steps that
have been avoided to date.
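To make that policy concrete, here is a small standalone C sketch of the
selection rule described above (steal from the movable block with the lowest
starting PFN). It is only an illustration of the idea, not the actual
__rmqueue_fallback() code, and every name in it is made up:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical block descriptor, just for this sketch. */
struct pageblock {
        unsigned long start_pfn;
        bool movable;
        bool has_free_pages;
};

/* When an unmovable allocation has to fall back to a movable block,
 * prefer the candidate with the lowest starting PFN so the pollution
 * stays confined to one end of memory. */
static struct pageblock *pick_fallback_block(struct pageblock *blocks,
                                             size_t nr_blocks)
{
        struct pageblock *best = NULL;
        size_t i;

        for (i = 0; i < nr_blocks; i++) {
                struct pageblock *b = &blocks[i];

                if (!b->movable || !b->has_free_pages)
                        continue;
                if (!best || b->start_pfn < best->start_pfn)
                        best = b;
        }
        return best;    /* NULL: nothing suitable to steal from */
}

int main(void)
{
        struct pageblock blocks[] = {
                { 4096, true,  true  },
                { 1024, false, true  },
                { 2048, true,  true  },
                { 3072, true,  false },
        };
        struct pageblock *victim = pick_fallback_block(blocks, 4);

        if (victim)
                printf("steal from the block at pfn %lu\n", victim->start_pfn);
        return 0;
}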
> > And it doesn't even have to be a DoS. The natural fragmentation
> > that occurs today in a kernel today has the possibility to slowly push out
> > the movable groups and give you the same situation.
>
> How would you cause that? Say you do want to purposefully place one
> unmovable 4k page into every 64k compund page. So you allocate
> 4K. First 64k page locked. But now, to get 4K into the second 64K page
> you have to first use up all the rest of the first 64k page. Meaning
> one 4k chunk, one 8k chunk, one 16k cunk, one 32k chunk. Only then
> will a new 64k chunk be broken and become locked.
It would be easier early in boot to mmap a large area, fault it
in in virtual address order and then mlock a page every 64K. Early in
the system's lifetime, there will be a rough correlation between physical
and virtual memory.
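For illustration, here is a hedged C sketch of that pattern (map a region,
fault it in in virtual address order, then pin one page per 64K). The 1GB
size is an arbitrary choice for the sketch, and RLIMIT_MEMLOCK would normally
stop an unprivileged process long before every 64K block held a pinned page:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
        size_t len  = 1UL << 30;        /* 1GB region, arbitrary */
        size_t step = 64 * 1024;        /* one pinned page per 64K window */
        size_t off;
        char *p;

        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* Fault the whole region in, in virtual address order. */
        memset(p, 0, len);

        /* Pin the first 4k page of every 64K window. */
        for (off = 0; off < len; off += step)
                if (mlock(p + off, 4096) != 0)
                        perror("mlock");

        pause();        /* keep the pinned pages around */
        return 0;
}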
Without mlock(), the most successful attack will likely mmap() a 60K
region and fault it in as an attempt to get pagetable pages placed in
every 64K region. This strategy would not work with grouping pages by
mobility, though, as it would group the pagetable pages together.
Targeted attacks on grouping pages by mobility are not very easy and
not that interesting either. As Nick suggests, the natural fragmentation
over long periods of time is what is interesting.
> So to get the last 64k chunk used all previous 32k chunks need to be
> blocked and you need to allocate 32k (or less if more is blocked). For
> all previous 32k chunks to be blocked every second 16k needs to be
> blocked. To block the last of those 16k chunks all previous 8k chunks
> need to be blocked and you need to allocate 8k. For all previous 8k
> chunks to be blocked every second 4k page needs to be used. To alloc
> the last of those 4k pages all previous 4k pages need to be used.
>
> So to construct a situation where no continious 64k chunk is free you
> have to allocate <total mem> - 64k - 32k - 16k - 8k - 4k (or there
> about) of memory first. Only then could you free memory again while
> still keeping every 64k page blocked. Does that occur naturally given
> enough ram to start with?
>
I believe it's very difficult to craft an attack that will work in a
short period of time. An attack that worked on 2.6.22 may have
no success on 2.6.23-rc4-mm1, for example, as grouping pages by mobility
makes it exceedingly hard to craft an attack unless the attacker
can mlock large amounts of memory.
>
> Too see how bad fragmentation could be I wrote a little progamm to
> simulate allocations with the following simplified alogrithm:
>
> Memory management:
> - Free pages are kept in buckets, one per order, and sorted by address.
> - alloc() the front page (smallest address) out of the bucket of the
> right order or recursively splits the next higher bucket.
> - free() recursively tries to merge a page with its neighbour and puts
> the result back into the proper bucket (sorted by address).
>
> Allocation and lifetime:
> - Every tick a new page is allocated with random order.
This step in itself is not representative of what happens in the kernel.
The vast vast majority of allocations are order-0. It's a fun analysis
but I'm not sure we can draw any conclusions from it.
Statistical analyses of the buddy algorithm have implied that it doesn't
suffer that badly from external fragmentation, but we know that in practice
things are different. A model is hard because, minimally, the
lifetime of pages varies widely.
> - The order is a triangle distribution with max at 0 (throw 2 dice,
> add the eyes, subtract 7, abs() the number).
> - The page is scheduled to be freed after X ticks. Where X is nearly
> a gaus curve centered at 0 and maximum at <total num pages> * 1.5.
> (What I actualy do is throw 8 dice and sum them up and shift the
> result.)
>
I doubt this is how the kernel behaves either.
> Display:
> I start with a white window. Every page allocation draws a black box
> from the address of the page and as wide as the page is big (-1 pixel to
> give a seperation to the next page). Every page free draws a yellow
> box in place of the black one. Yellow to show where a page was in use
> at one point while white means the page was never used.
>
> As the time ticks the memory fills up. Quickly at first and then comes
> to a stop around 80% filled. And then something interesting
> happens. The yellow regions (previously used but now free) start
> drifting up. Small pages tend to end up in the lower addresses and big
> pages at the higher addresses. The memory defragments itself to some
> degree.
>
> http://mrvn.homeip.net/fragment/
>
> Simulating 256MB ram and after 1472943 ticks and 530095 4k, 411841 8k,
> 295296 16k, 176647 32k and 59064 64k allocations you get this:
> http://mrvn.homeip.net/fragment/256mb.png
>
> Simulating 1GB ram and after 5881185 ticks and 2116671 4k, 1645957
> 8k, 1176994 16k, 705873 32k and 235690 64k allocations you get this:
> http://mrvn.homeip.net/fragment/1gb.png
>
These types of pictures feel somewhat familiar
(http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg).
--
Mel Gorman
On Fri, 14 Sep 2007, Nick Piggin wrote:
> > > [*] ok, this isn't quite true because if you can actually put a hard
> > > limit on unmovable allocations then anti-frag will fundamentally help --
> > > get back to me on that when you get patches to move most of the obvious
> > > ones.
> >
> > We have this hard limit using ZONE_MOVABLE in 2.6.23.
>
> So we're back to 2nd class support.
2nd class support for me means a feature that is not enabled by default
but that can be enabled in order to increase performance. The 2nd class
support is there because we are not yet sure about the maturity of the
memory allocation methods.
> > Reserve pools as handled (by the not yet available) large page pool
> > patches (which again has altogether another purpose) are not a limit. The
> > reserve pools are used to provide a mininum of higher order pages that is
> > not broken down in order to insure that a mininum number of the desired
> > order of pages is even available in your worst case scenario. Mainly I
> > think that is needed during the period when memory defragmentation is
> > still under development.
>
> fsblock doesn't need any of those hacks, of course.
Nor does mine for the low orders that we are considering. For order >
MAX_ORDER this is unavoidable since the page allocator cannot manage such
large pages. It can be used for lower order if there are issues (that I
have not seen yet).
On Fri, 14 Sep 2007, Nick Piggin wrote:
> However fsblock can do everything that higher order pagecache can
> do in terms of avoiding vmap and giving contiguous memory to block
> devices by opportunistically allocating higher orders of pages, and falling
> back to vmap if they cannot be satisfied.
fsblock is restricted to the page cache and cannot be used in other
contexts where subsystems can benefit from larger linear memory.
> So if you argue that vmap is a downside, then please tell me how you
> consider the -ENOMEM of your approach to be better?
That is again pretty undifferentiated. Are we talking about low page
orders? There we will reclaim all of the reclaimable memory before getting
an -ENOMEM. Given the quantities of pages on todays machine--a 1 G machine
has 256 milllion 4k pages--and the unmovable ratios we see today it
would require a very strange setup to get an allocation failure while
still being able to allocate order 0 pages.
With ZONE_MOVABLE you can confine the unmovable objects to a defined
pool; higher order success rates then become reasonable.
> If, by special software layer, you mean the vmap/vunmap support in
> fsblock, let's see... that's probably all of a hundred or two lines.
> Contrast that with anti-fragmentation, lumpy reclaim, higher order
> pagecache and its new special mmap layer... Hmm, seems like a no
> brainer to me. You really still want to persue the "extra layer"
> argument as a point against fsblock here?
Yes, sure. Your code could not live without these approaches. Without the
antifragmentation measures your fsblock code would not be very successful
in getting the larger contiguous segments you need to improve performance.
(There is no new mmap layer; the higher order pagecache is simply the old
API with set_blocksize expanded.)
> Of course I wouldn't state that. On the contrary, I categorically state that
> I have already solved it :)
Well then I guess that you have not read the requirements...
> > Because it has already been rejected in another form and adds more
>
> You have rejected it. But they are bogus reasons, as I showed above.
That's not me. I am working on this because many of the filesystem people
have repeatedly asked me to do this. I am no expert on filesystems.
> You also describe some other real (if lesser) issues like number of page
> structs to manage in the pagecache. But this is hardly enough to reject
> my patch now... for every downside you can point out in my approach, I
> can point out one in yours.
>
> - fsblock doesn't require changes to virtual memory layer
Therefore it is not a generic change but one specific to the block layer. So
other subsystems still have to deal with the single-page issues on
their own.
> > Maybe we coud get to something like a hybrid that avoids some of these
> > issues? Add support so something like a virtual compound page can be
> > handled transparently in the filesystem layer with special casing if
> > such a beast reaches the block layer?
>
> That's conceptually much worse, IMO.
Why? It is the same approach that you use. If it is barely ever used and
satisfies your concern, then I am fine with it.
> And practically worse as well: vmap space is limited on 32-bit; fsblock
> approach can avoid vmap completely in many cases; for two reasons.
>
> The fsblock data accessor APIs aren't _that_ bad changes. They change
> zero conceptually in the filesystem, are arguably cleaner, and can be
> essentially nooped if we wanted to stay with a b_data type approach
> (but they give you that flexibility to replace it with any implementation).
The largeblock changes are generic. They improve general handling of
compound pages, they make the existing APIs work for large units of
memory, they are not adding additional new API layers.
On Fri, 14 Sep 2007, Christoph Lameter wrote:
> an -ENOMEM. Given the quantities of pages on todays machine--a 1 G machine
s/1G/1T/ Sigh.
> has 256 milllion 4k pages--and the unmovable ratios we see today it
256k for 1G.
Mel Gorman <[email protected]> writes:
> On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
>> Nick Piggin <[email protected]> writes:
>>
>> > In my attack, I cause the kernel to allocate lots of unmovable allocations
>> > and deplete movable groups. I theoretically then only need to keep a
>> > small number (1/2^N) of these allocations around in order to DoS a
>> > page allocation of order N.
>>
>> I'm assuming that when an unmovable allocation hijacks a movable group
>> any further unmovable alloc will evict movable objects out of that
>> group before hijacking another one. right?
>>
>
> No eviction takes place. If an unmovable allocation gets placed in a
> movable group, then steps are taken to ensure that future unmovable
> allocations will take place in the same range (these decisions take
> place in __rmqueue_fallback()). When choosing a movable block to
> pollute, it will also choose the lowest possible block in PFN terms to
> steal so that fragmentation pollution will be as confined as possible.
> Evicting the unmovable pages would be one of those expensive steps that
> have been avoided to date.
But then you can have all blocks filled with movable data, free 4K in
one group, allocate 4K unmovable to take over the group, free 4k in
the next group, take that group, and so on. You can easily end up with 4k
unmovable in every 64k by accident.
There should be a lot of pressure for movable objects to vacate a
mixed group or you do get fragmentation catastrophes. Looking at my
little test program, evicting movable objects from a mixed group should
not be that expensive as it doesn't happen often. The cost of it
should be freeing some pages (or finding free ones in a movable group)
and then a memcpy. With my simplified simulation it never happens, so I
expect it to only happen when the working set changes.
>> > And it doesn't even have to be a DoS. The natural fragmentation
>> > that occurs today in a kernel today has the possibility to slowly push out
>> > the movable groups and give you the same situation.
>>
>> How would you cause that? Say you do want to purposefully place one
>> unmovable 4k page into every 64k compund page. So you allocate
>> 4K. First 64k page locked. But now, to get 4K into the second 64K page
>> you have to first use up all the rest of the first 64k page. Meaning
>> one 4k chunk, one 8k chunk, one 16k cunk, one 32k chunk. Only then
>> will a new 64k chunk be broken and become locked.
>
> It would be easier early in the boot to mmap a large area and fault it
> in in virtual address order then mlock every a page every 64K. Early in
> the systems lifetime, there will be a rough correlation between physical
> and virtual memory.
>
> Without mlock(), the most successful attack will like mmap() a 60K
> region and fault it in as an attempt to get pagetable pages placed in
> every 64K region. This strategy would not work with grouping pages by
> mobility though as it would group the pagetable pages together.
But even with mlock the virtual pages should still be movable. So if
you evict movable objects from a mixed group when needed, all the
pagetable pages would end up in the same mixed group, slowly taking it
over completely. No fragmentation at all. See how essential that
feature is. :)
> Targetted attacks on grouping pages by mobility are not very easy and
> not that interesting either. As Nick suggests, the natural fragmentation
> over long periods of time is what is interesting.
>
>> So to get the last 64k chunk used all previous 32k chunks need to be
>> blocked and you need to allocate 32k (or less if more is blocked). For
>> all previous 32k chunks to be blocked every second 16k needs to be
>> blocked. To block the last of those 16k chunks all previous 8k chunks
>> need to be blocked and you need to allocate 8k. For all previous 8k
>> chunks to be blocked every second 4k page needs to be used. To alloc
>> the last of those 4k pages all previous 4k pages need to be used.
>>
>> So to construct a situation where no continious 64k chunk is free you
>> have to allocate <total mem> - 64k - 32k - 16k - 8k - 4k (or there
>> about) of memory first. Only then could you free memory again while
>> still keeping every 64k page blocked. Does that occur naturally given
>> enough ram to start with?
>>
>
> I believe it's very difficult to craft an attack that will work in a
> short period of time. An attack that worked on 2.6.22 as well may have
> no success on 2.6.23-rc4-mm1 for example as grouping pages by mobility
> does it make it exceedingly hard to craft an attack unless the attacker
> can mlock large amounts of memory.
>
>>
>> Too see how bad fragmentation could be I wrote a little progamm to
>> simulate allocations with the following simplified alogrithm:
>>
>> Memory management:
>> - Free pages are kept in buckets, one per order, and sorted by address.
>> - alloc() the front page (smallest address) out of the bucket of the
>> right order or recursively splits the next higher bucket.
>> - free() recursively tries to merge a page with its neighbour and puts
>> the result back into the proper bucket (sorted by address).
>>
>> Allocation and lifetime:
>> - Every tick a new page is allocated with random order.
>
> This step in itself is not representative of what happens in the kernel.
> The vast vast majority of allocations are order-0. It's a fun analysis
> but I'm not sure can we draw any conclusions from it.
I skewed the distribution to that end. Maybe not enough, but I wanted
to get quite a few large pages. Also I'm only simulating the
unmovable objects, and I would expect the I/O layer to make a lot of
higher order allocs/frees to benefit from compound pages. If nobody
uses them, then what is the point of adding them?
> Statistical analysis of the buddy algorithm have implied that it doesn't
> suffer that badly from external fragmentation but we know in practice
> that things are different. A model is hard because minimally the
> lifetime of pages varies widely.
>
>> - The order is a triangle distribution with max at 0 (throw 2 dice,
>> add the eyes, subtract 7, abs() the number).
>> - The page is scheduled to be freed after X ticks. Where X is nearly
>> a gaus curve centered at 0 and maximum at <total num pages> * 1.5.
>> (What I actualy do is throw 8 dice and sum them up and shift the
>> result.)
>>
>
> I doubt this is how the kernel behaves either.
I had to pick something. I agree that the lifetime part is the hardest
to simulate and the point where an attack would start.
>> Display:
>> I start with a white window. Every page allocation draws a black box
>> from the address of the page and as wide as the page is big (-1 pixel to
>> give a seperation to the next page). Every page free draws a yellow
>> box in place of the black one. Yellow to show where a page was in use
>> at one point while white means the page was never used.
>>
>> As the time ticks the memory fills up. Quickly at first and then comes
>> to a stop around 80% filled. And then something interesting
>> happens. The yellow regions (previously used but now free) start
>> drifting up. Small pages tend to end up in the lower addresses and big
>> pages at the higher addresses. The memory defragments itself to some
>> degree.
>>
>> http://mrvn.homeip.net/fragment/
>>
>> Simulating 256MB ram and after 1472943 ticks and 530095 4k, 411841 8k,
>> 295296 16k, 176647 32k and 59064 64k allocations you get this:
>> http://mrvn.homeip.net/fragment/256mb.png
>>
>> Simulating 1GB ram and after 5881185 ticks and 2116671 4k, 1645957
>> 8k, 1176994 16k, 705873 32k and 235690 64k allocations you get this:
>> http://mrvn.homeip.net/fragment/1gb.png
>>
>
> These type of pictures feel somewhat familiar
> (http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg).
Those look a lot better and look like they are actually real kernel
data. How did you make them, and can one create them in real time (say
once a second or so)?
There seems to be an awful lot of pinned pages in between the movable ones.
I would very much like to see the same data with eviction of movable
pages out of mixed groups. I don't see a single movable group, while with
strict eviction there could be at most one mixed group per order.
MfG
Goswin
Christoph Lameter <[email protected]> writes:
> On Fri, 14 Sep 2007, Christoph Lameter wrote:
>
>> an -ENOMEM. Given the quantities of pages on todays machine--a 1 G machine
>
> s/1G/1T/ Sigh.
>
>> has 256 milllion 4k pages--and the unmovable ratios we see today it
>
> 256k for 1G.
256k == 64 pages for 1GB ram or 256k pages == 1Mb?
MfG
Goswin
On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[email protected]> wrote:
> While I agree with your concern, those numbers are quite silly. The
> chances of 99.8% of pages being free and the remaining 0.2% being
> perfectly spread across all 2MB large_pages are lower than those of SHA1
> creating a collision.
Actually it'd be pretty easy to craft an application which allocates seven
pages for pagecache, then one for <something>, then seven for pagecache, then
one for <something>, etc.
I've had test apps which do that sort of thing accidentally. The result
wasn't pretty.
Andrew Morton <[email protected]> writes:
> On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[email protected]> wrote:
>
>> While I agree with your concern, those numbers are quite silly. The
>> chances of 99.8% of pages being free and the remaining 0.2% being
>> perfectly spread across all 2MB large_pages are lower than those of SHA1
>> creating a collision.
>
> Actually it'd be pretty easy to craft an application which allocates seven
> pages for pagecache, then one for <something>, then seven for pagecache, then
> one for <something>, etc.
>
> I've had test apps which do that sort of thing accidentally. The result
> wasn't pretty.
Except that the application's 7 pages are movable and the <something>
would have to be unmovable. And then they should not share the same
memory region. At least they should never be allowed to interleave in
such a pattern on a larger scale.
The only way a fragmentation catastrophe can be (provably) avoided is
by having so few unmovable objects that size + max waste << ram
size. The smaller the better. Allowing movable and unmovable objects
to mix means that max waste goes way up. In your example the waste would
be 7*size. With a 2MB upper order limit it would be 511*size.
I keep coming back to the fact that movable objects should be moved
out of the way for unmovable ones. Anything else just allows
fragmentation to build up.
MfG
Goswin
On Sat, Sep 15, 2007 at 02:14:42PM +0200, Goswin von Brederlow wrote:
> I keep coming back to the fact that movable objects should be moved
> out of the way for unmovable ones. Anything else just allows
That's incidentally exactly what the slab does, no need to reinvent
the wheel for that, it's an old problem and there's room for
optimization in the slab partial-reuse logic too. Just boost the order
0 page size and use the slab to get the 4k chunks. The sgi/defrag
design is backwards.
Andrea Arcangeli <[email protected]> writes:
> On Sat, Sep 15, 2007 at 02:14:42PM +0200, Goswin von Brederlow wrote:
>> I keep coming back to the fact that movable objects should be moved
>> out of the way for unmovable ones. Anything else just allows
>
> That's incidentally exactly what the slab does, no need to reinvent
> the wheel for that, it's an old problem and there's room for
> optimization in the slab partial-reuse logic too. Just boost the order
> 0 page size and use the slab to get the 4k chunks. The sgi/defrag
> design is backwards.
How does that help? Will slabs move objects around to combine two
partially filled slabs into a nearly full one? If not, consider this:
- You create a slab for 4k objects based on 64k compound pages.
(first of all that already wastes you a page for the meta info)
- Something movable allocates 14 4k pages in there, making the slab
partially filled.
- Something unmovable allocates a 4k page, making the slab mixed and
full.
- Repeat until out of memory.
OR
- Userspace allocates a lot of memory in those slabs.
- Userspace frees one in every 15 4k chunks.
- Userspace forks 1000 times causing an unmovable task structure to
appear in 1000 slabs.
MfG
Goswin
On Sat, Sep 15, 2007 at 10:14:44PM +0200, Goswin von Brederlow wrote:
> How does that help? Will slabs move objects around to combine two
1. It helps provide a few guarantees: when you run "/usr/bin/free"
you won't get a random number, but a strong _guarantee_. That ram will
be available no matter what.
With variable order page size you may run oom by mlocking some half
free ram in pagecache backed by largepages. "free" becomes a fake
number provided by a weak design.
Apps and admins need to know for sure how much ram is available, to be
able to fine-tune the workload to avoid running into swap while
using all available ram at the same time.
> partially filled slabs into nearly full one? If not consider this:
2. Yes, slab can indeed be freed to release an excessive number of 64k
pages pinned by an insignificant number of small objects. I already
told Mel, even at the VM summit, that the slab defrag can pay off
regardless, and this is nothing new, since it will pay off even
today with 2.6.23 with regard to kmalloc(32).
> - You create a slab for 4k objects based on 64k compound pages.
> (first of all that wastes you a page already for the meta infos)
There's not just one 4k object in the system... The whole point is to
make sure all those 4k objects go into the same 64k page. This way,
for you to be able to reproduce Nick's worst case scenario, you have to
allocate total_ram/4k objects, each 4k large...
> - Something movable allocates a 14 4k page in there making the slab
> partially filled.
Movable? I rather assume all slab allocations aren't movable. Then
slab defrag can try to tackle users like dcache and inodes. Keep in
mind that with the exception of updatedb, those inodes/dentries will
be pinned and you won't move them, which is why I prefer to consider
them not movable too... since there's no guarantee they are.
> - Something unmovable alloactes a 4k page making the slab mixed and
> full.
The entire slab being full is a perfect scenario. It means zero memory
waste, it's actually the ideal scenario, I can't follow your logic...
> - Repeat until out of memory.
for(;;) kmalloc(32); is supposed to run oom, no breaking news here...
> - Userspace allocates a lot of memory in those slabs.
If with slabs you mean slab/slub, I can't follow, there has never been
a single byte of userland memory allocated there since ever the slab
existed in linux.
> - Userspace frees one in every 15 4k chunks.
I guess you're confusing the config-page-shift design with the sgi
design where userland memory gets mixed with slab entries in the same
64k page... Also with config-page-shift the userland pages will all be
64k.
Things will get more complicated if we later decide to allow
kmalloc(4k) pagecache to be mapped in userland instead of only being
available for reads. But then we can restrict that to a slab and to
make it relocatable by following the ptes. That will complicate things
a lot.
But the whole point is that you don't need all that complexity,
and that as long as you're ok to lose some memory, you will get a
strong guarantee when "free" tells you 1G is free or available as
cache.
> - Userspace forks 1000 times causing an unmovable task structure to
> appear in 1000 slabs.
If 1000 kmem_cache_alloc(kernel_stack) calls in a row will keep 1000
64k slab pages pinned, it means there must previously have been at least
64k/8k*1000 = 8000 simultaneous tasks allocated at once, not just your 1000
forks.
Even if "free" saying there's 1G free wouldn't be a 100% strong
guarantee, and even if the slab didn't provide strong defrag
avoidance guarantees by design, splitting pages down in the core and
then merging them up outside the core sounds less efficient than
keeping the pages large in the core and then splitting them outside
the core for the few non-performance-critical small users. We're not
talking about laptops here; if the major load happens on tiny things
and tiny objects, nobody should compile a kernel with 64k page size,
which is why there need to be two RPMs to get peak performance.
Andrea Arcangeli <[email protected]> writes:
> On Sat, Sep 15, 2007 at 10:14:44PM +0200, Goswin von Brederlow wrote:
>> - Userspace allocates a lot of memory in those slabs.
>
> If with slabs you mean slab/slub, I can't follow, there has never been
> a single byte of userland memory allocated there since ever the slab
> existed in linux.
This and other comments in your reply show me that you completely
misunderstood what I was talking about.
Look at
http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg
The red dots (pinned) are dentries, page tables, kernel stacks,
whatever kernel stuff, right?
The green dots (movable) are mostly userspace pages being mapped
there, right?
What I was referring to is that because movable objects (green dots)
aren't moved out of a mixed group (the boxes) when some unmovable
object needs space all the groups become mixed over time. That means
the unmovable objects are spread out over all the ram and the buddy
system can't recombine regions when unmovable objects free them. There
will nearly always be some movable objects in the other buddy. The
system of having unmovable and movable groups breaks down and becomes
useless.
I'm assuming here that we want the possibility of larger order pages
for unmovable objects (large contiguous regions for DMA, for example)
than the smallest order user space gets (or any movable object). If
mmap() still works on 4k page boundaries then those will fragment all
regions into 4k chunks in the worst case.
Obviously if userspace has a minimum order of 64k chunks then it will
never break any region smaller than 64k chunks and will never cause a
fragmentation catastrophe. I know that is very roughly your approach
(make order 0 bigger), and I like it, but it has some limits as to how
big you can make it. I don't think my system with 1GB ram would work
so well with 2MB order 0 pages. But I wasn't referring to that but to
the picture.
MfG
Goswin
On Sun, Sep 16, 2007 at 03:54:56PM +0200, Goswin von Brederlow wrote:
> Andrea Arcangeli <[email protected]> writes:
>
> > On Sat, Sep 15, 2007 at 10:14:44PM +0200, Goswin von Brederlow wrote:
> >> - Userspace allocates a lot of memory in those slabs.
> >
> > If with slabs you mean slab/slub, I can't follow, there has never been
> > a single byte of userland memory allocated there since ever the slab
> > existed in linux.
>
> This and other comments in your reply show me that you completly
> misunderstood what I was talking about.
>
> Look at
> http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg
What does the large square represent here? A "largepage"? If yes,
which order? There seem to be quite some pixels in each square...
> The red dots (pinned) are dentries, page tables, kernel stacks,
> whatever kernel stuff, right?
>
> The green dots (movable) are mostly userspace pages being mapped
> there, right?
If the largepage is the square, there can't be red pixels mixed with
green pixels with the config-page-shift design, this is the whole
difference...
Zooming in, I see red pixels all over the squares mixed with green
pixels in the same square. This is exactly what happens with the
variable order page cache and that's why it provides zero guarantees
in terms of how much ram is really "free" (free as in "available").
> What I was refering too is that because movable objects (green dots)
> aren't moved out of a mixed group (the boxes) when some unmovable
> object needs space all the groups become mixed over time. That means
> the unmovable objects are spread out over all the ram and the buddy
> system can't recombine regions when unmovable objects free them. There
> will nearly always be some movable objects in the other buddy. The
> system of having unmovable and movable groups breaks down and becomes
> useless.
If I understood correctly, here you agree that mixing movable and
unmovable objects in the same largepage is a bad thing, and that's
incidentally what config-page-shift prevents. It avoids it instead of
undoing the mixture later with defrag when it's far too late for
anything but updatedb.
> I'm assuming here that we want the possibility of larger order pages
> for unmovable objects (large continiuos regions for DMA for example)
> than the smallest order user space gets (or any movable object). If
> mmap() still works on 4k page bounaries then those will fragment all
> regions into 4k chunks in the worst case.
With config-page-shift mmap works on 4k chunks but it's always backed
by 64k or any other largesize that you chose at compile time. And if
the virtual alignment of mmap matches the physical alignment of the
physical largepage and is >= PAGE_SIZE (software PAGE_SIZE I mean) we
could use the 62nd bit of the pte to use a 64k tlb (if future cpus
will allow that). Nick also suggested to still set all ptes equal to
make life easier for the tlb miss microcode.
> Obviously if userspace has a minimum order of 64k chunks then it will
> never break any region smaller than 64k chunks and will never cause a
> fragmentation catastroph. I know that is verry roughly your aproach
> (make order 0 bigger), and I like it, but it has some limits as to how
Yep, exactly this is what happens, it avoids that trouble. But as far
as fragmentation guarantees go, it's really more about keeping the
unmovable out of our way (instead of spreading the unmovable all over
the buddy randomly, or with ugly
boot-time-fixed-numbers-memory-reservations) than about mapping largepages in
userland. In fact, as I said, we could map kmalloced 4k entries in
userland to save memory if we would really want to hurt the fast paths
to make a generic kernel to use on smaller systems, but that would be
very complex. Since those 4k entries would be 100% movable (not like
the rest of the slab, like dentries and inodes etc..) that wouldn't
make the design less reliable, it'd still be 100% reliable and
performance would be ok because that memory is userland memory; we have
to set the pte anyway, regardless of whether it's a 4k page or a largepage.
> big you can make it. I don't think my system with 1GB ram would work
> so well with 2MB order 0 pages. But I wasn't refering to that but to
> the picture.
Sure! 2M is surely way excessive for a 1G system, 64k most certainly
too, of course unless you're running a db or a multimedia streaming
service, in which case it should be ideal.
On Sun, 16 September 2007 00:30:32 +0200, Andrea Arcangeli wrote:
>
> Movable? I rather assume all slab allocations aren't movable. Then
> slab defrag can try to tackle on users like dcache and inodes. Keep in
> mind that with the exception of updatedb, those inodes/dentries will
> be pinned and you won't move them, which is why I prefer to consider
> them not movable too... since there's no guarantee they are.
I have been toying with the idea of having separate caches for pinned
and movable dentries. Downside of such a patch would be the number of
memcpy() operations when moving dentries from one cache to the other.
Upside is that a fair amount of slab cache can be made movable.
memcpy() is still faster than reading an object from disk.
Most likely the current reaction to such a patch would be to shoot it
down due to overhead, so I didn't pursue it. All I have is an old patch
to separate never-cached from possibly-cached dentries. It will
increase the odds of freeing a slab, but provide no guarantee.
But the point here is: dentries/inodes can be made movable if there are
clear advantages to it. Maybe they should?
Jörn
--
Joern's library part 2:
http://www.art.net/~hopkins/Don/unix-haters/tirix/embarrassing-memo.html
On Sat, 15 September 2007 01:44:49 -0700, Andrew Morton wrote:
> On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[email protected]> wrote:
>
> > While I agree with your concern, those numbers are quite silly. The
> > chances of 99.8% of pages being free and the remaining 0.2% being
> > perfectly spread across all 2MB large_pages are lower than those of SHA1
> > creating a collision.
>
> Actually it'd be pretty easy to craft an application which allocates seven
> pages for pagecache, then one for <something>, then seven for pagecache, then
> one for <something>, etc.
>
> I've had test apps which do that sort of thing accidentally. The result
> wasn't pretty.
I bet! My (false) assumption was the same as Goswin's. If non-movable
pages are clearly separated from movable ones and will evict movable
ones before polluting further mixed superpages, Nick's scenario would be
nearly infinitely impossible.
The assumption doesn't reflect current code. Enforcing this assumption
would cost extra overhead. The amount of effort to make Christoph's
approach work reliably seems substantial and I have no idea whether it
would be worth it.
Jörn
--
Happiness isn't having what you want, it's wanting what you have.
-- unknown
On (15/09/07 14:14), Goswin von Brederlow didst pronounce:
> Andrew Morton <[email protected]> writes:
>
> > On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[email protected]> wrote:
> >
> >> While I agree with your concern, those numbers are quite silly. The
> >> chances of 99.8% of pages being free and the remaining 0.2% being
> >> perfectly spread across all 2MB large_pages are lower than those of SHA1
> >> creating a collision.
> >
> > Actually it'd be pretty easy to craft an application which allocates seven
> > pages for pagecache, then one for <something>, then seven for pagecache, then
> > one for <something>, etc.
> >
> > I've had test apps which do that sort of thing accidentally. The result
> > wasn't pretty.
>
> Except that the applications 7 pages are movable and the <something>
> would have to be unmovable. And then they should not share the same
> memory region. At least they should never be allowed to interleave in
> such a pattern on a larger scale.
>
It is actually really easy to force regions to never share. At the
moment, there is a fallback list that determines a preference for what
block to mix.
The reason why this isn't enforced is the cost of moving. On x86 and
x86_64, a block of interest is usually 2MB or 4MB. Clearing out one of
those pages to prevent any mixing would be bad enough. On PowerPC, it's
potentially 16MB. On IA64, it's 1GB.
As this was fragmentation avoidance, not guarantees, the decision was
made to not strictly enforce the types of pages within a block as the
cost cannot be made back unless the system was making aggressive use of
large pages. This is not the case with Linux.
> The only way a fragmentation catastroph can be (proovable) avoided is
> by having so few unmovable objects that size + max waste << ram
> size. The smaller the better. Allowing movable and unmovable objects
> to mix means that max waste goes way up. In your example waste would
> be 7*size. With 2MB uper order limit it would be 511*size.
>
> I keep coming back to the fact that movable objects should be moved
> out of the way for unmovable ones. Anything else just allows
> fragmentation to build up.
>
This is easily achieved, just really really expensive because of the
amount of copying that would have to take place. It would also compel
that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely
MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is
a lot of free memory to keep around which is why fragmentation avoidance
doesn't do it.
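Just to put a rough number on that, here is a back-of-the-envelope sketch in
C. The 2MB pageblock and the count of five migrate types are assumptions for
illustration only, not authoritative values:

#include <stdio.h>

int main(void)
{
        /* Illustrative values only: a 2MB pageblock of 4k pages and
         * five migrate types are assumptions, not authoritative. */
        long page_size          = 4096;
        long pageblock_nr_pages = 2L * 1024 * 1024 / page_size;    /* 512 */
        long migrate_types      = 5;

        long one_block_kb = pageblock_nr_pages * page_size / 1024;
        long all_types_kb = migrate_types * one_block_kb;

        printf("min_free_kbytes >= %ld kB (one block), likely %ld kB\n",
               one_block_kb, all_types_kb);
        return 0;
}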
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On (15/09/07 17:51), Andrea Arcangeli didst pronounce:
> On Sat, Sep 15, 2007 at 02:14:42PM +0200, Goswin von Brederlow wrote:
> > I keep coming back to the fact that movable objects should be moved
> > out of the way for unmovable ones. Anything else just allows
>
> That's incidentally exactly what the slab does, no need to reinvent
> the wheel for that, it's an old problem and there's room for
> optimization in the slab partial-reuse logic too.
Except now, as I've repeatedly pointed out, you have internal fragmentation
problems. If we went with the SLAB, we would need 16MB slabs on PowerPC for
example to get the same sort of results and a lot of copying and moving when
a suitable slab page was not available.
> Just boost the order
> 0 page size and use the slab to get the 4k chunks. The sgi/defrag
> design is backwards.
>
Nothing stops you altering the PAGE_SIZE so that large blocks work in
the way you envision and keep grouping pages by mobility for huge page
sizes.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Sun, 16 Sep 2007, Jörn Engel wrote:
>
> I have been toying with the idea of having seperate caches for pinned
> and movable dentries. Downside of such a patch would be the number of
> memcpy() operations when moving dentries from one cache to the other.
Totally inappropriate.
I bet 99% of all "dentry_lookup()" calls involve turning the last dentry
from having a count of zero ("movable") to having a count of 1 ("pinned").
So such an approach would fundamentally be broken. It would slow down all
normal dentry lookups, since the *common* case for leaf dentries is that
they have a zero count.
So it's much better to do it on a "directory/file" basis, on the
assumption that files are *mostly* movable (or just freeable). The fact
that they aren't always (ie while kept open etc), is likely statistically
not all that important.
Linus
On Sun, 16 September 2007 11:15:36 -0700, Linus Torvalds wrote:
> On Sun, 16 Sep 2007, Jörn Engel wrote:
> >
> > I have been toying with the idea of having seperate caches for pinned
> > and movable dentries. Downside of such a patch would be the number of
> > memcpy() operations when moving dentries from one cache to the other.
>
> Totally inappropriate.
>
> I bet 99% of all "dentry_lookup()" calls involve turning the last dentry
> from having a count of zero ("movable") to having a count of 1 ("pinned").
>
> So such an approach would fundamentally be broken. It would slow down all
> normal dentry lookups, since the *common* case for leaf dentries is that
> they have a zero count.
Why am I not surprised? :)
> So it's much better to do it on a "directory/file" basis, on the
> assumption that files are *mostly* movable (or just freeable). The fact
> that they aren't always (ie while kept open etc), is likely statistically
> not all that important.
My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc.
which are pinned for their entire lifetime and another for regular
files/inodes. One could take a three-way approach and have
always-pinned, often-pinned and rarely-pinned.
We won't get never-pinned that way.
Jörn
--
The wise man seeks everything in himself; the ignorant man tries to get
everything from somebody else.
-- unknown
On Sun, 16 Sep 2007, Jörn Engel wrote:
>
> My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc.
> which are pinned for their entire lifetime and another for regular
> files/inodes. One could take a three-way approach and have
> always-pinned, often-pinned and rarely-pinned.
>
> We won't get never-pinned that way.
That sounds pretty good. The problem, of course, is that most of the time,
the actual dentry allocation itself is done before you really know which
case the dentry will be in, and the natural place for actually giving the
dentry lifetime hint is *not* at "d_alloc()", but when we "instantiate"
it with d_add() or d_instantiate().
But it turns out that most of the filesystems we care about already use a
special case of "d_add()" that *already* replaces the dentry with another
one in some cases: "d_splice_alias()".
So I bet that if we just taught "d_splice_alias()" to look at the inode,
and based on the inode just re-allocate the dentry to some other slab
cache, we'd already handle a lot of the cases!
And yes, you'd end up with the reallocation overhead quite often, but at
least it would now happen only when filling in a dentry, not in the
(*much* more critical) cached lookup path.
Linus
On Sun, Sep 16, 2007 at 07:15:04PM +0100, Mel Gorman wrote:
> Except now as I've repeatadly pointed out, you have internal fragmentation
> problems. If we went with the SLAB, we would need 16MB slabs on PowerPC for
> example to get the same sort of results and a lot of copying and moving when
Well not sure about the 16MB number, since I'm unsure what the size of
the ram was. But clearly I agree there are fragmentation issues in the
slab too, there have always been, except they're much less severe, and
the slab is meant to deal with that regardless of the PAGE_SIZE. That
is not a new problem, you are introducing a new problem instead.
We can do a lot better than slab currently does without requiring any
defrag move-or-shrink at all.
slab is trying to defrag memory for small objects at nearly zero cost,
by not giving pages away randomly. I thought you agreed that solving
the slab fragmentation was going to provide better guarantees when in
another email you suggested that you could start allocating order > 0
pages in the slab to reduce the fragmentation (to achieve most of the
guarantee provided by config-page-shift, but while still keeping the
order 0 at 4k for whatever reason I can't see).
> a suitable slab page was not available.
You ignore one other bit, when "/usr/bin/free" says 1G is free, with
config-page-shift it's free no matter what, same goes for not mlocked
cache. With variable order page cache, /usr/bin/free becomes mostly a
lie as long as there's no 4k fallback (like fsblock).
And most important, you're only tackling the pagecache and I/O
performance with the inefficient I/O devices; the whole kernel has no
chance to get a speedup. In fact you're making the fast paths slower,
just the opposite of config-page-shift and original Hugh's large
PAGE_SIZE ;).
On (16/09/07 20:50), Andrea Arcangeli didst pronounce:
> On Sun, Sep 16, 2007 at 07:15:04PM +0100, Mel Gorman wrote:
> > Except now as I've repeatadly pointed out, you have internal fragmentation
> > problems. If we went with the SLAB, we would need 16MB slabs on PowerPC for
> > example to get the same sort of results and a lot of copying and moving when
>
> Well not sure about the 16MB number, since I'm unsure what the size of
> the ram was.
The 16MB is the size of a hugepage, the size of interest as far as I am
concerned. Your idea makes sense for large block support, but much less
for huge pages because you are incurring a cost in the general case for
something that may not be used.
There is nothing to say that both can't be done. Raise the size of
order-0 for large block support and continue trying to group the blocks
to make hugepage allocations likely to succeed during the lifetime of the
system.
> But clearly I agree there are fragmentation issues in the
> slab too, there have always been, except they're much less severe, and
> the slab is meant to deal with that regardless of the PAGE_SIZE. That
> is not a new problem, you are introducing a new problem instead.
>
At the risk of repeating, your approach will be adding a new and
significant dimension to the internal fragmentation problem where a
kernel allocation may fail because the larger order-0 pages are all
being pinned by userspace pages.
> We can do a lot better than slab currently does without requiring any
> defrag move-or-shrink at all.
>
> slab is trying to defrag memory for small objects at nearly zero cost,
> by not giving pages away randomly. I thought you agreed that solving
> the slab fragmentation was going to provide better guarantees when in
It improves the probability of hugepage allocations working because the
blocks with slab pages can be targeted and cleared if necessary.
> another email you suggested that you could start allocating order > 0
> pages in the slab to reduce the fragmentation (to achieve most of the
> guarantee provided by config-page-shift, but while still keeping the
> order 0 at 4k for whatever reason I can't see).
>
That suggestion was aimed at the large block support more than
hugepages. It helps large blocks because we'll be allocating and freeing
at more or less the same size. It certainly is easier to set
slub_min_order to the same size as what is needed for large blocks in
the system than introducing the necessary mechanics to allocate
pagetable pages and userspace pages from slab.
> > a suitable slab page was not available.
>
> You ignore one other bit, when "/usr/bin/free" says 1G is free, with
> config-page-shift it's free no matter what, same goes for not mlocked
> cache. With variable order page cache, /usr/bin/free becomes mostly a
> lie as long as there's no 4k fallback (like fsblock).
>
I'm not sure what you are getting at here. I think it would make more
sense if you said "when you read /proc/buddyinfo, you know the order-0
pages are really free for use with large blocks" with your approach.
> And most important you're only tackling on the pagecache and I/O
> performance with the inefficient I/O devices, the whole kernel has no
> cahnce to get a speedup, infact you're making the fast paths slower,
> just the opposite of config-page-shift and original Hugh's large
> PAGE_SIZE ;).
>
As the kernel pages are all getting grouped together, there are fewer TLB
entries needed to address the kernel's working set, so there is a general
improvement, although how much depends on the workload.
All this aside, there is nothing mutually exclusive between what you are proposing
and what grouping pages by mobility does. Your stuff can exist even if grouping
pages by mobility is in place. In fact, it'll help us by giving an important
comparison point as grouping pages by mobility can be trivially disabled with
a one-liner for the purposes of testing. If your approach is brought to being
a general solution that also helps hugepage allocation, then we can revisit
grouping pages by mobility by comparing kernels with it enabled and without.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On (16/09/07 17:08), Andrea Arcangeli didst pronounce:
> On Sun, Sep 16, 2007 at 03:54:56PM +0200, Goswin von Brederlow wrote:
> > Andrea Arcangeli <[email protected]> writes:
> >
> > > On Sat, Sep 15, 2007 at 10:14:44PM +0200, Goswin von Brederlow wrote:
> > >> - Userspace allocates a lot of memory in those slabs.
> > >
> > > If with slabs you mean slab/slub, I can't follow, there has never been
> > > a single byte of userland memory allocated there since ever the slab
> > > existed in linux.
> >
> > This and other comments in your reply show me that you completly
> > misunderstood what I was talking about.
> >
> > Look at
> > http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg
>
> What does the large square represent here? A "largepage"? If yes,
> which order? There seem to be quite some pixels in each square...
>
Hmm, it was a long time ago. This one looks like 4MB large pages, so
order-10.
> > The red dots (pinned) are dentries, page tables, kernel stacks,
> > whatever kernel stuff, right?
> >
> > The green dots (movable) are mostly userspace pages being mapped
> > there, right?
>
> If the largepage is the square, there can't be red pixels mixed with
> green pixels with the config-page-shift design, this is the whole
> difference...
Yes. I can enforce a similar situation but didn't because the evacuation
costs could not be justified for hugepage allocations. Patches to do such
a thing were prototyped a long time ago and abandoned based on cost. For
large blocks, there may be a justification.
> zooming in I see red pixels all over the squares mized with green
> pixels in the same square. This is exactly what happens with the
> variable order page cache and that's why it provides zero guarantees
> in terms of how much ram is really "free" (free as in "available").
>
This picture is not running grouping pages by mobility, so that is hardly a
surprise; it is what the normal kernel looks like. Look at the videos in
http://www.skynet.ie/~mel/anti-frag/2007-02-28 and see how list-based
compares to vanilla. These are from February when there was less control
over mixing blocks than there is today.
In the current version mixing occurs in the lower blocks as much as possible
not the upper ones. So there are a number of mixed blocks but the number is
kept to a minimum.
The number of mixed blocks could have been enforced as 0, but I felt it was
better in the general case to fragment rather than regress performance.
That may be different for large blocks where you will want to take the
enforcement steps.
> > What I was refering too is that because movable objects (green dots)
> > aren't moved out of a mixed group (the boxes) when some unmovable
> > object needs space all the groups become mixed over time. That means
> > the unmovable objects are spread out over all the ram and the buddy
> > system can't recombine regions when unmovable objects free them. There
> > will nearly always be some movable objects in the other buddy. The
> > system of having unmovable and movable groups breaks down and becomes
> > useless.
>
> If I understood correctly, here you agree that mixing movable and
> unmovable objects in the same largepage is a bad thing, and that's
> incidentally what config-page-shift prevents. It avoids it instead of
> undoing the mixture later with defrag when it's far too late for
> anything but updatedb.
>
We don't take defrag steps at the moment at all. There are memory
compaction patches but I'm not pushing them until we can prove they are
necessary.
> > I'm assuming here that we want the possibility of larger order pages
> > for unmovable objects (large continiuos regions for DMA for example)
> > than the smallest order user space gets (or any movable object). If
> > mmap() still works on 4k page bounaries then those will fragment all
> > regions into 4k chunks in the worst case.
>
> With config-page-shift mmap works on 4k chunks but it's always backed
> by 64k or any other largesize that you choosed at compile time. And if
> the virtual alignment of mmap matches the physical alignment of the
> physical largepage and is >= PAGE_SIZE (software PAGE_SIZE I mean) we
> could use the 62nd bit of the pte to use a 64k tlb (if future cpus
> will allow that). Nick also suggested to still set all ptes equal to
> make life easier for the tlb miss microcode.
>
As I said elsewhere, you can try this easily on top of grouping pages by
mobility. They are not mutually exclusive and you'll have a comparison
point.
> > Obviously if userspace has a minimum order of 64k chunks then it will
> > never break any region smaller than 64k chunks and will never cause a
> > fragmentation catastroph. I know that is verry roughly your aproach
> > (make order 0 bigger), and I like it, but it has some limits as to how
>
> Yep, exactly this is what happens, it avoids that trouble. But as far
> as fragmentation guarantees goes, it's really about keeping the
> unmovable out of our way (instead of spreading the unmovable all over
> the buddy randomly, or with ugly
> boot-time-fixed-numbers-memory-reservations) than to map largepages in
> userland. Infact as I said we could map kmalloced 4k entries in
> userland to save memory if we would really want to hurt the fast paths
> to make a generic kernel to use on smaller systems, but that would be
> very complex. Since those 4k entries would be 100% movable (not like
> the rest of the slab, like dentries and inodes etc..) that wouldn't
> make the design less reliable, it'd still be 100% reliable and
> performance would be ok because that memory is userland memory, we've
> to set the pte anyway, regardless if it's a 4k page or a largepage.
>
Ok, get it implemented then and we'll try it out, because right now we're
just hand-waving and not actually producing anything to compare. It'll be
interesting to see how it works out for large blocks and hugepages (although
I expect the latter to fail unless grouping pages by mobility is in place).
Ideally, they'll complement each other nicely, with mixing only ever taking
place at the 64KB boundary. I have the testing setup necessary for checking
out hugepages at least and I hope to put together something that tests
large blocks as well. Minimally, running the hugepage allocation tests
on a filesystem using large blocks would be a decent starting point.
> > big you can make it. I don't think my system with 1GB ram would work
> > so well with 2MB order 0 pages. But I wasn't refering to that but to
> > the picture.
>
> Sure! 2M is sure way excessive for a 1G system, 64k most certainly
> too, of course unless you're running a db or a multimedia streaming
> service, in which case it should be ideal.
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
> Mel Gorman <[email protected]> writes:
>
> > On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
> >> Nick Piggin <[email protected]> writes:
> >>
> >> > In my attack, I cause the kernel to allocate lots of unmovable allocations
> >> > and deplete movable groups. I theoretically then only need to keep a
> >> > small number (1/2^N) of these allocations around in order to DoS a
> >> > page allocation of order N.
> >>
> >> I'm assuming that when an unmovable allocation hijacks a movable group
> >> any further unmovable alloc will evict movable objects out of that
> >> group before hijacking another one. right?
> >>
> >
> > No eviction takes place. If an unmovable allocation gets placed in a
> > movable group, then steps are taken to ensure that future unmovable
> > allocations will take place in the same range (these decisions take
> > place in __rmqueue_fallback()). When choosing a movable block to
> > pollute, it will also choose the lowest possible block in PFN terms to
> > steal so that fragmentation pollution will be as confined as possible.
> > Evicting the unmovable pages would be one of those expensive steps that
> > have been avoided to date.
>
> But then you can have all blocks filled with movable data, free 4K in
> one group, allocate 4K unmovable to take over the group, free 4k in
> the next group, take that group and so on. You can end with 4k
> unmovable in every 64k easily by accident.
>
As the mixing takes place at the lowest possible block, it's
exceptionally difficult to trigger this. Possible, but exceptionally
difficult.
As I have stated repeatedly, the guarantees can be made but potential
hugepage allocation did not justify it. Large blocks might.
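To make the "lowest possible block" bias concrete, the choice is
conceptually something like the following. This is only a simplified,
untested sketch, not the kernel's actual __rmqueue_fallback() code; the
struct block and pick_fallback_block() names are made up for illustration:

/* One entry per pageblock that still has free pages of the "wrong" type. */
struct block {
        unsigned long start_pfn;
        unsigned long free_pages;
};

/*
 * When an unmovable allocation has to fall back into movable blocks,
 * prefer the candidate with the lowest starting PFN so the pollution
 * stays confined to the bottom of the zone.
 */
static struct block *pick_fallback_block(struct block *blocks, unsigned int n)
{
        struct block *best = NULL;
        unsigned int i;

        for (i = 0; i < n; i++) {
                if (!blocks[i].free_pages)
                        continue;
                if (!best || blocks[i].start_pfn < best->start_pfn)
                        best = &blocks[i];
        }
        return best;
}

The real code works off the buddy free lists rather than an array, but the
ordering preference is the point here.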
> There should be a lot of preassure for movable objects to vacate a
> mixed group or you do get fragmentation catastrophs.
We (Andy Whitcroft and I) did implement something like that. It hooked into
kswapd to clean mixed blocks. If the caller could do the cleaning, it
did the work instead of kswapd.
> Looking at my
> little test program evicting movable objects from a mixed group should
> not be that expensive as it doesn't happen often.
It happens regularly if the size of the block you need to keep clean is
lower than min_free_kbytes. In the case of hugepages, that was always
the case.
> The cost of it
> should be freeing some pages (or finding free ones in a movable group)
> and then memcpy.
Freeing pages is not cheap. Copying pages is cheaper but not cheap.
> With my simplified simulation it never happens so I
> expect it to only happen when the work set changes.
>
> >> > And it doesn't even have to be a DoS. The natural fragmentation
> >> > that occurs today in a kernel today has the possibility to slowly push out
> >> > the movable groups and give you the same situation.
> >>
> >> How would you cause that? Say you do want to purposefully place one
> >> unmovable 4k page into every 64k compund page. So you allocate
> >> 4K. First 64k page locked. But now, to get 4K into the second 64K page
> >> you have to first use up all the rest of the first 64k page. Meaning
> >> one 4k chunk, one 8k chunk, one 16k cunk, one 32k chunk. Only then
> >> will a new 64k chunk be broken and become locked.
> >
> > It would be easier early in the boot to mmap a large area, fault it in
> > in virtual address order and then mlock a page every 64K. Early in the
> > system's lifetime, there will be a rough correlation between physical
> > and virtual memory.
> >
> > Without mlock(), the most successful attack will likely mmap() a 60K
> > region and fault it in as an attempt to get pagetable pages placed in
> > every 64K region. This strategy would not work with grouping pages by
> > mobility though as it would group the pagetable pages together.
>
> But even with mlock the virtual pages should still be movable.
They are. The Memory Compaction patches were able to do the job.
> So if
> you evict movable objects from mixed group when needed all the
> pagetable pages would end up in the same mixed group slowly taking it
> over completly. No fragmentation at all. See how essential that
> feature is. :)
>
To move pages, there must be enough blocks free. That is where
min_free_kbytes had to come in. If you cared only about keeping 64KB
chunks free, it makes sense but it didn't in the context of hugepages.
> > Targetted attacks on grouping pages by mobility are not very easy and
> > not that interesting either. As Nick suggests, the natural fragmentation
> > over long periods of time is what is interesting.
> >
> >> So to get the last 64k chunk used all previous 32k chunks need to be
> >> blocked and you need to allocate 32k (or less if more is blocked). For
> >> all previous 32k chunks to be blocked every second 16k needs to be
> >> blocked. To block the last of those 16k chunks all previous 8k chunks
> >> need to be blocked and you need to allocate 8k. For all previous 8k
> >> chunks to be blocked every second 4k page needs to be used. To alloc
> >> the last of those 4k pages all previous 4k pages need to be used.
> >>
> >> So to construct a situation where no continious 64k chunk is free you
> >> have to allocate <total mem> - 64k - 32k - 16k - 8k - 4k (or there
> >> about) of memory first. Only then could you free memory again while
> >> still keeping every 64k page blocked. Does that occur naturally given
> >> enough ram to start with?
> >>
> >
> > I believe it's very difficult to craft an attack that will work in a
> > short period of time. An attack that worked on 2.6.22 as well may have
> > no success on 2.6.23-rc4-mm1 for example as grouping pages by mobility
> > does make it exceedingly hard to craft an attack unless the attacker
> > can mlock large amounts of memory.
> >
> >>
> >> To see how bad fragmentation could be I wrote a little program to
> >> simulate allocations with the following simplified algorithm:
> >>
> >> Memory management:
> >> - Free pages are kept in buckets, one per order, and sorted by address.
> >> - alloc() the front page (smallest address) out of the bucket of the
> >> right order or recursively splits the next higher bucket.
> >> - free() recursively tries to merge a page with its neighbour and puts
> >> the result back into the proper bucket (sorted by address).
> >>
> >> Allocation and lifetime:
> >> - Every tick a new page is allocated with random order.
> >
> > This step in itself is not representative of what happens in the kernel.
> > The vast vast majority of allocations are order-0. It's a fun analysis
> > but I'm not sure can we draw any conclusions from it.
>
> I skewed the distribution to that end. Maybe not enough but I wanted
> to get quite a big of large pages. Also I'm only simulating the
> unmovable objects and I would expect the I/O layer to make a lot of
> higher order allocs/free to benefit from compound pages. If nobody
> uses them then what is the point adding them?
>
Ok. It's a bit fairer if we pick two orders that are commonly used then:
order-0 for almost everything and order-4 for a hypothetical large block
filesystem that is mounted.
> > Statistical analysis of the buddy algorithm have implied that it doesn't
> > suffer that badly from external fragmentation but we know in practice
> > that things are different. A model is hard because minimally the
> > lifetime of pages varies widely.
> >
> >> - The order is a triangle distribution with max at 0 (throw 2 dice,
> >> add the eyes, subtract 7, abs() the number).
> >> - The page is scheduled to be freed after X ticks. Where X is nearly
> >> a Gauss curve centered at 0 and maximum at <total num pages> * 1.5.
> >> (What I actually do is throw 8 dice and sum them up and shift the
> >> result.)
> >>
> >
> > I doubt this is how the kernel behaves either.
>
> I had to pick something. I agree that the lifetime part is the hardest
> to simulate and the point where an attack would start.
>
That's fair.
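For reference, my reading of the two distributions described above, as an
untested standalone sketch (rand() stands in for the dice, and the lifetime
scaling is only an approximation since the exact shift isn't spelled out):

#include <stdlib.h>

static int d6(void)
{
        return rand() % 6 + 1;
}

/* Order: triangle distribution with its maximum at order 0. */
static int random_order(void)
{
        return abs(d6() + d6() - 7);
}

/*
 * Lifetime: roughly bell-shaped (sum of 8 dice), scaled so that the
 * mean lands near total_pages * 1.5 ticks in the future.
 */
static long random_lifetime(long total_pages)
{
        long sum = 0;
        int i;

        for (i = 0; i < 8; i++)
                sum += d6();
        return sum * total_pages * 3 / (2 * 28); /* the mean of 8d6 is 28 */
}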
> >> Display:
> >> I start with a white window. Every page allocation draws a black box
> >> from the address of the page and as wide as the page is big (-1 pixel to
> >> give a seperation to the next page). Every page free draws a yellow
> >> box in place of the black one. Yellow to show where a page was in use
> >> at one point while white means the page was never used.
> >>
> >> As the time ticks the memory fills up. Quickly at first and then comes
> >> to a stop around 80% filled. And then something interesting
> >> happens. The yellow regions (previously used but now free) start
> >> drifting up. Small pages tend to end up in the lower addresses and big
> >> pages at the higher addresses. The memory defragments itself to some
> >> degree.
> >>
> >> http://mrvn.homeip.net/fragment/
> >>
> >> Simulating 256MB ram and after 1472943 ticks and 530095 4k, 411841 8k,
> >> 295296 16k, 176647 32k and 59064 64k allocations you get this:
> >> http://mrvn.homeip.net/fragment/256mb.png
> >>
> >> Simulating 1GB ram and after 5881185 ticks and 2116671 4k, 1645957
> >> 8k, 1176994 16k, 705873 32k and 235690 64k allocations you get this:
> >> http://mrvn.homeip.net/fragment/1gb.png
> >>
> >
> > These type of pictures feel somewhat familiar
> > (http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg).
>
> Those look a lot better and look like they are actually real kernel
> data. How did you make them and can one create them in real-time (say
> once a second or so)?
>
It's from a real kernel. When I was measuring this stuff, I took a
sample every 2 seconds.
> There seem to be an awfull lot of pinned pages inbetween the movable.
It wasn't grouping by mobility at the time.
> I would verry much like to see the same data with evicting of movable
> pages out of mixed groups. I see not a single movable group while with
> strict eviction there could be at most one mixed group per order.
>
With strict eviction, there would be no mixed blocks period.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Sun, Sep 16, 2007 at 09:54:18PM +0100, Mel Gorman wrote:
> The 16MB is the size of a hugepage, the size of interest as far as I am
> concerned. Your idea makes sense for large block support, but much less
> for huge pages because you are incurring a cost in the general case for
> something that may not be used.
Sorry for the misunderstanding, I totally agree!
> There is nothing to say that both can't be done. Raise the size of
> order-0 for large block support and continue trying to group the block
> to make hugepage allocations likely succeed during the lifetime of the
> system.
Sure, I completely agree.
> At the risk of repeating, your approach will be adding a new and
> significant dimension to the internal fragmentation problem where a
> kernel allocation may fail because the larger order-0 pages are all
> being pinned by userspace pages.
This is exactly correct, some memory will be wasted. It'll reach 0
free memory more quickly depending on which kind of applications are
being run.
> It improves the probabilty of hugepage allocations working because the
> blocks with slab pages can be targetted and cleared if necessary.
Agreed.
> That suggestion was aimed at the large block support more than
> hugepages. It helps large blocks because we'll be allocating and freeing
> as more or less the same size. It certainly is easier to set
> slub_min_order to the same size as what is needed for large blocks in
> the system than introducing the necessary mechanics to allocate
> pagetable pages and userspace pages from slab.
Allocating userpages from slab in 4k chunks with a 64k PAGE_SIZE is
really complex indeed. I'm not planning for that in the short term but
it remains a possibility to make the kernel more generic. Perhaps it
could be worth it...
Allocating ptes from slab is fairly simple but I think it would be
better to allocate ptes in PAGE_SIZE (64k) chunks and preallocate the
nearby ptes in the per-task local pagetable tree, to reduce the number
of locks taken and to not enter the slab at all for that. In fact we
could allocate the 4 levels (or anyway more than one level) in one
single alloc_pages(0) and track the leftovers in the mm (or similar).
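Roughly this idea, as an untested sketch written in userspace terms; the
pt_cache/pt_alloc names are made up and aligned_alloc() stands in for a
single alloc_pages(0):

#include <stdlib.h>

#define LARGE_PAGE_SIZE (64 * 1024)     /* software PAGE_SIZE */
#define PT_PIECE_SIZE   (4 * 1024)      /* one hardware pagetable page */

/* Hypothetical per-mm cache of leftover pagetable pieces. */
struct pt_cache {
        char *chunk;            /* one alloc_pages(0) worth of memory */
        unsigned int used;      /* pieces already handed out */
};

/*
 * Hand out 4k pagetable pieces carved from one 64k allocation, so
 * several levels can be filled from a single order-0 page and the
 * leftovers tracked in the mm instead of going through slab.
 */
static void *pt_alloc(struct pt_cache *c)
{
        if (!c->chunk || c->used == LARGE_PAGE_SIZE / PT_PIECE_SIZE) {
                c->chunk = aligned_alloc(PT_PIECE_SIZE, LARGE_PAGE_SIZE);
                if (!c->chunk)
                        return NULL;
                c->used = 0;
        }
        return c->chunk + (c->used++) * PT_PIECE_SIZE;
}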
> I'm not sure what you are getting at here. I think it would make more
> sense if you said "when you read /proc/buddyinfo, you know the order-0
> pages are really free for use with large blocks" with your approach.
I'm unsure who reads /proc/buddyinfo (its contents can change a lot and
are not very significant if the vm can defrag well inside the reclaim
code), but it's not much different; it's more about knowing the real
meaning of /proc/meminfo: freeable (unmapped) cache, anon ram, and free
memory.
The idea is that for an mmap over a large xfs file to succeed with
mlockall invoked, those largepages must become available or it'll be oom
despite there still being 512M free... I'm quite sure admins will get
confused if the oom killer is invoked with lots of ram still free.
The overcommit feature will also break, just to make an example (so
much for overcommit 2 guaranteeing -ENOMEM retvals instead of oom
killage ;).
> All this aside, there is nothing mutually exclusive with what you are proposing
> and what grouping pages by mobility does. Your stuff can exist even if grouping
> pages by mobility is in place. In fact, it'll help us by giving an important
> comparison point as grouping pages by mobility can be trivially disabled with
> a one-liner for the purposes of testing. If your approach is brought to being
> a general solution that also helps hugepage allocation, then we can revisit
> grouping pages by mobility by comparing kernels with it enabled and without.
Yes, I totally agree. It sounds worthwhile to have a good defrag logic
in the VM. Even allocating the kernel stack in today's kernels should be
able to benefit from your work. It's just comparing a fork() failure
(something that will happen with ulimit -n too and that apps must be
able to deal with) with an I/O failure that worries me a bit. I'm
quite sure a db failing I/O will not recover too nicely. If fork fails
that's most certainly ok... at worst a new client won't be able to
connect and he can retry later. Plus order 1 isn't really a big deal,
you know the probability of success decreases exponentially with the
order.
On (16/09/07 19:53), Jörn Engel didst pronounce:
> On Sat, 15 September 2007 01:44:49 -0700, Andrew Morton wrote:
> > On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[email protected]> wrote:
> >
> > > While I agree with your concern, those numbers are quite silly. The
> > > chances of 99.8% of pages being free and the remaining 0.2% being
> > > perfectly spread across all 2MB large_pages are lower than those of SHA1
> > > creating a collision.
> >
> > Actually it'd be pretty easy to craft an application which allocates seven
> > pages for pagecache, then one for <something>, then seven for pagecache, then
> > one for <something>, etc.
> >
> > I've had test apps which do that sort of thing accidentally. The result
> > wasn't pretty.
>
> I bet! My (false) assumption was the same as Goswin's. If non-movable
> pages are clearly seperated from movable ones and will evict movable
> ones before polluting further mixed superpages, Nick's scenario would be
> nearly infinitely impossible.
>
It would be plain impossible from a fragmentation point-of-view but you
meet interesting situations when a GFP_NOFS allocation has no kernel blocks
available to use. It can't reclaim, maybe it can move but not with current
code (it should be able to with the Memory Compaction patches).
> Assumption doesn't reflect current code. Enforcing this assumption
> would cost extra overhead. The amount of effort to make Christoph's
> approach work reliably seems substantial and I have no idea whether it
> would be worth it.
>
Current code doesn't reflect your assumptions simply because the costs are so
high. We'd need to be really sure it's worth it and if the answer is "yes",
then we are looking at Andrea's approach (more likely) or I can check out
evicting blocks of 16KB, 64KB or whatever the large block is.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
[email protected] (Mel Gorman) writes:
> On (15/09/07 14:14), Goswin von Brederlow didst pronounce:
>> Andrew Morton <[email protected]> writes:
>>
>> > On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[email protected]> wrote:
>> >
>> >> While I agree with your concern, those numbers are quite silly. The
>> >> chances of 99.8% of pages being free and the remaining 0.2% being
>> >> perfectly spread across all 2MB large_pages are lower than those of SHA1
>> >> creating a collision.
>> >
>> > Actually it'd be pretty easy to craft an application which allocates seven
>> > pages for pagecache, then one for <something>, then seven for pagecache, then
>> > one for <something>, etc.
>> >
>> > I've had test apps which do that sort of thing accidentally. The result
>> > wasn't pretty.
>>
>> Except that the application's 7 pages are movable and the <something>
>> would have to be unmovable. And then they should not share the same
>> memory region. At least they should never be allowed to interleave in
>> such a pattern on a larger scale.
>>
>
> It is actually really easy to force regions to never share. At the
> moment, there is a fallback list that determines a preference for what
> block to mix.
>
> The reason why this isn't enforced is the cost of moving. On x86 and
> x86_64, a block of interest is usually 2MB or 4MB. Clearing out one of
> those pages to prevent any mixing would be bad enough. On PowerPC, it's
> potentially 16MB. On IA64, it's 1GB.
>
> As this was fragmentation avoidance, not guarantees, the decision was
> made to not strictly enforce the types of pages within a block as the
> cost cannot be made back unless the system was making agressive use of
> large pages. This is not the case with Linux.
I'm not saying the group should never be mixed. The movable objects could
be moved out on demand. If 64k gets allocated then up to 64k gets
moved. That would reduce the impact as the kernel does not hang while
it moves 2MB or even 1GB. It also allows objects to be freed and the
space reused in the unmovable and mixed groups. There could also be a
certain number or percentage of mixed groups allowed, to further
increase the chance of movable objects freeing themselves from mixed
groups.
But when you already have, say, 10% of the ram in mixed groups then it
is a sign that external fragmentation is happening and some time should
be spent on moving movable objects.
>> The only way a fragmentation catastroph can be (proovable) avoided is
>> by having so few unmovable objects that size + max waste << ram
>> size. The smaller the better. Allowing movable and unmovable objects
>> to mix means that max waste goes way up. In your example waste would
>> be 7*size. With 2MB uper order limit it would be 511*size.
>>
>> I keep coming back to the fact that movable objects should be moved
>> out of the way for unmovable ones. Anything else just allows
>> fragmentation to build up.
>>
>
> This is easily achieved, just really really expensive because of the
> amount of copying that would have to take place. It would also compel
> that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely
> MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is
> a lot of free memory to keep around which is why fragmentation avoidance
> doesn't do it.
In your sample graphics you had 1152 groups. Reserving a few of those
doesn't sound too bad. And how many migrate types are we talking about? So
far we only had movable and unmovable. I would split unmovable into
short term (caches, I/O pages) and long term (task structures,
dentries). Reserving 6 groups for short term unmovable and long term
unmovable would be 1% of ram in your situation.
Maybe instead of reserving one could say that you can have up to 6
groups of space not used by unmovable objects before aggressive moving
starts. I don't quite see why you NEED reserving as long as there is
enough space free altogether in case something needs moving. 1 group
worth of space free might be plenty to move stuff to. Note that all
the virtual pages can be stuffed in every little free space there is
and reassembled by the MMU. There is no space lost there.
But until one tries one can't say.
MfG
Goswin
PS: How do allocations pick groups? Could one use the oldest group
dedicated to each MIGRATE_TYPE? Or lowest address for unmovable and
highest address for movable? Something to better keep the two out of
each other way.
Jörn Engel <[email protected]> writes:
> On Sun, 16 September 2007 00:30:32 +0200, Andrea Arcangeli wrote:
>>
>> Movable? I rather assume all slab allocations aren't movable. Then
>> slab defrag can try to tackle on users like dcache and inodes. Keep in
>> mind that with the exception of updatedb, those inodes/dentries will
>> be pinned and you won't move them, which is why I prefer to consider
>> them not movable too... since there's no guarantee they are.
>
> I have been toying with the idea of having seperate caches for pinned
> and movable dentries. Downside of such a patch would be the number of
> memcpy() operations when moving dentries from one cache to the other.
> Upside is that a fair amount of slab cache can be made movable.
> memcpy() is still faster than reading an object from disk.
How probable is it that the dentry is needed again? If you copy it and
it is not needed then you wasted time. If you throw it out and it is
needed then you wasted time too. Depending on the probability one of
the two is cheaper overall. Ideally I would throw away dentries that
haven't been accessed recently and copy recently used ones.
How much of a system's ram is spent on dentries? How much on task
structures? Does anyone have some stats on that? If it is <10% of the
total ram combined then I don't see much point in moving them. Just
keep them out of the way of user memory so the buddy system can work
effectively.
> Most likely the current reaction to such a patch would be to shoot it
> down due to overhead, so I didn't pursue it. All I have is an old patch
> to seperate never-cached from possibly-cached dentries. It will
> increase the odds of freeing a slab, but provide no guarantee.
>
> But the point here is: dentries/inodes can be made movable if there are
> clear advantages to it. Maybe they should?
>
> Jörn
MfG
Goswin
[email protected] (Mel Gorman) writes:
> On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
>> Mel Gorman <[email protected]> writes:
>>
>> > On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
>> >> Nick Piggin <[email protected]> writes:
>> >>
>> >> > In my attack, I cause the kernel to allocate lots of unmovable allocations
>> >> > and deplete movable groups. I theoretically then only need to keep a
>> >> > small number (1/2^N) of these allocations around in order to DoS a
>> >> > page allocation of order N.
>> >>
>> >> I'm assuming that when an unmovable allocation hijacks a movable group
>> >> any further unmovable alloc will evict movable objects out of that
>> >> group before hijacking another one. right?
>> >>
>> >
>> > No eviction takes place. If an unmovable allocation gets placed in a
>> > movable group, then steps are taken to ensure that future unmovable
>> > allocations will take place in the same range (these decisions take
>> > place in __rmqueue_fallback()). When choosing a movable block to
>> > pollute, it will also choose the lowest possible block in PFN terms to
>> > steal so that fragmentation pollution will be as confined as possible.
>> > Evicting the unmovable pages would be one of those expensive steps that
>> > have been avoided to date.
>>
>> But then you can have all blocks filled with movable data, free 4K in
>> one group, allocate 4K unmovable to take over the group, free 4k in
>> the next group, take that group and so on. You can end with 4k
>> unmovable in every 64k easily by accident.
>>
>
> As the mixing takes place at the lowest possible block, it's
> exceptionally difficult to trigger this. Possible, but exceptionally
> difficult.
Why is it difficult?
When user space allocates memory wouldn't it get it contiguously? I mean
that is one of the goals, to use larger contiguous allocations and map
them with a single page table entry where possible, right? And then
you can roughly predict where an munmap() would free a page.
Say the application maps a few GB of file, uses madvise to tell
the kernel it needs a 2MB block (to get a contiguous 2MB chunk
mapped), waits for it and then munmaps 4K in there. A 4k hole for some
unmovable object to fill. If you can then trigger the creation of an
unmovable object as well (stat some file?) and loop, you will fill the
ram quickly. Maybe it only works 10% of the time, but then you just do it
10 times as often.
Over long times it could occur naturally. This is just to demonstrate
it with malice.
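Roughly along these lines, as an untested sketch (the file name, the 2MB
chunk size and the stat() target are arbitrary placeholders):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define CHUNK (2UL * 1024 * 1024)       /* the block size of interest */

int main(void)
{
        struct stat st;
        int fd = open("bigfile", O_RDONLY);
        off_t off, size;
        char *map;

        if (fd < 0 || fstat(fd, &st) < 0)
                return 1;
        size = st.st_size;

        map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED)
                return 1;

        for (off = 0; off + (off_t)CHUNK <= size; off += CHUNK) {
                struct stat tmp;
                volatile char c;

                /* Ask for the chunk, touch it, then punch a 4k hole in it. */
                madvise(map + off, CHUNK, MADV_WILLNEED);
                c = map[off];
                (void)c;
                munmap(map + off, 4096);

                /* Trigger some unmovable kernel allocation. */
                stat("/etc/hostname", &tmp);
        }
        return 0;
}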
> As I have stated repeatedly, the guarantees can be made but potential
> hugepage allocation did not justify it. Large blocks might.
>
>> There should be a lot of preassure for movable objects to vacate a
>> mixed group or you do get fragmentation catastrophs.
>
> We (Andy Whitcroft and I) did implement something like that. It hooked into
> kswapd to clean mixed blocks. If the caller could do the cleaning, it
> did the work instead of kswapd.
Do you have a graphic like
http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg
for that case?
>> Looking at my
>> little test program evicting movable objects from a mixed group should
>> not be that expensive as it doesn't happen often.
>
> It happens regularly if the size of the block you need to keep clean is
> lower than min_free_kbytes. In the case of hugepages, that was always
> the case.
That assumes that the number of groups allocated for unmovable objects
will continuously grow and shrink. I'm assuming it will level off at
some size for long times (hours) under normal operations. There should
be some buffering of a few groups to be held back in reserve when it
shrinks to prevent the scenario that the size is just at a group
boundary and always grows/shrinks by 1 group.
>> The cost of it
>> should be freeing some pages (or finding free ones in a movable group)
>> and then memcpy.
>
> Freeing pages is not cheap. Copying pages is cheaper but not cheap.
To copy you need a free page as destination. That's all I
meant. Hopefully there will always be a free one and the actual freeing
is done asynchronously from the copying.
>> So if
>> you evict movable objects from mixed group when needed all the
>> pagetable pages would end up in the same mixed group slowly taking it
>> over completly. No fragmentation at all. See how essential that
>> feature is. :)
>>
>
> To move pages, there must be enough blocks free. That is where
> min_free_kbytes had to come in. If you cared only about keeping 64KB
> chunks free, it makes sense but it didn't in the context of hugepages.
I'm more concerned with keeping the little unmovable things out of the
way. Those are the things that will fragment the memory and prevent
any huge pages from being available even with moving other stuff out of the
way.
It would also already be a big plus to have 64k contiguous chunks for
many operations. Guarantee that the filesystem and block layers can
always get such a page (by means of copying pages out of the way when
needed) and do even larger pages speculatively.
But as you say that is where min_free_kbytes comes in. To have the
chance of a 2MB contiguous free chunk you need to have much more
reserved free. Something I would certainly do on a 16GB ram
server. Note that the buddy system will be better at having large free
chunks if user space allocates and frees large chunks as well. It is
much easier to make a 128K chunk out of 64K chunks than out of 4K
chunks. It is much more probable to get 2 adjacent free 64k chunks than
32 4K chunks in series.
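Back-of-envelope illustration, assuming (unrealistically) that each chunk
is free independently with the same probability p:

#include <math.h>
#include <stdio.h>

int main(void)
{
        double p = 0.5; /* assumed probability that any one chunk is free */

        /* A 128K region needs 32 free 4K pages but only 2 free 64K chunks. */
        printf("built from 4K pages:   %g\n", pow(p, 32));
        printf("built from 64K chunks: %g\n", pow(p, 2));
        return 0;
}

With p = 0.5 that is roughly 2.3e-10 versus 0.25.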
>> > These type of pictures feel somewhat familiar
>> > (http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg).
>>
>> Those look a lot better and look like they are actually real kernel
>> data. How did you make them and can one create them in real-time (say
>> once a second or so)?
>>
>
> It's from a real kernel. When I was measuring this stuff, I took a
> sample every 2 seconds.
Can you tell me how? I would like to do the same.
>> There seem to be an awfull lot of pinned pages inbetween the movable.
>
> It wasn't grouping by mobility at the time.
That might explain a lot. I was thinking that was with grouping. But
without grouping such a rather random distribution of unmovable
objects doesn't sound uncommon.
>> I would verry much like to see the same data with evicting of movable
>> pages out of mixed groups. I see not a single movable group while with
>> strict eviction there could be at most one mixed group per order.
>>
>
> With strict eviction, there would be no mixed blocks period.
I meant on-demand eviction. Not evicting the whole group just because
we need 4k of it. Even looser would be to allow say 16 mixed groups
before moving pages out of the way when needed for an alloc. Best to
make it a proc entry and then change the value once a day and see how
it behaves. :)
MfG
Goswin
On Mon, 17 September 2007 00:06:24 +0200, Goswin von Brederlow wrote:
>
> How probable is it that the dentry is needed again? If you copy it and
> it is not needed then you wasted time. If you throw it out and it is
> needed then you wasted time too. Depending on the probability one of
> the two is cheaper overall. Idealy I would throw away dentries that
> haven't been accessed recently and copy recently used ones.
>
> How much of a systems ram is spend on dentires? How much on task
> structures? Does anyone have some stats on that? If it is <10% of the
> total ram combined then I don't see much point in moving them. Just
> keep them out of the way of users memory so the buddy system can work
> effectively.
As usual, the answer is "it depends". I've had up to 600MB in dentry
and inode slabs on a 1GiB machine after updatedb. This machine
currently has 13MB in dentries, which seems to be reasonable for my
purposes.
Jörn
--
Audacity augments courage; hesitation, fear.
-- Publilius Syrus
Linus Torvalds <[email protected]> writes:
> On Sun, 16 Sep 2007, Jörn Engel wrote:
>>
>> My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc.
>> which are pinned for their entire lifetime and another for regular
>> files/inodes. One could take a three-way approach and have
>> always-pinned, often-pinned and rarely-pinned.
>>
>> We won't get never-pinned that way.
>
> That sounds pretty good. The problem, of course, is that most of the time,
> the actual dentry allocation itself is done before you really know which
> case the dentry will be in, and the natural place for actually giving the
> dentry lifetime hint is *not* at "d_alloc()", but when we "instantiate"
> it with d_add() or d_instantiate().
>
> But it turns out that most of the filesystems we care about already use a
> special case of "d_add()" that *already* replaces the dentry with another
> one in some cases: "d_splice_alias()".
>
> So I bet that if we just taught "d_splice_alias()" to look at the inode,
> and based on the inode just re-allocate the dentry to some other slab
> cache, we'd already handle a lot of the cases!
>
> And yes, you'd end up with the reallocation overhead quite often, but at
> least it would now happen only when filling in a dentry, not in the
> (*much* more critical) cached lookup path.
>
> Linus
You would only get it for dentries that live long (or your prediction
is awfully wrong) and then the reallocation amortizes over time if you
will. :)
MfG
Goswin
[email protected] (Mel Gorman) writes:
> On (16/09/07 17:08), Andrea Arcangeli didst pronounce:
>> zooming in I see red pixels all over the squares mixed with green
>> pixels in the same square. This is exactly what happens with the
>> variable order page cache and that's why it provides zero guarantees
>> in terms of how much ram is really "free" (free as in "available").
>>
>
> This picture is not from a kernel running grouping pages by mobility, so
> that is hardly a surprise. This is what the normal kernel looks like. Look
> at the videos in
> http://www.skynet.ie/~mel/anti-frag/2007-02-28 and see how list-based
> compares to vanilla. These are from February when there was less control
> over mixing blocks than there is today.
>
> In the current version mixing occurs in the lower blocks as much as possible
> not the upper ones. So there are a number of mixed blocks but the number is
> kept to a minimum.
>
> The number of mixed blocks could have been enforced as 0, but I felt it was
> better in the general case to fragment rather than regress performance.
> That may be different for large blocks where you will want to take the
> enforcement steps.
I agree that 0 is a bad value. But so is infinity. There should be
some mixing but not a lot. You say "kept to a minimum". Is that
actively done or does it already happen by itself? Hopefully the latter,
which would be just splendid.
>> With config-page-shift mmap works on 4k chunks but it's always backed
>> by 64k or any other largesize that you choosed at compile time. And if
But would mapping a random 4K page out of a file then consume 64k?
That sounds like an awful lot of internal fragmentation. I hope the
unaligned bits and pieces get put into a slab or something as you
suggested previously.
>> the virtual alignment of mmap matches the physical alignment of the
>> physical largepage and is >= PAGE_SIZE (software PAGE_SIZE I mean) we
>> could use the 62nd bit of the pte to use a 64k tlb (if future cpus
>> will allow that). Nick also suggested to still set all ptes equal to
>> make life easier for the tlb miss microcode.
It is too bad that existing amd64 CPUs only allow such large physical
pages. But it kind of makes sense to cut away a full level of page
tables for the next bigger size each.
>> > big you can make it. I don't think my system with 1GB ram would work
>> > so well with 2MB order 0 pages. But I wasn't refering to that but to
>> > the picture.
>>
>> Sure! 2M is sure way excessive for a 1G system, 64k most certainly
>> too, of course unless you're running a db or a multimedia streaming
>> service, in which case it should be ideal.
rtorrent, Xemacs/gnus, bash, xterm, zsh, make, gcc, galeon and the
occasional mplayer.
I would mostly be concerned about how rtorrent's totally random access of
mmapped files negatively impacts such a 64k page system.
MfG
Goswin
Andrea Arcangeli <[email protected]> writes:
> You ignore one other bit, when "/usr/bin/free" says 1G is free, with
> config-page-shift it's free no matter what, same goes for not mlocked
> cache. With variable order page cache, /usr/bin/free becomes mostly a
> lie as long as there's no 4k fallback (like fsblock).
% free
total used free shared buffers cached
Mem: 1398784 1372956 25828 0 225224 321504
-/+ buffers/cache: 826228 572556
Swap: 1048568 20 1048548
When has free ever given any useful "free" number? I can perfectly
well allocate another gigabyte of memory despite free saying 25MB. But
that is because I know that the buffers/cache are not locked in.
On the other hand 1GB can instantly vanish when I start a xen domain
and anything relying on the free value would lose.
The only sensible thing for an application concerned with swapping is
to watch the swapping and then reduce itself, not the amount
free. Although I wish there were some kernel interface to get a
pressure value of how valuable free pages would be right now. I would
like that for fuse, so a userspace filesystem can do caching without
crippling the kernel.
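The "watch the swapping" part can at least be approximated from userspace
today by sampling counters such as pswpout in /proc/vmstat (untested
sketch):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

static unsigned long read_vmstat(const char *key)
{
        char name[64];
        unsigned long val = 0;
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return 0;
        while (fscanf(f, "%63s %lu", name, &val) == 2) {
                if (!strcmp(name, key)) {
                        fclose(f);
                        return val;
                }
        }
        fclose(f);
        return 0;
}

int main(void)
{
        unsigned long before = read_vmstat("pswpout");

        sleep(10);      /* or: do a round of cache-hungry work */

        if (read_vmstat("pswpout") > before)
                printf("swap-out happened: time to shrink the cache\n");
        return 0;
}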
MfG
Goswin
On Fri, Sep 14, 2007 at 06:48:55AM +1000, Nick Piggin wrote:
> On Thursday 13 September 2007 12:01, Nick Piggin wrote:
> > On Thursday 13 September 2007 23:03, David Chinner wrote:
> > > Then just do operations on directories with lots of files in them
> > > (tens of thousands). Every directory operation will require at
> > > least one vmap in this situation - e.g. a traversal will result in
> > > lots and lots of blocks being read that will require vmap() for every
> > > directory block read from disk and an unmap almost immediately
> > > afterwards when the reference is dropped....
> >
> > Ah, wow, thanks: I can reproduce it.
>
> OK, the vunmap batching code wipes your TLB flushing and IPIs off
> the table. Diffstat below, but the TLB portions are here (besides that
> _everything_ is probably lower due to less TLB misses caused by the
> TLB flushing):
>
> -170 -99.4% sn2_send_IPI
> -343 -100.0% sn_send_IPI_phys
> -17911 -99.9% smp_call_function
>
> Total performance went up by 30% on a 64-way system (248 seconds to
> 172 seconds to run parallel finds over different huge directories).
Good start, Nick ;)
>
> 23012 54790.5% _read_lock
> 9427 329.0% __get_vm_area_node
> 5792 0.0% __find_vm_area
> 1590 53000.0% __vunmap
....
_read_lock? I thought vmap() and vunmap() only took the vmlist_lock in
write mode.....
> Next I have some patches to scale the vmap locks and data
> structures better, but they're not quite ready yet. This looks like it
> should result in a further speedup of several times when combined
> with the TLB flushing reductions here...
Sounds promising - when you have patches that work, send them my
way and I'll run some tests on them.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On (17/09/07 00:38), Goswin von Brederlow didst pronounce:
> [email protected] (Mel Gorman) writes:
>
> > On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
> >> Mel Gorman <[email protected]> writes:
> >>
> >> > On Fri, 2007-09-14 at 18:10 +0200, Goswin von Brederlow wrote:
> >> >> Nick Piggin <[email protected]> writes:
> >> >>
> >> >> > In my attack, I cause the kernel to allocate lots of unmovable allocations
> >> >> > and deplete movable groups. I theoretically then only need to keep a
> >> >> > small number (1/2^N) of these allocations around in order to DoS a
> >> >> > page allocation of order N.
> >> >>
> >> >> I'm assuming that when an unmovable allocation hijacks a movable group
> >> >> any further unmovable alloc will evict movable objects out of that
> >> >> group before hijacking another one. right?
> >> >>
> >> >
> >> > No eviction takes place. If an unmovable allocation gets placed in a
> >> > movable group, then steps are taken to ensure that future unmovable
> >> > allocations will take place in the same range (these decisions take
> >> > place in __rmqueue_fallback()). When choosing a movable block to
> >> > pollute, it will also choose the lowest possible block in PFN terms to
> >> > steal so that fragmentation pollution will be as confined as possible.
> >> > Evicting the unmovable pages would be one of those expensive steps that
> >> > have been avoided to date.
> >>
> >> But then you can have all blocks filled with movable data, free 4K in
> >> one group, allocate 4K unmovable to take over the group, free 4k in
> >> the next group, take that group and so on. You can end with 4k
> >> unmovable in every 64k easily by accident.
> >>
> >
> > As the mixing takes place at the lowest possible block, it's
> > exceptionally difficult to trigger this. Possible, but exceptionally
> > difficult.
>
> Why is it difficult?
>
Unless mlock() is being used, it is difficult to place the pages in the
way you suggest. Feel free to put together a test program that forces an
adverse fragmentation situation; it'll be useful in the future for comparing
reliability of any large block solution.
> When user space allocates memory wouldn't it get it contiously?
Not unless it's using libhugetlbfs or it's very very early in the
lifetime of the system. Even then, another process faulting at the same
time will break it up.
> I mean
> that is one of the goals, to use larger continious allocations and map
> them with a single page table entry where possible, right?
It's a goal ultimately but not what we do right now. There have been
suggestions of allocating the contiguous pages optimistically in the
fault path and later promoting with an arch-specific hook but it's
vapourware right now.
> And then
> you can roughly predict where an munmap() would free a page.
>
> Say the application does map a few GB of file, uses madvice to tell
> the kernel it needs a 2MB block (to get a continious 2MB chunk
> mapped), waits for it and then munmaps 4K in there. A 4k hole for some
> unmovable object to fill.
With grouping pages by mobility, that 4K hole would be on the movable
free lists. To get an unmovable allocation in there, the system needs to
be under considerable stress. Even just raising min_free_kbytes a bit
would make it considerably harder.
With the standard kernel, it would be easier to place as you suggest.
> If you can then trigger the creation of an
> unmovable object as well (stat some file?) and loop you will fill the
> ram quickly. Maybe it only works in 10% but then you just do it 10
> times as often.
>
> Over long times it could occur naturally. This is just to demonstrate
> it with malice.
>
Try writing such a program. I'd be interested in reading it.
> > As I have stated repeatedly, the guarantees can be made but potential
> > hugepage allocation did not justify it. Large blocks might.
> >
> >> There should be a lot of preassure for movable objects to vacate a
> >> mixed group or you do get fragmentation catastrophs.
> >
> > We (Andy Whitcroft and I) did implement something like that. It hooked into
> > kswapd to clean mixed blocks. If the caller could do the cleaning, it
> > did the work instead of kswapd.
>
> Do you have a graphic like
> http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg
> for that case?
>
Not at the moment. I don't have any of the patches to accurately reflect
it up to date. The eviction patches have rotted to the point of
unusability.
> >> Looking at my
> >> little test program evicting movable objects from a mixed group should
> >> not be that expensive as it doesn't happen often.
> >
> > It happens regularly if the size of the block you need to keep clean is
> > lower than min_free_kbytes. In the case of hugepages, that was always
> > the case.
>
> That assumes that the number of groups allocated for unmovable objects
> will continiously grow and shrink.
They do grow and shrink. The number of pagetables in use changes for
example.
> I'm assuming it will level off at
> some size for long times (hours) under normal operations.
It doesn't unless you assume the system remains in a steady state for its
lifetime. Things like updatedb tend to throw a spanner into the works.
> There should
> be some buffering of a few groups to be held back in reserve when it
> shrinks to prevent the scenario that the size is just at a group
> boundary and always grows/shrinks by 1 group.
>
And what size should this reserve of groups be so that all workloads function?
> >> The cost of it
> >> should be freeing some pages (or finding free ones in a movable group)
> >> and then memcpy.
> >
> > Freeing pages is not cheap. Copying pages is cheaper but not cheap.
>
> To copy you need a free page as destination. Thats all I
> ment. Hopefully there will always be a free one and the actual freeing
> is done asynchronously from the copying.
>
If the pages are free, finding them isn't that hard. The memory
compaction patches were able to do it.
> >> So if
> >> you evict movable objects from mixed group when needed all the
> >> pagetable pages would end up in the same mixed group slowly taking it
> >> over completly. No fragmentation at all. See how essential that
> >> feature is. :)
> >>
> >
> > To move pages, there must be enough blocks free. That is where
> > min_free_kbytes had to come in. If you cared only about keeping 64KB
> > chunks free, it makes sense but it didn't in the context of hugepages.
>
> I'm more concerned with keeping the little unmovable things out of the
> way. Those are the things that will fragment the memory and prevent
> any huge pages to be available even with moving other stuff out of the
> way.
>
That's fair, just not cheap.
> It would also already be a big plus to have 64k continious chunks for
> many operations. Guaranty that the filesystem and block layers can
> always get such a page (by means of copying pages out of the way when
> needed) and do even larger pages speculative.
>
64K contiguous chunks are what Andrea's approach seeks to provide, so let's
see what that looks like. Or let's see if Nick's approach makes the whole
exercise pointless.
> But as you say that is where min_free_kbytes comes in. To have the
> chance of a 2MB continious free chunk you need to have much more
> reserved free. Something I would certainly do on a 16GB ram
> server. Note that the buddy system will be better at having large free
> chunks if user space allocates and frees large chunks as well.
This is true and one of the reasons that Andrea's approach will go a
long way for large blocks. This observation is also why I suggested
having slub_min_order set the same as the largeblock size when
Christoph's approach was used to have this allocating/freeing of
same-sized chunks.
> It is
> much easier to make an 128K chunk out of 64K chunks than out of 4K
> chunks. Much more probable to get 2 64k chunks adjacent than 32 4K
> chunks in series.
>
> >> > These type of pictures feel somewhat familiar
> >> > (http://www.skynet.ie/~mel/anti-frag/2007-02-28/page_type_distribution.jpg).
> >>
> >> Those look a lot better and look like they are actually real kernel
> >> data. How did you make them and can one create them in real-time (say
> >> once a second or so)?
> >>
> >
> > It's from a real kernel. When I was measuring this stuff, I took a
> > sample every 2 seconds.
>
> Can you tell me how? I would like to do the same.
>
They were generated using trace_allocmap kernel module in
http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.81-rc2.tar.gz
in combination with frag-display in the same package. However, in the
current version against current -mm's, it'll identify some movable pages
wrong. Specifically, it will appear to be mixing movable pages with slab
pages and it doesn't identify SLUB pages properly at all (SLUB came after
the last revision of this tool). I need to bring an annotation patch up to
date before it can generate the images correctly.
> >> There seem to be an awfull lot of pinned pages inbetween the movable.
> >
> > It wasn't grouping by mobility at the time.
>
> That might explain at lot. I was thinking that was with grouping. But
> without grouping such a rather random distribution of unmovable
> objects doesn't sound uncommon,.
>
It's expected even.
> >> I would verry much like to see the same data with evicting of movable
> >> pages out of mixed groups. I see not a single movable group while with
> >> strict eviction there could be at most one mixed group per order.
> >>
> >
> > With strict eviction, there would be no mixed blocks period.
>
> I ment on demand eviction. Not evicting the whole group just because
> we need 4k of it. Even looser would be to allow say 16 mixed groups
> before moving pages out of the way when needed for an alloc. Best to
> make it a proc entry and then change the value once a day and see how
> it behaves. :)
>
That is a tunable I'd rather avoid because it'll be impossible to recommend
a proper value - I would sooner recommend increasing min_free_kbytes. It's
not something I'm likely to investigate until we really know we need
it. I would prefer to see how Nick's, Christoph's and Andrea's approaches
get on before taking those types of steps.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On (17/09/07 00:48), Goswin von Brederlow didst pronounce:
> [email protected] (Mel Gorman) writes:
>
> > On (16/09/07 17:08), Andrea Arcangeli didst pronounce:
> >> zooming in I see red pixels all over the squares mixed with green
> >> pixels in the same square. This is exactly what happens with the
> >> variable order page cache and that's why it provides zero guarantees
> >> in terms of how much ram is really "free" (free as in "available").
> >>
> >
> > This picture is not from a kernel running grouping pages by mobility, so
> > that is hardly a surprise. This is what the normal kernel looks like. Look
> > at the videos in
> > http://www.skynet.ie/~mel/anti-frag/2007-02-28 and see how list-based
> > compares to vanilla. These are from February when there was less control
> > over mixing blocks than there is today.
> >
> > In the current version mixing occurs in the lower blocks as much as possible
> > not the upper ones. So there are a number of mixed blocks but the number is
> > kept to a minimum.
> >
> > The number of mixed blocks could have been enforced as 0, but I felt it was
> > better in the general case to fragment rather than regress performance.
> > That may be different for large blocks where you will want to take the
> > enforcement steps.
>
> I agree that 0 is a bad value. But so is infinity. There should be
> some mixing but not a lot. You say "kept to a minimum". Is that
> actively done or already happens by itself. Hopefully the later which
> would be just splendid.
>
Happens by itself due to biasing mixing blocks at lower PFNs. The exact
number is unknown. We used to track it a long time ago but not any more.
> >> With config-page-shift mmap works on 4k chunks but it's always backed
> >> by 64k or any other largesize that you choosed at compile time. And if
>
> But would mapping a random 4K page out of a file then consume 64k?
> That sounds like an awfull lot of internal fragmentation. I hope the
> unaligned bits and pices get put into a slab or something as you
> suggested previously.
>
This is a possibility but Andrea seems confident he can handle it.
> >> the virtual alignment of mmap matches the physical alignment of the
> >> physical largepage and is >= PAGE_SIZE (software PAGE_SIZE I mean) we
> >> could use the 62nd bit of the pte to use a 64k tlb (if future cpus
> >> will allow that). Nick also suggested to still set all ptes equal to
> >> make life easier for the tlb miss microcode.
>
> It is too bad that existing amd64 CPUs only allow such large physical
> pages. But it kind of makes sense to cut away a full level or page
> tables for the next bigger size each.
>
Yep on both counts.
> >> > big you can make it. I don't think my system with 1GB ram would work
> >> > so well with 2MB order 0 pages. But I wasn't refering to that but to
> >> > the picture.
> >>
> >> Sure! 2M is sure way excessive for a 1G system, 64k most certainly
> >> too, of course unless you're running a db or a multimedia streaming
> >> service, in which case it should be ideal.
>
> rtorrent, Xemacs/gnus, bash, xterm, zsh, make, gcc, galeon and the
> ocasional mplayer.
>
> I would mostly be concerned how rtorrents totaly random access of
> mmapped files negatively impacts such a 64k page system.
>
For what it's worth, the last allocation failure that occurred with
grouping pages by mobility was order-1 atomic failures for a wireless
network card when bittorrent was running. You're likely right in that
torrents will be an interesting workload in terms of fragmentation.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
> [email protected] (Mel Gorman) writes:
>
> > On (15/09/07 14:14), Goswin von Brederlow didst pronounce:
> >> Andrew Morton <[email protected]> writes:
> >>
> >> > On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel <[email protected]> wrote:
> >> >
> >> >> While I agree with your concern, those numbers are quite silly. The
> >> >> chances of 99.8% of pages being free and the remaining 0.2% being
> >> >> perfectly spread across all 2MB large_pages are lower than those of SHA1
> >> >> creating a collision.
> >> >
> >> > Actually it'd be pretty easy to craft an application which allocates seven
> >> > pages for pagecache, then one for <something>, then seven for pagecache, then
> >> > one for <something>, etc.
> >> >
> >> > I've had test apps which do that sort of thing accidentally. The result
> >> > wasn't pretty.
> >>
> >> Except that the application's 7 pages are movable and the <something>
> >> would have to be unmovable. And then they should not share the same
> >> memory region. At least they should never be allowed to interleave in
> >> such a pattern on a larger scale.
> >>
> >
> > It is actually really easy to force regions to never share. At the
> > moment, there is a fallback list that determines a preference for what
> > block to mix.
> >
> > The reason why this isn't enforced is the cost of moving. On x86 and
> > x86_64, a block of interest is usually 2MB or 4MB. Clearing out one of
> > those pages to prevent any mixing would be bad enough. On PowerPC, it's
> > potentially 16MB. On IA64, it's 1GB.
> >
> > As this was fragmentation avoidance, not guarantees, the decision was
> > made to not strictly enforce the types of pages within a block as the
> > cost cannot be made back unless the system was making agressive use of
> > large pages. This is not the case with Linux.
>
> I don't say the group should never be mixed. The movable objects could
> be moved out on demand. If 64k get allocated then up to 64k get
> moved.
This type of action makes sense in the context of Andrea's approach and
large blocks. I don't think it makes sense today to do it in the general
case, at least not yet.
> That would reduce the impact as the kernel does not hang while
> it moves 2MB or even 1GB. It also allows objects to be freed and the
> space reused in the unmovable and mixed groups. There could also be a
> certain number or percentage of mixed groups allowed to further
> increase the chance of movable objects freeing themselves from mixed
> groups.
>
> But when you already have say 10% of the ram in mixed groups then it
> is a sign that external fragmentation is happening and some time should be
> spent on moving movable objects.
>
I'll play around with it on the side and see what sort of results I get.
I won't be pushing anything any time soon in relation to this though.
For now, I don't intend to fiddle more with grouping pages by mobility
for something that may or may not be of benefit to a feature that hasn't
been widely tested with what exists today.
> >> The only way a fragmentation catastrophe can be (provably) avoided is
> >> by having so few unmovable objects that size + max waste << ram
> >> size. The smaller the better. Allowing movable and unmovable objects
> >> to mix means that max waste goes way up. In your example waste would
> >> be 7*size. With 2MB upper order limit it would be 511*size.
> >>
> >> I keep coming back to the fact that movable objects should be moved
> >> out of the way for unmovable ones. Anything else just allows
> >> fragmentation to build up.
> >>
> >
> > This is easily achieved, just really really expensive because of the
> > amount of copying that would have to take place. It would also compel
> > that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely
> > MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is
> > a lot of free memory to keep around which is why fragmentation avoidance
> > doesn't do it.
>
> In your sample graphics you had 1152 groups. Reserving a few of those
> doesn't sound too bad.
No, it doesn't, which is why on those systems I would suggest setting
min_free_kbytes to a higher value. It doesn't work as well on IA-64.
> And how many migrate types are we talking about? So far we only had
> movable and unmovable.
Movable, unmovable, reclaimable and reserve in the current incarnation
of grouping pages by mobility.
> I would split unmovable into
> short term (caches, I/O pages) and long term (task structures,
> dentries).
Mostly done as you suggest already. Dentries are considered reclaimable, not
long-lived, and are grouped with things like inode caches for example.
> Reserving 6 groups for short term unmovable and long term
> unmovable would be 1% of ram in your situation.
>
More groups = more cost although very easy to add them. A mixed type
used to exist but was removed again because it couldn't be proved to be
useful at the time.
> Maybe instead of reserving one could say that you can have up to 6
> groups of space
And if the groups are 1GB in size? I tried something like this already.
It didn't work out well at the time although I could revisit.
> not used by unmovable objects before aggressive moving
> starts. I don't quite see why you NEED reserving as long as there is
> enough space free altogether in case something needs moving.
hence, increase min_free_kbytes.
> 1 group
> worth of space free might be plenty to move stuff to. Note that all
> the virtual pages can be stuffed in every little free space there is
> and reassembled by the MMU. There is no space lost there.
>
What you suggest sounds similar to having a type MIGRATE_MIXED where you
allocate from when the preferred lists are full. It became a sizing
problem that never really worked out. As I said, I can try again.
> But until one tries one can't say.
>
> MfG
> Goswin
>
> PS: How do allocations pick groups?
Using GFP flags to identify the type.
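For illustration only, a minimal userspace sketch of that idea; the flag bits,
the helper name and the exact mapping below are assumptions made for this
sketch, not the actual code in the grouping-pages-by-mobility patches:

#include <stdio.h>

/* Assumed flag bits, for illustration only. */
#define __GFP_MOVABLE      0x1u	/* e.g. page cache, anonymous pages */
#define __GFP_RECLAIMABLE  0x2u	/* e.g. dentry/inode cache allocations */

enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE };

/* Hypothetical helper: map allocation flags to the block type to allocate from. */
static enum migratetype gfp_to_migratetype(unsigned int gfp_flags)
{
	if (gfp_flags & __GFP_MOVABLE)
		return MIGRATE_MOVABLE;
	if (gfp_flags & __GFP_RECLAIMABLE)
		return MIGRATE_RECLAIMABLE;
	return MIGRATE_UNMOVABLE;	/* kernel allocations with neither hint */
}

int main(void)
{
	printf("pagecache -> %d\n", gfp_to_migratetype(__GFP_MOVABLE));
	printf("dcache    -> %d\n", gfp_to_migratetype(__GFP_RECLAIMABLE));
	printf("kmalloc   -> %d\n", gfp_to_migratetype(0));
	return 0;
}

(The fourth "reserve" type mentioned above is presumably managed by the
allocator itself rather than picked by callers; that part is not modelled here.)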
> Could one use the oldest group
> dedicated to each MIGRATE_TYPE?
Age is difficult to determine so probably not.
> Or lowest address for unmovable and
> highest address for movable? Something to better keep the two out of
> each other's way.
We bias the location of unmovable and reclaimable allocations already. It's
not done for movable because it wasn't necessary (as they are easily
reclaimed or moved anyway).
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On (16/09/07 23:31), Andrea Arcangeli didst pronounce:
> On Sun, Sep 16, 2007 at 09:54:18PM +0100, Mel Gorman wrote:
> > The 16MB is the size of a hugepage, the size of interest as far as I am
> > concerned. Your idea makes sense for large block support, but much less
> > for huge pages because you are incurring a cost in the general case for
> > something that may not be used.
>
> Sorry for the misunderstanding, I totally agree!
>
Great. It's clear that we had different use cases in mind when we were
poking holes in each approach.
> > There is nothing to say that both can't be done. Raise the size of
> > order-0 for large block support and continue trying to group the block
> > to make hugepage allocations likely succeed during the lifetime of the
> > system.
>
> Sure, I completely agree.
>
> > At the risk of repeating, your approach will be adding a new and
> > significant dimension to the internal fragmentation problem where a
> > kernel allocation may fail because the larger order-0 pages are all
> > being pinned by userspace pages.
>
> This is exactly correct, some memory will be wasted. It'll reach 0
> free memory more quickly depending on which kind of applications are
> being run.
>
I look forward to seeing how you deal with it. When/if you get to trying
to move pages out of slabs, I suggest you take a look at the Memory
Compaction patches or the memory unplug patches for simple examples of
how to use page migration.
> > It improves the probabilty of hugepage allocations working because the
> > blocks with slab pages can be targetted and cleared if necessary.
>
> Agreed.
>
> > That suggestion was aimed at the large block support more than
> > hugepages. It helps large blocks because we'll be allocating and freeing
> > as more or less the same size. It certainly is easier to set
> > slub_min_order to the same size as what is needed for large blocks in
> > the system than introducing the necessary mechanics to allocate
> > pagetable pages and userspace pages from slab.
>
> Allocating userpages from slab in 4k chunks with a 64k PAGE_SIZE is
> really complex indeed. I'm not planning for that in the short term but
> it remains a possibility to make the kernel more generic. Perhaps it
> could worth it...
>
Perhaps.
> Allocating ptes from slab is fairly simple but I think it would be
> better to allocate ptes in PAGE_SIZE (64k) chunks and preallocate the
> nearby ptes in the per-task local pagetable tree, to reduce the number
> of locks taken and not to enter the slab at all for that.
It runs the risk of pinning up to 60K of data per task that is unusable for
any other purpose. On average, it'll be more like 32K but worth keeping
in mind.
> Infact we
> could allocate the 4 levels (or anyway more than one level) in one
> single alloc_pages(0) and track the leftovers in the mm (or similar).
>
> > I'm not sure what you are getting at here. I think it would make more
> > sense if you said "when you read /proc/buddyinfo, you know the order-0
> > pages are really free for use with large blocks" with your approach.
>
> I'm unsure who reads /proc/buddyinfo (that can change a lot and that
> is not very significant information if the vm can defrag well inside
> the reclaim code),
I read it although as you say, it's difficult to know what will happen if
you try and reclaim memory. It's why there is also a /proc/pagetypeinfo so
one can see the number of movable blocks that exist. That leads to better
guessing. In -mm, you can also see the number of mixed blocks but that will
not be available in mainline.
> but it's not much different and it's more about
> knowing the real meaning of /proc/meminfo, freeable (unmapped) cache,
> anon ram, and free memory.
>
> The idea is that for an mmap over a large xfs file to succeed with
> mlockall being invoked, those largepages must become available or
> it'll be oom despite there still being 512M free... I'm quite sure
> admins will get confused if they get the oom killer invoked with lots of
> ram still free.
>
> The overcommit feature will also break, just to make an example (so
> much for overcommit 2 guaranteeing -ENOMEM retvals instead of oom
> killage ;).
>
> > All this aside, there is nothing mutually exclusive with what you are proposing
> > and what grouping pages by mobility does. Your stuff can exist even if grouping
> > pages by mobility is in place. In fact, it'll help us by giving an important
> > comparison point as grouping pages by mobility can be trivially disabled with
> > a one-liner for the purposes of testing. If your approach is brought to being
> > a general solution that also helps hugepage allocation, then we can revisit
> > grouping pages by mobility by comparing kernels with it enabled and without.
>
> Yes, I totally agree. It sounds worthwhile to have a good defrag logic
> in the VM. Even allocating the kernel stack in today's kernels should be
> able to benefit from your work. It's just comparing a fork() failure
> (something that will happen with ulimit -n too and that apps must be
> able to deal with) with an I/O failure that worries me a bit.
I agree here with the IO failure. It's why I've been saying that Nick's
approach is what was needed to make large blocks a fully supported
feature. Your approach may work out as well. While those are in development,
Christoph's approach will let us know how regularly these failures actually
occur so we'll know in advance how often the slower paths in both Nick's
and your approach will be exercised.
> I'm
> quite sure a db failing I/O will not recover too nicely. If fork fails
> that's most certainly ok... at worst a new client won't be able to
> connect and he can retry later. Plus order 1 isn't really a big deal,
> you know the probability of success decreases exponentially with the
> order.
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Christoph Lameter wrote:
> True. That is why we want to limit the number of unmovable allocations and
> that is why ZONE_MOVABLE exists to limit those. However, unmovable
> allocations are already rare today. The overwhelming majority of
> allocations are movable and reclaimable. You can see that f.e. by looking
at /proc/meminfo and seeing how high SUnreclaim: is (it does not catch
everything but it's a good indicator).
Just to inject another factor into the discussion, please remember that
Linux also runs on nommu systems, where things like user space
allocations are neither movable nor reclaimable.
Bernd
--
This footer brought to you by insane German lawmakers.
On Saturday 15 September 2007 03:52, Christoph Lameter wrote:
> On Fri, 14 Sep 2007, Nick Piggin wrote:
> > > > [*] ok, this isn't quite true because if you can actually put a hard
> > > > limit on unmovable allocations then anti-frag will fundamentally help
> > > > -- get back to me on that when you get patches to move most of the
> > > > obvious ones.
> > >
> > > We have this hard limit using ZONE_MOVABLE in 2.6.23.
> >
> > So we're back to 2nd class support.
>
> 2nd class support for me means a feature that is not enabled by default
> but that can be enabled in order to increase performance. The 2nd class
> support is there because we are not yet sure about the maturity of the
> memory allocation methods.
I'd rather have an approach that does not require all these hacks.
> > > Reserve pools as handled (by the not yet available) large page pool
> > > patches (which again has altogether another purpose) are not a limit.
> > > The reserve pools are used to provide a minimum of higher order pages
> > > that is not broken down in order to ensure that a minimum number of the
> > > desired order of pages is even available in your worst case scenario.
> > > Mainly I think that is needed during the period when memory
> > > defragmentation is still under development.
> >
> > fsblock doesn't need any of those hacks, of course.
>
> Nor does mine for the low orders that we are considering. For order >
> MAX_ORDER this is unavoidable since the page allocator cannot manage such
> large pages. It can be used for lower order if there are issues (that I
> have not seen yet).
Or we can just avoid all doubt (and not have arbitrary limitations
according to what you think might be reasonable or how well the
system actually behaves).
On Saturday 15 September 2007 04:08, Christoph Lameter wrote:
> On Fri, 14 Sep 2007, Nick Piggin wrote:
> > However fsblock can do everything that higher order pagecache can
> > do in terms of avoiding vmap and giving contiguous memory to block
> > devices by opportunistically allocating higher orders of pages, and
> > falling back to vmap if they cannot be satisfied.
>
> fsblock is restricted to the page cache and cannot be used in other
> contexts where subsystems can benefit from larger linear memory.
Unless you believe higher order pagecache is not restricted to the
pagecache, can we please just stick on topic of fsblock vs higher
order pagecache?
> > So if you argue that vmap is a downside, then please tell me how you
> > consider the -ENOMEM of your approach to be better?
>
> That is again pretty undifferentiated. Are we talking about low page
In general.
> orders? There we will reclaim all of the reclaimable memory before getting
> an -ENOMEM. Given the quantities of pages on today's machines--a 1 TB machine
> has 256 million 4k pages--and the unmovable ratios we see today it
> would require a very strange setup to get an allocation failure while
> still being able to allocate order 0 pages.
So much handwaving. 1TB machines without "very strange setups"
(where very strange is something arbitrarily defined by you) I guess make
up 0.0000001% of Linux installations.
> With the ZONE_MOVABLE you can remove the unmovable objects into a defined
> pool then higher order success rates become reasonable.
OK, if you rely on reserve pools, then it is not 1st class support and hence
it is a non-solution to VM and IO scalability problems.
> > If, by special software layer, you mean the vmap/vunmap support in
> > fsblock, let's see... that's probably all of a hundred or two lines.
> > Contrast that with anti-fragmentation, lumpy reclaim, higher order
> > pagecache and its new special mmap layer... Hmm, seems like a no
> > brainer to me. You really still want to pursue the "extra layer"
> > argument as a point against fsblock here?
>
> Yes sure. Your code could not live without these approaches. Without the
Actually: your code is the one that relies on higher order allocations. Now
you're trying to turn that into an argument against fsblock?
> antifragmentation measures your fsblock code would not be very successful
> in getting the larger contiguous segments you need to improve performance.
Completely wrong. *I* don't need to do any of that to improve performance.
Actually the VM is well tuned for order-0 pages, and so seeing as I have
sane hardware, 4K pagecache works beautifully for me.
My point was this: fsblock does not preclude your using such measures to
fix the performance of your hardware that's broken with 4K pages. And it
would allow higher order allocations to fail gracefully.
> (There is no new mmap layer, the higher order pagecache is simply the old
> API with set_blocksize expanded).
Yes you add another layer in the userspace mapping code to handle higher
order pagecache.
> > Of course I wouldn't state that. On the contrary, I categorically state
> > that I have already solved it :)
>
> Well then I guess that you have not read the requirements...
I'm not talking about solving your problem of poor hardware. I'm talking
about supporting higher order block sizes in the kernel.
> > > Because it has already been rejected in another form and adds more
> >
> > You have rejected it. But they are bogus reasons, as I showed above.
>
> That's not me. I am working on this because many of the filesystem people
> have repeatedly asked me to do this. I am no expert on filesystems.
Yes it is you. You brought up reasons in this thread and I said why I thought
they were bogus. If you think you can now forget about my shooting them
down by saying you aren't an expert in filesystems, then you shouldn't have
brought them up in the first place. Either stand by your arguments or don't.
> > You also describe some other real (if lesser) issues like number of page
> > structs to manage in the pagecache. But this is hardly enough to reject
> > my patch now... for every downside you can point out in my approach, I
> > can point out one in yours.
> >
> > - fsblock doesn't require changes to virtual memory layer
>
> Therefore it is not a generic change but special to the block layer. So
> other subsystems still have to deal with the single page issues on
> their own.
Rubbish. fsblock doesn't touch a single line in the block layer.
> > > Maybe we could get to something like a hybrid that avoids some of these
> > > issues? Add support so something like a virtual compound page can be
> > > handled transparently in the filesystem layer with special casing if
> > > such a beast reaches the block layer?
> >
> > That's conceptually much worse, IMO.
>
> Why: It is the same approach that you use.
Again, rubbish.
> If it is barely ever used and
> satisfies your concern then I am fine with it.
Right below this line is where I told you I am _not_ fine with it.
> > And practically worse as well: vmap space is limited on 32-bit; fsblock
> > approach can avoid vmap completely in many cases; for two reasons.
> >
> > The fsblock data accessor APIs aren't _that_ bad changes. They change
> > zero conceptually in the filesystem, are arguably cleaner, and can be
> > essentially nooped if we wanted to stay with a b_data type approach
> > (but they give you that flexibility to replace it with any
> > implementation).
>
> The largeblock changes are generic.
Is that supposed to imply fsblock isn't generic? Or that the fsblock
API isn't a good one?
> They improve general handling of
> compound pages, they make the existing APIs work for large units of
> memory, they are not adding additional new API layers.
Neither does fsblock.
On Monday 17 September 2007 04:13, Mel Gorman wrote:
> On (15/09/07 14:14), Goswin von Brederlow didst pronounce:
> > I keep coming back to the fact that movable objects should be moved
> > out of the way for unmovable ones. Anything else just allows
> > fragmentation to build up.
>
> This is easily achieved, just really really expensive because of the
> amount of copying that would have to take place. It would also compel
> that min_free_kbytes be at least one free PAGEBLOCK_NR_PAGES and likely
> MIGRATE_TYPES * PAGEBLOCK_NR_PAGES to reduce excessive copying. That is
> a lot of free memory to keep around which is why fragmentation avoidance
> doesn't do it.
I don't know how it would prevent fragmentation from building up
anyway. It's commonly the case that potentially unmovable objects
are allowed to fill up all of ram (dentries, inodes, etc).
And of course, if you craft your exploit nicely with help from higher
ordered unmovable memory (eg. mm structs or unix sockets), then
you don't even need to fill all memory with unmovables before you
can have them take over all groups.
On Monday 17 September 2007 14:07, David Chinner wrote:
> On Fri, Sep 14, 2007 at 06:48:55AM +1000, Nick Piggin wrote:
> > OK, the vunmap batching code wipes your TLB flushing and IPIs off
> > the table. Diffstat below, but the TLB portions are here (besides that
> > _everything_ is probably lower due to less TLB misses caused by the
> > TLB flushing):
> >
> > -170 -99.4% sn2_send_IPI
> > -343 -100.0% sn_send_IPI_phys
> > -17911 -99.9% smp_call_function
> >
> > Total performance went up by 30% on a 64-way system (248 seconds to
> > 172 seconds to run parallel finds over different huge directories).
>
> Good start, Nick ;)
I didn't have the chance to test against a 16K directory block size to find
the "optimal" performance, but it is something I will do (I'm sure it will be
still a _lot_ faster than 172 seconds :)).
> > 23012 54790.5% _read_lock
> > 9427 329.0% __get_vm_area_node
> > 5792 0.0% __find_vm_area
> > 1590 53000.0% __vunmap
>
> ....
>
> _read_lock? I thought vmap() and vunmap() only took the vmlist_lock in
> write mode.....
Yeah, it is a slight change... the lazy vunmap only has to take it for read.
In practice, I'm not sure that it helps a great deal because everything else
still takes the lock for write. But that explains why it's popped up in the
profile.
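To make the vunmap batching mentioned above concrete: the idea, roughly, is
to queue freed ranges and amortise the global TLB flush (and the IPIs it
implies) over many vunmaps. A toy userspace model, with made-up names and a
made-up batch size, bearing no relation to the real vmap code:

#include <stdio.h>

#define LAZY_MAX 64	/* made-up batch size: flush after this many deferred unmaps */

struct lazy_area { unsigned long addr, size; };

static struct lazy_area pending[LAZY_MAX];
static int npending;
static int global_flushes;	/* counts the "expensive" flush + IPI events */

static void mock_flush_tlb_all(void) { global_flushes++; }

static void purge_lazy_areas(void)
{
	/* one global flush covers every queued area instead of one per vunmap */
	if (npending)
		mock_flush_tlb_all();
	npending = 0;
}

static void lazy_vunmap(unsigned long addr, unsigned long size)
{
	pending[npending].addr = addr;
	pending[npending].size = size;
	if (++npending == LAZY_MAX)
		purge_lazy_areas();
}

int main(void)
{
	int i;

	for (i = 0; i < 1000; i++)
		lazy_vunmap(0x100000UL + (unsigned long)i * 0x10000UL, 0x10000UL);
	purge_lazy_areas();	/* drain whatever is still queued */
	printf("1000 vunmaps -> %d global TLB flushes\n", global_flushes);
	return 0;
}

With a batch of 64, a thousand vunmaps collapse into 16 global flushes, which
is the shape of the saving shown in the profile numbers above.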
> > Next I have some patches to scale the vmap locks and data
> > structures better, but they're not quite ready yet. This looks like it
> > should result in a further speedup of several times when combined
> > with the TLB flushing reductions here...
>
> Sounds promising - when you have patches that work, send them my
> way and I'll run some tests on them.
Still away from home (for the next 2 weeks), so I'll be going a bit slow :P
I'm thinking about various scalable locking schemes and I'll definitely
ping you when I've made a bit of progress.
Thanks,
Nick
On Sun, 16 Sep 2007, Nick Piggin wrote:
> I don't know how it would prevent fragmentation from building up
> anyway. It's commonly the case that potentially unmovable objects
> are allowed to fill up all of ram (dentries, inodes, etc).
Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from
ZONE_MOVABLE and thus the memory that can be allocated for them is
limited.
On Sun, 16 Sep 2007, Jörn Engel wrote:
> I bet! My (false) assumption was the same as Goswin's. If non-movable
> pages are clearly separated from movable ones and will evict movable
> ones before polluting further mixed superpages, Nick's scenario would be
> nearly infinitely impossible.
>
> Assumption doesn't reflect current code. Enforcing this assumption
> would cost extra overhead. The amount of effort to make Christoph's
> approach work reliably seems substantial and I have no idea whether it
> would be worth it.
My approach is based on Mel's code and is already working the way you
describe. Page cache allocs are marked __GFP_MOVABLE by Mel's work.
On Sun, 16 Sep 2007, Nick Piggin wrote:
> > > fsblock doesn't need any of those hacks, of course.
> >
> > Nor does mine for the low orders that we are considering. For order >
> > MAX_ORDER this is unavoidable since the page allocator cannot manage such
> > large pages. It can be used for lower order if there are issues (that I
> > have not seen yet).
>
> Or we can just avoid all doubt (and not have arbitrary limitations
> according to what you think might be reasonable or how well the
> system actually behaves).
We can avoid all doubt in this patchset as well by adding support for
fallback to a vmalloced compound page.
On Mon, 17 Sep 2007, Bernd Schmidt wrote:
> Christoph Lameter wrote:
> > True. That is why we want to limit the number of unmovable allocations and
> > that is why ZONE_MOVABLE exists to limit those. However, unmovable
> > allocations are already rare today. The overwhelming majority of allocations
> > are movable and reclaimable. You can see that f.e. by looking at
> > /proc/meminfo and seeing how high SUnreclaim: is (it does not catch everything but
> > it's a good indicator).
>
> Just to inject another factor into the discussion, please remember that Linux
> also runs on nommu systems, where things like user space allocations are
> neither movable nor reclaimable.
Hmmm.... However, sorting the allocations would still result in avoiding
fragmentation to some degree?
On Sun, 16 Sep 2007, Nick Piggin wrote:
> > > So if you argue that vmap is a downside, then please tell me how you
> > > consider the -ENOMEM of your approach to be better?
> >
> > That is again pretty undifferentiated. Are we talking about low page
>
> In general.
There is no -ENOMEM approach. Lower order page allocation (<
PAGE_ALLOC_COSTLY_ORDER) will reclaim and in the worst case the OOM killer
will be activated. That is the nature of the failures that we saw early in
the year when this was first merged into mm.
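Roughly what that behaviour looks like, as a toy model (the threshold value,
the helper names and the stubbed reclaim/OOM functions are placeholders, not
the real allocator):

#include <stdbool.h>
#include <stdlib.h>

#define COSTLY_ORDER 3	/* placeholder threshold; "low" orders are below this */

/* Stubs standing in for the real allocator, reclaim and OOM paths. */
static void *try_direct_alloc(unsigned int order) { return malloc((size_t)4096 << order); }
static bool try_to_reclaim(void) { return true; }	/* pretend reclaim made progress */
static void oom_kill_something(void) { }		/* pretend a victim was killed */

static void *model_alloc_pages(unsigned int order)
{
	for (;;) {
		void *page = try_direct_alloc(order);
		if (page)
			return page;

		if (order >= COSTLY_ORDER)
			return NULL;		/* higher orders may simply fail */

		/* low orders: keep reclaiming; the worst case is the OOM
		 * killer, never a NULL return to the caller */
		if (!try_to_reclaim())
			oom_kill_something();
	}
}

int main(void)
{
	void *p = model_alloc_pages(0);	/* order-0: only returns with a page */
	free(p);
	return 0;
}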
> > With the ZONE_MOVABLE you can remove the unmovable objects into a defined
> > pool then higher order success rates become reasonable.
>
> OK, if you rely on reserve pools, then it is not 1st class support and hence
> it is a non-solution to VM and IO scalability problems.
ZONE_MOVABLE creates two memory pools in a machine. One of them is for
movable and one for unmovable. That is in 2.6.23. So 2.6.23 has no first
class support for order 0 pages?
> > > If, by special software layer, you mean the vmap/vunmap support in
> > > fsblock, let's see... that's probably all of a hundred or two lines.
> > > Contrast that with anti-fragmentation, lumpy reclaim, higher order
> > > pagecache and its new special mmap layer... Hmm, seems like a no
> > > brainer to me. You really still want to pursue the "extra layer"
> > > argument as a point against fsblock here?
> >
> > Yes sure. Your code could not live without these approaches. Without the
>
> Actually: your code is the one that relies on higher order allocations. Now
> you're trying to turn that into an argument against fsblock?
fsblock also needs contiguous pages in order to have a beneficial
effect that we seem to be looking for.
> > antifragmentation measures your fsblock code would not be very successful
> > in getting the larger contiguous segments you need to improve performance.
>
> Completely wrong. *I* don't need to do any of that to improve performance.
> Actually the VM is well tuned for order-0 pages, and so seeing as I have
> sane hardware, 4K pagecache works beautifully for me.
Sure the system works fine as is. Not sure why we would need fsblock then.
> > (There is no new mmap layer, the higher order pagecache is simply the old
> > API with set_blocksize expanded).
>
> Yes you add another layer in the userspace mapping code to handle higher
> order pagecache.
That would imply a new API or something? I do not see it.
> > Why: It is the same approach that you use.
>
> Again, rubbish.
Ok the logical conclusion from the above is that you think your approach
is rubbish.... Is there some way you could cool down a bit?
On (17/09/07 15:00), Christoph Lameter didst pronounce:
> On Sun, 16 Sep 2007, Nick Piggin wrote:
>
> > I don't know how it would prevent fragmentation from building up
> > anyway. It's commonly the case that potentially unmovable objects
> > are allowed to fill up all of ram (dentries, inodes, etc).
>
> Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from
> ZONE_MOVABLE and thus the memory that can be allocated for them is
> limited.
>
As Nick points out, having to configure something makes it a #2
solution. However, I at least am ok with that. ZONE_MOVABLE is a get-out
clause to be able to control fragmentation no matter what the workload is
as it gives hard guarantees. Even when ZONE_MOVABLE is replaced by some
mechanism in grouping pages by mobility to force a number of blocks to be
MIGRATE_MOVABLE_ONLY, the emergency option will exist.
We still lack data on what sort of workloads really benefit from large
blocks (assuming there are any that cannot also be solved by improving
order-0). With Christoph's approach + grouping pages by mobility +
ZONE_MOVABLE-if-it-screws-up, people can start collecting that data over the
course of the next few months while we're waiting for fsblock or software
pagesize to mature.
Do we really need to keep discussing this as no new point has been made in a
while? Can we at least take out the non-contentious parts of Christoph's
patches such as the page cache macros and do something with them?
--
Mel "tired of typing" Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Tue, 18 September 2007 11:00:40 +0100, Mel Gorman wrote:
>
> We still lack data on what sort of workloads really benefit from large
> blocks
Compressing filesystems like jffs2 and logfs gain better compression
ratio with larger blocks. Going from 4KiB to 64KiB gave somewhere
around 10% benefit iirc. Testdata was a 128MiB qemu root filesystem.
Granted, the same could be achieved by adding some extra code and a few
bounce buffers to the filesystem. How such a hack would perform I'd
prefer not to find out, though. :)
Jörn
--
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface.
-- Doug MacIlroy
On Tue, Sep 18, 2007 at 11:00:40AM +0100, Mel Gorman wrote:
> We still lack data on what sort of workloads really benefit from large
> blocks (assuming there are any that cannot also be solved by improving
> order-0).
No we don't. All workloads benefit from larger block sizes when
you've got a btree tracking 20 million inodes and a create has to
search that tree for a free inode. The tree gets much wider and
hence we take fewer disk seeks to traverse the tree. Same for large
directories, btree's tracking free space, etc - everything goes
faster with a larger filesystem block size because we spend less
time doing metadata I/O.
And the other advantage is that sequential I/O speeds also tend to
increase with larger block sizes. e.g. XFS on an Altix (16k pages)
using 16k block size is about 20-25% faster on writes than 4k block
size. See the graphs at the top of page 12:
http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf
The benefits are really about scalability and with terabyte sized
disks on the market.....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Tuesday 18 September 2007 08:05, Christoph Lameter wrote:
> On Sun, 16 Sep 2007, Nick Piggin wrote:
> > > > fsblock doesn't need any of those hacks, of course.
> > >
> > > Nor does mine for the low orders that we are considering. For order >
> > > MAX_ORDER this is unavoidable since the page allocator cannot manage
> > > such large pages. It can be used for lower order if there are issues
> > > (that I have not seen yet).
> >
> > Or we can just avoid all doubt (and not have arbitrary limitations
> > according to what you think might be reasonable or how well the
> > system actually behaves).
>
> We can avoid all doubt in this patchset as well by adding support for
> fallback to a vmalloced compound page.
How would you do a vmapped fallback in your patchset? How would
you keep track of pages 2..N if they don't exist in the radix tree?
What if they don't even exist in the kernel's linear mapping? It seems
you would also require more special casing in the fault path and special
casing in the block layer to do this.
It's not a trivial problem you can just brush away by handwaving. Let's
see... you could add another field in struct page to store the vmap
virtual address, and set a new flag to indicate that constituent
page N can be found via vmalloc_to_page(page->vaddr + N*PAGE_SIZE).
Then add more special casing to the block layer and fault path etc. to handle
these new non-contiguous compound pages. I guess you must have thought
about it much harder than the 2 minutes I just did then, so you must have a
much nicer solution...
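For what it's worth, here is roughly what that sketch looks like as a
standalone mock-up; the struct fields, the flag and the helpers are all
hypothetical, and in-kernel the per-constituent lookup would go through
vmalloc_to_page() on head->vaddr + N*PAGE_SIZE rather than the toy arithmetic
below:

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE	4096UL
#define NR_MOCK_PAGES	16
#define PG_vcompound	0x1u		/* hypothetical flag */

struct mock_page {
	unsigned int flags;
	void *vaddr;			/* hypothetical field: vmap base of the compound */
	int id;				/* stand-in for "which physical page is this" */
};

/* A fake vmap area: the range [vmap_base, vmap_base + N*PAGE_SIZE) is backed
 * by the (simulated, possibly non-contiguous) pages in vmap_pages[]. */
static char *vmap_base;
static struct mock_page vmap_pages[NR_MOCK_PAGES];

/* Stand-in for vmalloc_to_page(): virtual address -> backing page. */
static struct mock_page *mock_vmalloc_to_page(void *addr)
{
	return &vmap_pages[((char *)addr - vmap_base) / PAGE_SIZE];
}

/* The sketched lookup: constituent N of a vmap-backed compound head. */
static struct mock_page *vcompound_nth_page(struct mock_page *head, unsigned int n)
{
	if (head->flags & PG_vcompound)
		return mock_vmalloc_to_page((char *)head->vaddr + n * PAGE_SIZE);
	return head + n;	/* physically contiguous compound: plain offset */
}

int main(void)
{
	unsigned int i;

	vmap_base = malloc(NR_MOCK_PAGES * PAGE_SIZE);
	for (i = 0; i < NR_MOCK_PAGES; i++)
		vmap_pages[i].id = (int)i * 7;	/* arbitrary "scattered" page ids */

	/* head page of the virtual compound */
	vmap_pages[0].flags = PG_vcompound;
	vmap_pages[0].vaddr = vmap_base;

	printf("constituent 3 is mock page id %d\n",
	       vcompound_nth_page(&vmap_pages[0], 3)->id);
	free(vmap_base);
	return 0;
}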
But even so, you're still trying very hard to avoid touching the filesystems
or buffer layer while advocating instead to squeeze the complexity out into
the vm and block layer. I don't agree that is the right thing to do. Sure it
is _easier_, because we know the VM.
I don't argue that fsblock large block support is trivial. But you are first
asserting that it is too complicated and then trying to address one of the
issues it solves by introducing complexity elsewhere.
On Tuesday 18 September 2007 08:00, Christoph Lameter wrote:
> On Sun, 16 Sep 2007, Nick Piggin wrote:
> > I don't know how it would prevent fragmentation from building up
> > anyway. It's commonly the case that potentially unmovable objects
> > are allowed to fill up all of ram (dentries, inodes, etc).
>
> Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from
> ZONE_MOVABLE and thus the memory that can be allocated for them is
> limited.
Why would ZONE_MOVABLE require that "movable objects should be moved
out of the way for unmovable ones"? It never _has_ any unmovable objects in
it. Quite obviously we were not talking about reserve zones.
On Tuesday 18 September 2007 08:21, Christoph Lameter wrote:
> On Sun, 16 Sep 2007, Nick Piggin wrote:
> > > > So if you argue that vmap is a downside, then please tell me how you
> > > > consider the -ENOMEM of your approach to be better?
> > >
> > > That is again pretty undifferentiated. Are we talking about low page
> >
> > In general.
>
> There is no -ENOMEM approach. Lower order page allocation (<
> PAGE_ALLOC_COSTLY_ORDER) will reclaim and in the worst case the OOM killer
> will be activated.
ROFL! Yeah of course, how could I have forgotten about our trusty OOM killer
as the solution to the fragmentation problem? It would only have been funnier
if you had said to reboot every so often when memory gets fragmented :)
> That is the nature of the failures that we saw early in
> the year when this was first merged into mm.
>
> > > With the ZONE_MOVABLE you can remove the unmovable objects into a
> > > defined pool then higher order success rates become reasonable.
> >
> > OK, if you rely on reserve pools, then it is not 1st class support and
> > hence it is a non-solution to VM and IO scalability problems.
>
> ZONE_MOVABLE creates two memory pools in a machine. One of them is for
> movable and one for unmovable. That is in 2.6.23. So 2.6.23 has no first
> class support for order 0 pages?
What?
> > > > If, by special software layer, you mean the vmap/vunmap support in
> > > > fsblock, let's see... that's probably all of a hundred or two lines.
> > > > Contrast that with anti-fragmentation, lumpy reclaim, higher order
> > > > pagecache and its new special mmap layer... Hmm, seems like a no
> > > > brainer to me. You really still want to pursue the "extra layer"
> > > > argument as a point against fsblock here?
> > >
> > > Yes sure. Your code could not live without these approaches. Without the
> >
> > Actually: your code is the one that relies on higher order allocations.
> > Now you're trying to turn that into an argument against fsblock?
>
> fsblock also needs contiguous pages in order to have a beneficial
> effect that we seem to be looking for.
Keyword: relies.
> > > antifragmentation measures your fsblock code would not be very
> > > successful in getting the larger contiguous segments you need to
> > > improve performance.
> >
> > Completely wrong. *I* don't need to do any of that to improve performance.
> > Actually the VM is well tuned for order-0 pages, and so seeing as I have
> > sane hardware, 4K pagecache works beautifully for me.
>
> Sure the system works fine as is. Not sure why we would need fsblock then.
Large block filesystem.
> > > (There is no new mmap layer, the higher order pagecache is simply the
> > > old API with set_blocksize expanded).
> >
> > Yes you add another layer in the userspace mapping code to handle higher
> > order pagecache.
>
> That would imply a new API or something? I do not see it.
I was not implying a new API.
> > > Why: It is the same approach that you use.
> >
> > Again, rubbish.
>
> Ok the logical conclusion from the above is that you think your approach
> is rubbish....
The logical conclusion is that _they are not the same approach_!
> Is there some way you could cool down a bit?
I'm not upset, but what you were saying was rubbish, plain and simple. The
amount of times we've gone in circles, I most likely have already explained
this, several times, in a more polite manner.
And I know you're more than capable of understanding at least the concept
behind fsblock, even without time to work through the exact details. What
are you expecting me to say, after all this back and forth, when you come
up with things like "[fsblock] is not a generic change but special to the
block layer", and then claim that fsblock is the same as allocating "virtual
compound pages" with vmalloc as a fallback for higher order allocs.
What I will say is that fsblock still has a relatively long way to go, so
maybe that's your reason for not looking at it. And yes, when fsblock is
in a better state to actually perform useful comparisons with, that will be a
much better time to have these debates. But in that case, just say so :)
then I can go away and do more constructive work on it instead of filling
people's inboxes.
I believe the fsblock approach is the best one, but it's not without problems
and complexities, so I'm quite ready for it to be proven incorrect, not
performant, or otherwise rejected.
I'm going on holiday for 2 weeks. I'll try to stay away from email, and
particularly this thread.
On Tue, 18 Sep 2007, Nick Piggin wrote:
>
> ROFL! Yeah of course, how could I have forgotten about our trusty OOM killer
> as the solution to the fragmentation problem? It would only have been funnier
> if you had said to reboot every so often when memory gets fragmented :)
Can we please stop this *idiotic* thread.
Nick, you and some others seem to be arguing based on a totally flawed
base, namely:
- we can guarantee anything at all in the VM
- we even care about the 16kB blocksize
- second-class citizenry is "bad"
The fact is, *none* of those things are true. The VM doesn't guarantee
anything, and is already very much about statistics in many places. You
seem to be arguing as if Christoph was introducing something new and
unacceptable, when it's largely just more of the same.
And the fact is, nobody but SGI customers would ever want the 16kB
blocksize. IOW - NONE OF THIS MATTERS!
Can you guys stop this inane thread already, or at least take it private
between you guys, instead of forcing everybody else to listen in on your
flamefest.
Linus
On Tue, Sep 18, 2007 at 11:30:17AM -0700, Linus Torvalds wrote:
> The fact is, *none* of those things are true. The VM doesn't guarantee
> anything, and is already very much about statistics in many places. You
Many? I can't recall anything besides PF_MEMALLOC and the decision
that the VM is oom. Those are the only two gray areas... the safety
margin is large enough that nobody notices the lack of a black-and-white
solution.
So instead of working to provide guarantees for the above two gray
spots, we're making everything weaker; that's the wrong direction as
far as I can tell, especially if we're going to mess up the common
code big time in a backwards way only for those few users of those few
I/O devices out there.
In general, every time reliability has a lower priority than performance,
I have a hard time enjoying it.
On Mon, Sep 17, 2007 at 12:56:07AM +0200, Goswin von Brederlow wrote:
> When has free ever given any useful "free" number? I can perfectly
> well allocate another gigabyte of memory despite free saying 25MB. But
> that is because I know that the buffer/cached are not locked in.
Well, as you said you know that buffer/cached are not locked in. If
/proc/meminfo would be rubbish like you seem to imply in the first
line, why would we ever bother to export that information and even
waste time writing a binary that parse it for admins?
> On the other hand 1GB can instantly vanish when I start a Xen domain
> and anything relying on the free value would lose.
Actually you'd better check meminfo or free before starting 1G of Xen!!
> The only sensible thing for an application concerned with swapping is
> to watch the swapping and then reduce itself. Not the amount
> free. Although I wish there were some kernel interface to get a
> pressure value of how valuable free pages would be right now. I would
> like that for fuse so a userspace filesystem can do caching without
> crippling the kernel.
Repeated drop caches + free can help.
On Tue, 18 Sep 2007, Andrea Arcangeli wrote:
>
> Many? I can't recall anything besides PF_MEMALLOC and the decision
> that the VM is oom.
*All* of the buddy bitmaps, *all* of the GFP_ATOMIC, *all* of the zone
watermarks, everything that we depend on every single day, is in the end
just about statistically workable.
We do 1- and 2-order allocations all the time, and we "know" they work.
Yet Nick (and this whole *idiotic* thread) has all been about how they
cannot work.
> In general, every time reliability has a lower priority than performance,
> I have a hard time enjoying it.
This is not about performance. Never has been. It's about SGI wanting a
way out of their current 16kB mess.
The way to fix performance is to move to x86-64, and use 4kB pages and be
happy. However, the SGI people want a 16kB (and possibly bigger)
crap-option for their people who are (often _already_) running some
special case situation that nobody else cares about.
It's not about "performance". If it was, they would never have used ia64
in the first place. It's about special-case users that do odd things.
Nobody sane would *ever* argue for 16kB+ blocksizes in general.
Linus
PS. Yes, I realize that there's a lot of insane people out there. However,
we generally don't do kernel design decisions based on them. But we can
pat the insane users on the head and say "we won't guarantee it works, but
if you eat your prozac, and don't bother us, go do your stupid things".
On Tue, 18 Sep 2007, Nick Piggin wrote:
> On Tuesday 18 September 2007 08:00, Christoph Lameter wrote:
> > On Sun, 16 Sep 2007, Nick Piggin wrote:
> > > I don't know how it would prevent fragmentation from building up
> > > anyway. It's commonly the case that potentially unmovable objects
> > > are allowed to fill up all of ram (dentries, inodes, etc).
> >
> > Not in 2.6.23 with ZONE_MOVABLE. Unmovable objects are not allocated from
> > ZONE_MOVABLE and thus the memory that can be allocated for them is
> > limited.
>
> Why would ZONE_MOVABLE require that "movable objects should be moved
> out of the way for unmovable ones"? It never _has_ any unmovable objects in
> it. Quite obviously we were not talking about reserve zones.
This was a response to your statement that all of memory could be filled up
by unmovable objects. Which cannot occur if the memory for unmovable objects is
limited. Not sure what you mean by reserves? Mel's reserves? The reserves
for unmovable objects established by ZONE_MOVABLE?
On Tue, 18 Sep 2007, Nick Piggin wrote:
> > We can avoid all doubt in this patchset as well by adding support for
> > fallback to a vmalloced compound page.
>
> How would you do a vmapped fallback in your patchset? How would
> you keep track of pages 2..N if they don't exist in the radix tree?
Through the vmalloc structures and through the conventions established for
compound pages?
> What if they don't even exist in the kernel's linear mapping? It seems
> you would also require more special casing in the fault path and special
> casing in the block layer to do this.
Well yeah there is some sucky part about vmapping things (same as in yours,
possibly more in mine since it's general and not specific to the page
cache). On the other hand a generic vcompound fallback will allow us to
use the page allocator in many places where we currently have to use
vmalloc because the allocations are too big. It will allow us to get rid
of most of the vmalloc uses and thereby reduce TLB pressure somewhat.
The vcompound patchset is almost ready..... Maybe bits and pieces may
even help fsblock.
On Wed, 19 Sep 2007, Nathan Scott wrote:
>
> FWIW (and I hate to let reality get in the way of a good conspiracy) -
> all SGI systems have always defaulted to using 4K blocksize filesystems;
Yes. And I've been told that:
> there's very few customers who would use larger
.. who apparently would like to move to x86-64. That was what people
implied at the kernel summit.
> especially as the Linux
> kernel limitations in this area are well known. There's no "16K mess"
> that SGI is trying to clean up here (and SGI have offered both IA64 and
> x86_64 systems for some time now, so not sure how you came up with that
> whacko theory).
Well, if that is the case, then I vote that we drop the whole patch-series
entirely. It clearly has no reason for existing at all.
There is *no* valid reason for 16kB blocksizes unless you have legacy
issues. The performance issues have nothing to do with the block-size, and
should be solvable by just making sure that your stupid "state of the art"
crap SCSI controller gets contiguous physical memory, which is best done
in the read-ahead code.
So get your stories straight, people.
Linus
On Tue, 2007-09-18 at 12:44 -0700, Linus Torvalds wrote:
> This is not about performance. Never has been. It's about SGI wanting a
> way out of their current 16kB mess.
Pass the crack pipe, Linus?
> The way to fix performance is to move to x86-64, and use 4kB pages and be
> happy. However, the SGI people want a 16kB (and possibly bigger)
> crap-option for their people who are (often _already_) running some
> special case situation that nobody else cares about.
FWIW (and I hate to let reality get in the way of a good conspiracy) -
all SGI systems have always defaulted to using 4K blocksize filesystems;
there's very few customers who would use larger, especially as the Linux
kernel limitations in this area are well known. There's no "16K mess"
that SGI is trying to clean up here (and SGI have offered both IA64 and
x86_64 systems for some time now, so not sure how you came up with that
whacko theory).
> It's not about "performance". If it was, they would never have used ia64
For SGI it really is about optimising ondisk layouts for some workloads
and large filesystems, and has nothing to do with IA64. Read the paper
Dave sent out earlier, it's quite interesting.
For other people, like AntonA, who has also been asking for this
functionality literally for years (and ended up trying to do his own
thing inside NTFS IIRC) it's to be able to access existing filesystems
from other operating systems. Here's a more recent discussion, I know
Anton had discussed it several times on fsdevel before this 2005 post
too: http://oss.sgi.com/archives/xfs/2005-01/msg00126.html
Although I'm sure others exist, I've never worked on any platform other
than Linux that doesn't support filesystem block sizes larger than the
pagesize. It's one thing to stick your head in the sand about the need
for this feature, it's another thing entirely to try to pass it off as an
"SGI mess", sorry.
I do entirely support the sentiment to stop this pissing match and get
on with fixing the problem though.
cheers.
--
Nathan
On Tue, 2007-09-18 at 18:06 -0700, Linus Torvalds wrote:
> There is *no* valid reason for 16kB blocksizes unless you have legacy
> issues.
That's not correct.
> The performance issues have nothing to do with the block-size, and
We must be thinking of different performance issues.
> should be solvable by just making sure that your stupid "state of the
> art"
> crap SCSI controller gets contiguous physical memory, which is best
> done
> in the read-ahead code.
SCSI controllers have nothing to do with improving ondisk layout, which
is the performance issue I've been referring to.
cheers.
--
Nathan
On 09/18/2007 09:44 PM, Linus Torvalds wrote:
> Nobody sane would *ever* argue for 16kB+ blocksizes in general.
Well, not so sure about that. What if one of your expected uses for example
is video data storage -- lots of data, especially for multiple streams, and
needs still relatively fast machinery. Why would you care for the overhead
of _small_ blocks?
Okay, maybe that's covered in the "in general" but it's not extremely oddball
either...
Rene.
On Wed, 19 Sep 2007, Rene Herman wrote:
>
> Well, not so sure about that. What if one of your expected uses for example is
> video data storage -- lots of data, especially for multiple streams, and needs
> still relatively fast machinery. Why would you care for the overhead of
> _small_ blocks?
.. so work with an extent-based filesystem instead.
16k blocks are total idiocy. If this wasn't about "supporting legacy
customers", I think the whole patch-series has been a total waste of time.
Linus
On 09/19/2007 05:50 AM, Linus Torvalds wrote:
> On Wed, 19 Sep 2007, Rene Herman wrote:
>> Well, not so sure about that. What if one of your expected uses for example is
>> video data storage -- lots of data, especially for multiple streams, and needs
>> still relatively fast machinery. Why would you care for the overhead of
>> _small_ blocks?
>
> .. so work with an extent-based filesystem instead.
>
> 16k blocks are total idiocy. If this wasn't about "supporting legacy
> customers", I think the whole patch-series has been a total waste of time.
Admittedly, extent-based might not be a particularly bad answer at least to
the I/O side of the equation...
I do feel larger blocksizes continue to make sense in general though. Packet
writing on CD/DVD is a problem already today since the hardware needs 32K or
64K blocks and I'd expect to see more of these and similar situations when
flash gets (even) more popular which it sort of inevitably is going to be.
Rene.
On Wed, 19 Sep 2007, Rene Herman wrote:
>
> I do feel larger blocksizes continue to make sense in general though. Packet
> writing on CD/DVD is a problem already today since the hardware needs 32K or
> 64K blocks and I'd expect to see more of these and similar situations when
> flash gets (even) more popular which it sort of inevitably is going to be.
.. that's what scatter-gather exists for.
What's so hard with just realizing that physical memory isn't contiguous?
It's why we have MMU's. It's why we have scatter-gather.
Linus
On 09/19/2007 06:33 AM, Linus Torvalds wrote:
> On Wed, 19 Sep 2007, Rene Herman wrote:
>> I do feel larger blocksizes continue to make sense in general though. Packet
>> writing on CD/DVD is a problem already today since the hardware needs 32K or
>> 64K blocks and I'd expect to see more of these and similar situations when
>> flash gets (even) more popular which it sort of inevitably is going to be.
>
> .. that's what scatter-gather exists for.
>
> What's so hard with just realizing that physical memory isn't contiguous?
>
> It's why we have MMU's. It's why we have scatter-gather.
So if I understood that right, you'd suggest dealing with devices with
larger physical blocksizes at some level above the current block layer.
Not familiar enough with either block or fs to be able to argue that
effectively...
Rene.
On Tue, Sep 18, 2007 at 06:06:52PM -0700, Linus Torvalds wrote:
> > especially as the Linux
> > kernel limitations in this area are well known. There's no "16K mess"
> > that SGI is trying to clean up here (and SGI have offered both IA64 and
> > x86_64 systems for some time now, so not sure how you came up with that
> > whacko theory).
>
> Well, if that is the case, then I vote that we drop the whole patch-series
> entirely. It clearly has no reason for existing at all.
>
> There is *no* valid reason for 16kB blocksizes unless you have legacy
> issues.
Ok, let's step back for a moment and look at a basic, fundamental
constraint of disks - seek capacity. A decade ago, a terabyte of
filesystem had 30 disks behind it - a seek capacity of about
6000 seeks/s. Nowadays, that's a single disk with a seek
capacity of about 200/s. We're going *rapidly* backwards in
terms of seek capacity per terabyte of storage.
Now fill that terabyte of storage and index it in the most efficient
way - let's say btrees are used because lots of filesystems use
them. Hence the depth of the tree is roughly O((log n)/m) where m is
a factor of the btree block size. Effectively, btree depth = seek
count on lookup of any object.
When the filesystem had a capacity of 6,000 seeks/s, we didn't
really care if the indexes used 4k blocks or not - the storage
subsystem had an excess of seek capacity to deal with
less-than-optimal indexing. Now we have over an order of magnitude
less seeks to expend in index operations *for the same amount of
data* so we are really starting to care about minimising the
number of seeks in our indexing mechanisms and allocations.
We can play tricks in index compaction to reduce the number of
interior nodes of the tree (like hashed indexing in XFS and ext3
htree directories) but that still only gets us so far in reducing
seeks and doesn't help at all for tree traversals. That leaves us
with the btree block size as the only factor we can further vary to
reduce the depth of the tree. i.e. "m".
So we want to increase the filesystem block size to improve the
efficiency of our indexing. That improvement in efficiency
translates directly into better performance on seek constrained
storage subsystems.
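As a back-of-envelope illustration of the depth/block-size relationship (the
16-byte entry size is an arbitrary assumption, and real btrees have block
headers and partially filled nodes, so treat the numbers as order-of-magnitude
only):

#include <math.h>
#include <stdio.h>

/* Crude model: fanout ~ block_size / entry_size,
 * depth ~ ceil(log(n) / log(fanout)). */
static unsigned int depth(double entries, double fanout)
{
	return (unsigned int)ceil(log(entries) / log(fanout));
}

int main(void)
{
	const double entries = 20e6;	/* the ~20 million inode example above */
	const double entry_size = 16.0;	/* assumed bytes per key + pointer */
	const int block_sizes[] = { 4096, 16384, 65536 };
	size_t i;

	for (i = 0; i < sizeof(block_sizes) / sizeof(block_sizes[0]); i++) {
		double fanout = block_sizes[i] / entry_size;
		printf("block %6d: fanout ~%5.0f, depth (seeks per lookup) ~%u\n",
		       block_sizes[i], fanout, depth(entries, fanout));
	}
	return 0;
}

With those (made up) numbers, 20 million entries take about 4 levels at a 4k
block size and about 3 at 16k or 64k, and the gap widens as the tree grows.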
The problem is this: to alter the fundamental block size of the
filesystem we also need to alter the data block size and that is
exactly the piece that linux does not support right now. So while
we have the capability to use large block sizes in certain
filesystems, we can't use that capability until the data path
supports it.
To summarise, large block size support in the filesystem is not
about "legacy" issues. It's about trying to cope with the rapid
expansion of storage capabilities of modern hardware where we have
to index much, much more data with a corresponding decrease in
the seek capability of the hardware.
> So get your stories straight, people.
Ok, so let's set the record straight. There were 3 justifications
for using *large pages* to *support* large filesystem block sizes.
The justifications for the variable order page cache with large
pages were:
  1. little code change needed in the filesystems
     -> still true
  2. Increased I/O sizes on 4k page machines (the "SCSI
     controller problem")
     -> redundant thanks to Jens Axboe's quick work
  3. avoiding the need for vmap() as it has great
     overhead and does not scale
     -> Nick is starting to work on that and has
        already had good results.
Everyone seems to be focussing on #2 as the entire justification for
large block sizes in filesystems and that this is an "SGI" problem.
Nothing could be further from the truth - the truth is that large
pages solved multiple problems in one go. We now have a different,
better solution for #2, so please, please stop using that as some
justification for claiming filesystems don't need large block sizes.
However, all this doesn't change the fact that we have a major storage
scalability crunch coming in the next few years. Disk capacity is
likely to continue to double every 12 months for the next 3 or 4
years. Large block size support is only one mechanism we need to
help cope with this trend.
The variable order page cache with large pages was a means to an end
- it's not the only solution to this problem and I'm extremely happy
to see that there is progress on multiple fronts. That's the
strength of the Linux community showing through. In the end, I
really don't care how we end up supporting large filesystem block
sizes in the page cache - all I care about is that we end up
supporting it as efficiently and generically as we possibly can.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On 9/19/07, David Chinner <[email protected]> wrote:
> The problem is this: to alter the fundamental block size of the
> filesystem we also need to alter the data block size and that is
> exactly the piece that linux does not support right now. So while
> we have the capability to use large block sizes in certain
> filesystems, we can't use that capability until the data path
> supports it.
it's much simpler to teach the fs to understand multipage data (like
multipage bitmap scan, multipage extent search, etc) than to deal with mm
fragmentation, IMHO. At the same time you don't waste IO traffic on
unused space.
--
thanks, Alex
On Wednesday 19 September 2007 04:30, Linus Torvalds wrote:
> On Tue, 18 Sep 2007, Nick Piggin wrote:
> > ROFL! Yeah of course, how could I have forgotten about our trusty OOM
> > killer as the solution to the fragmentation problem? It would only have
> > been funnier if you had said to reboot every so often when memory gets
> > fragmented :)
>
> Can we please stop this *idiotic* thread.
>
> Nick, you and some others seem to be arguing based on a totally flawed
> base, namely:
> - we can guarantee anything at all in the VM
> - we even care about the 16kB blocksize
> - second-class citizenry is "bad"
>
> The fact is, *none* of those things are true. The VM doesn't guarantee
> anything, and is already very much about statistics in many places. You
> seem to be arguing as if Christoph was introducing something new and
> unacceptable, when it's largely just more of the same.
I will stop this idiotic thread.
However, at the VM and/or VM/FS meetings we had, I was happy enough
for this thing of Christoph's to get merged. Actually I didn't even care
if it had mmap support, so long as it solved their problem.
But a solution to the general problem of VM and IO scalability, it is not.
IMO.
> And the fact is, nobody but SGI customers would ever want the 16kB
> blocksize. IOW - NONE OF THIS MATTERS!
Maybe. Maybe not.
> Can you guys stop this inane thread already, or at least take it private
> between you guys, instead of forcing everybody else to listen in on your
> flamefest.
Will do. Sorry.
On Wed, Sep 19, 2007 at 03:09:10PM +1000, David Chinner wrote:
> Ok, let's step back for a moment and look at a basic, fundamental
> constraint of disks - seek capacity. A decade ago, a terabyte of
> filesystem had 30 disks behind it - a seek capacity of about
> 6000 seeks/s. Nowadays, that's a single disk with a seek
> capacity of about 200/s. We're going *rapidly* backwards in
> terms of seek capacity per terabyte of storage.
>
> Now fill that terabyte of storage and index it in the most efficient
> way - let's say btrees are used because lots of filesystems use
> them. Hence the depth of the tree is roughly O((log n)/m) where m is
> a factor of the btree block size. Effectively, btree depth = seek
> count on lookup of any object.
I agree. btrees will clearly benefit if the nodes are larger. We have an
excess of disk capacity and a huge gap between seek and contiguous
bandwidth.
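To put rough numbers on that, here is a toy depth estimate assuming a
made-up 16-byte index record, full fan-out and about 10^9 objects per
terabyte; the figures only show the trend, they are not taken from any
real filesystem:

#include <math.h>
#include <stdio.h>

/* Toy estimate of btree depth: keys per node = blocksize / record size.
 * The 16-byte record and the 10^9 key count are invented for
 * illustration only. */
static unsigned int btree_depth(double nkeys, double blocksize)
{
	double fanout = blocksize / 16;

	return (unsigned int)ceil(log(nkeys) / log(fanout));
}

int main(void)
{
	printf("4k nodes:  depth %u\n", btree_depth(1e9, 4096));   /* -> 4 */
	printf("64k nodes: depth %u\n", btree_depth(1e9, 65536));  /* -> 3 */
	return 0;
}

Fewer levels means fewer seeks per lookup, which is the whole point.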
You don't need largepages for this, fsblocks is enough.
Largepages, for you, are a further improvement that cuts the number of SG
entries and potentially reduces cpu utilization a bit (not by much
though: only the pagecache works with largepages, and especially with
small-sized random I/O you'll be taking the radix tree lock the same
number of times...).
Plus of course you don't like fsblock because it requires work to
adapt a fs to it, I can't argue about that.
> Ok, so let's set the record straight. There were 3 justifications
> for using *large pages* to *support* large filesystem block sizes
> The justifications for the variable order page cache with large
> pages were:
>
> 1. little code change needed in the filesystems
> -> still true
Disagree, the mmap side is not a little change. If you do it just for
the non-mmapped I/O that truly is a hack, but then frankly I would
prefer only the read/write hack (without mmap) so it will not conflict
heavily with my stuff and it'll be quicker to nuke it out of the
kernel later.
> 3. avoiding the need for vmap() as it has great
> overhead and does not scale
> -> Nick is starting to work on that and has
> already had good results.
Frankly I don't follow this vmap thing. Can you elaborate? Is this
about allowing the blkdev pagecache for metadata to go into highmem?
Is that the kmap thing? I think we can stick to a direct-mapped b_data
and avoid all the overhead of converting a struct page to a virtual
address. It takes the same 64bit size anyway in ram and we avoid one
layer of indirection and many modifications. If we wanted to switch to
kmap for the blkdev pagecache we should have done it years ago; now
it's far too late to worry about it.
> Everyone seems to be focussing on #2 as the entire justification for
> large block sizes in filesystems and that this is an "SGI" problem.
I agree it's not an SGI problem and this is why I want a design that
has a _slight chance_ to improve performance on x86-64 too. If the
variable order page cache provides any further improvement on top
of fsblock it will only be because your I/O device isn't fast with small
sg entries.
For the I/O layout fsblock is more than enough, but I don't think
your variable order page cache will help in any significant way on
x86-64. Furthermore the complexity of handling page faults on largepages
is almost equivalent to the complexity of config-page-shift, but
config-page-shift gives you the whole range of cpu-saving benefits that
you can never remotely hope to achieve with the variable order page cache.
config-page-shift + fsblock IMHO is the way to go for x86-64, with one
additional 64k PAGE_SIZE rpm. config-page-shift will stack nicely on
top of fsblocks.
fsblock will provide the guarantee of "mounting" all fs anywhere no
matter which config-page-shift you selected at compile time, as well
as dvd writing. Then config-page-shift will provide the cpu
optimization on all fronts, not just for the pagecache I/O for the
large ram systems, without fragmentation issues and with 100%
reliability in the "free" numbers (not working by luck). That's all we
need as far as I can tell.
On Wed, Sep 19, 2007 at 04:04:30PM +0200, Andrea Arcangeli wrote:
> On Wed, Sep 19, 2007 at 03:09:10PM +1000, David Chinner wrote:
> > Ok, let's step back for a moment and look at a basic, fundamental
> > constraint of disks - seek capacity. A decade ago, a terabyte of
> > filesystem had 30 disks behind it - a seek capacity of about
> > 6000 seeks/s. Nowdays, that's a single disk with a seek
> > capacity of about 200/s. We're going *rapidly* backwards in
> > terms of seek capacity per terabyte of storage.
> >
> > Now fill that terabyte of storage and index it in the most efficient
> > way - let's say btrees are used because lots of filesystems use
> > them. Hence the depth of the tree is roughly O((log n)/m) where m is
> > a factor of the btree block size. Effectively, btree depth = seek
> > count on lookup of any object.
>
> I agree. btrees will clearly benefit if the nodes are larger. We've an
> excess of disk capacity and an huge gap between seeking and contiguous
> bandwidth.
>
> You don't need largepages for this, fsblocks is enough.
Sure, and that's what I meant when I said VPC + large pages was
a means to the end, not the only solution to the problem.
> Plus of course you don't like fsblock because it requires work to
> adapt a fs to it, I can't argue about that.
No, I don't like fsblock because it is inherently a "structure
per filesystem block" construct, just like buggerheads. You
still need to allocate millions of them when you have millions of
dirty pages around. Rather than type it all out again, read
the fsblocks thread from here:
http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2
FWIW, with Chris Mason's extent-based block mapping (which btrfs
is using and Christoph Hellwig is porting XFS over to) we completely
remove buggerheads from XFS and so fsblock would be a pretty
major step backwards for us if Chris's work goes into mainline.
> > Ok, so let's set the record straight. There were 3 justifications
> > for using *large pages* to *support* large filesystem block sizes
> > The justifications for the variable order page cache with large
> > pages were:
> >
> > 1. little code change needed in the filesystems
> > -> still true
>
> Disagree, the mmap side is not a little change.
That's not in the filesystem, though. ;)
However, I agree that if you don't have mmap then it's not
worthwhile and the changes for VPC aren't trivial.
> > 3. avoiding the need for vmap() as it has great
> > overhead and does not scale
> > -> Nick is starting to work on that and has
> > already had good results.
>
> Frankly I don't follow this vmap thing. Can you elaborate?
We currently support metadata blocks larger than page size for
certain types of metadata in XFS. e.g. directory blocks.
This however, requires vmap()ing a bunch of individual,
non-contiguous pages out of a block device address space
in exactly the fashion that was proposed by Nick with fsblock
originally.
vmap() has severe scalability problems - read this subthread
of this discussion between Nick and myself:
http://lkml.org/lkml/2007/9/11/508
> > Everyone seems to be focussing on #2 as the entire justification for
> > large block sizes in filesystems and that this is an "SGI" problem.
>
> I agree it's not a SGI problem and this is why I want a design that
> has a _slight chance_ to improve performance on x86-64 too. If
> variable order page cache will provide any further improvement on top
> of fsblock will be only because your I/O device isn't fast with small
> sg entries.
<sigh>
There we go - back to the bloody I/O devices. Can ppl please stop
bringing this up because it *is not an issue any more*.
> config-page-shift + fsblock IMHO is the way to go for x86-64, with one
> additional 64k PAGE_SIZE rpm. config-page-shift will stack nicely on
> top of fsblocks.
Hmm - so you'll need page cache tail packing as well in that case
to prevent memory being wasted on small files. That means any way
we look at it (VPC+mmap or config-page-shift+fsblock+pctails)
we've got some non-trivial VM modifications to make.
If VPC can be separated from the large contiguous page requirement
(i.e. virtually mapped compound page support), I still think it
comes out on top because it doesn't require every filesystem to be
modified and you can use standard pages where they are optimal
(i.e. on filesystems where block size <= PAGE_SIZE).
But, I'm not going to argue endlessly for one solution or another;
I'm happy to see different solutions being chased, so may the
best VM win ;)
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Thu, Sep 20, 2007 at 11:38:21AM +1000, David Chinner wrote:
> Sure, and that's what I meant when I said VPC + large pages was
> a means to the end, not the only solution to the problem.
The whole point is that it's not an end, it's an end for your own
fs-centric view only (which is fair enough, sure), but I watch the whole
VM, not just the pagecache...
The same way the fs-centric view hopes to get this little bit of
further optimization from largepages to reach "the end", my VM-wide
view wants the same little bit of optimization for *everything*,
including tmpfs and anonymous memory, slab etc.! This is clearly why
config-page-shift is better...
If you're ok not being on the edge and you want a generic rpm image
that runs quite optimally for any workload, then 4k+fsblock is just
fine of course. But if we go on the edge we should aim for the _very_
end for the whole VM, not just for "the end of the pagecache on
certain files". Especially when the complexity involved in the mmap
code is similar, and it will conflict heavily if we merge this
not-very-end solution that only reaches "the end" for the pagecache.
> No, I don't like fsblock because it is inherently a "struture
> per filesystem block" construct, just like buggerheads. You
> still need to allocate millions of them when you have millions
> dirty pages around. Rather than type it all out again, read
> the fsblocks thread from here:
>
> http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2
Thanks for the pointer!
> FWIW, with Chris mason's extent-based block mapping (which btrfs
> is using and Christoph Hellwig is porting XFS over to) we completely
> remove buggerheads from XFS and so fsblock would be a pretty
> major step backwards for us if Chris's work goes into mainline.
I tend to agree: if we go that route, fsblock should support extents,
if that's what you need on XFS to support range-locking etc... Whatever
happens in the vfs should please all existing fs without people needing
to go their own way again... Or replace fsblock with Chris's block
mapping. Frankly I haven't seen Chris's code so I cannot comment
further. But your complaints sound sensible. We certainly want to
avoid the lowlevel fs getting smarter than the vfs again. The smart
stuff should be in the vfs!
> That's not in the filesystem, though. ;)
>
> However, I agree that if you don't have mmap then it's not
> worthwhile and the changes for VPC aren't trivial.
Yep.
>
> > > 3. avoiding the need for vmap() as it has great
> > > overhead and does not scale
> > > -> Nick is starting to work on that and has
> > > already had good results.
> >
> > Frankly I don't follow this vmap thing. Can you elaborate?
>
> We current support metadata blocks larger than page size for
> certain types of metadata in XFS. e.g. directory blocks.
> This however, requires vmap()ing a bunch of individual,
> non-contiguous pages out of a block device address space
> in exactly the fashion that was proposed by Nick with fsblock
> originally.
>
> vmap() has severe scalability problems - read this subthread
> of this discussion between Nick and myself:
>
> http://lkml.org/lkml/2007/9/11/508
So the idea of vmap is that it's much simpler to have a virtually
contiguous address range for a large block than to find the right
b_data[index] once you exceed PAGE_SIZE...
The global tlb flush with IPIs would kill performance, so you can forget
any global mapping here. The only chance to do this would be like we
do with kmap_atomic per-cpu on highmem, with preempt_disable (for the
enjoyment of the rt folks out there ;). What's the problem with having
it per-cpu? Is this what fsblock already does? You'd just have to
allocate a new virtual range of size number-of-vmap-entries * blocksize
every time you mount a new fs. Then instead of calling kmap you call
vmap, and vunmap when you're finished. That should provide decent
performance, especially with physically indexed caches.
Anything more heavyweight than what I suggested is probably overkill,
even vmalloc_to_page.
> Hmm - so you'll need page cache tail packing as well in that case
> to prevent memory being wasted on small files. That means any way
> we look at it (VPC+mmap or config-page-shift+fsblock+pctails)
> we've got some non-trivial VM modifications to make.
Hmm no, the point of config-page-shift is that if you really need to
reach "the very end", you probably don't care about wasting some
memory, because either your workload can't fit in cache, or it fits in
cache regardless, or you're not wasting memory because you work with
large files...
The only point of this largepage stuff is to go an extra mile to save
a bit more cpu vs a strict vmap-based solution (fsblock of course
will be smart enough that if it notices PAGE_SIZE >= blocksize
it doesn't need to run any vmap at all and can just use the direct
mapping, so vmap boils down to a single branch checking the blocksize
variable; PAGE_SIZE is an immediate in the .text at compile time). But
if you care about that tiny bit of performance during I/O operations
(the variable order page cache only gives that tiny bit of performance
during read/write syscalls!!!), then it means you actually want to
save CPU _everywhere_, not just in read/write and while mangling
metadata in the lowlevel fs. And that's what config-page-shift should
provide...
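Something like the following, purely as an illustration of that single
branch (fsb_map_block is an invented name, not fsblock's actual code):

#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Illustration only: prefer the direct kernel mapping when the block
 * fits in one page, fall back to vmap() of the component pages when
 * the block spans several pages. */
static void *fsb_map_block(struct page **pages, unsigned int nr_pages,
			   unsigned int blocksize)
{
	if (blocksize <= PAGE_SIZE)
		return page_address(pages[0]);	/* no vmap needed */

	return vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
}

With PAGE_SIZE >= blocksize the vmap path is never taken at all.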
This is my whole argument for preferring config-page-shift+fsblock (or
whatever fsblock replacement, though Nick's design looked quite
sensible to me if integrated with extent-based locking, not
having seen Chris's yet). That's on top of the fact that config-page-shift
also has the benefit of providing guarantees for the meminfo levels,
and the fact that it doesn't strictly require defrag heuristics to
avoid hitting worst-case huge-ram-waste scenarios.
> But, I'm not going to argue endlessly for one solution or another;
> I'm happy to see different solutions being chased, so may the
> best VM win ;)
;)
On Thu, 20 Sep 2007, David Chinner wrote:
> > Disagree, the mmap side is not a little change.
>
> That's not in the filesystem, though. ;)
And it's really only a minimal change for some functions to loop over all
4k pages and elsewhere index the right 4k subpage.
On Thu, 20 Sep 2007, Andrea Arcangeli wrote:
> The only point of this largepage stuff is to go an extra mile to save
> a bit more of cpu vs a strict vmap based solution (fsblock of course
> will be smart enough that if it notices the PAGE_SIZE is >= blocksize
> it doesn't need to run any vmap at all and it can just use the direct
> mapping, so vmap translates in 1 branch only to check the blocksize
> variable, PAGE_SIZE is immediate in the .text at compile time). But if
Hmmm.. You are not keeping up with things? Heard of virtual compound
pages? They only require a vmap when the page allocator fails a
larger order allocation (which we have established is rare to the point
of nonexistence). The next rev of large blocksize will use that as the
fallback. Plus vcompounds can be used to get rid of most uses of vmalloc,
reducing the need for virtual mapping in general.
Largeblock is a general solution for managing large data sets using a
single page struct. See the original message that started this thread. It
can be used by various other subsystems, just as vcompound can.
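As a rough sketch of that fallback idea (the names here are invented,
and the real virtual compound code keeps track of the component pages
so they can be unmapped and freed again; this only shows the shape of
it):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Try a physically contiguous higher-order allocation first; only when
 * that fails (rare) fall back to a virtually contiguous area built
 * from order-0 pages. */
static void *vcompound_alloc(gfp_t gfp, unsigned int order)
{
	struct page *page = alloc_pages(gfp | __GFP_NOWARN, order);

	if (page)
		return page_address(page);	/* fast, common case */

	return vmalloc(PAGE_SIZE << order);	/* rare vmap fallback */
}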
On Thu, 20 Sep 2007, Christoph Lameter wrote:
> On Thu, 20 Sep 2007, David Chinner wrote:
> > > Disagree, the mmap side is not a little change.
> >
> > That's not in the filesystem, though. ;)
>
> And its really only a minimal change for some function to loop over all
> 4k pages and elsewhere index the right 4k subpage.
I agree with you on that: the changes you had to make to support mmap
were _much_ less bothersome than I'd been fearing, and I'm surprised
some people still see that side of it as a sticking point.
But I've kept very quiet because I remain quite ambivalent about the
patchset: I'm somewhere on the spectrum between you and Nick, shifting
my position from hour to hour. Don't expect any decisiveness from me.
In some senses I'm even further off the scale away from you: I'm
dubious even of Nick and Andrew's belief in opportunistic contiguity.
Just how hard should we be trying for contiguity? How far should we
go in sacrificing our earlier "LRU" principles? It's easy to bump
up PAGE_ALLOC_COSTLY_ORDER, but what price do we pay when we do?
I agree with those who want to see how the competing approaches
work out in practice: which is frustrating for you, yes, because
you are so close to ready. (I've not glanced at virtual compound,
but had been wondering in that direction before you suggested it.)
I do think your patchset is, for the time being at least, a nice
out-of-tree set, and it's grand to be able to bring a filesystem
from another arch with larger pagesize and get at the data from it.
I've found some fixes needed on top of your Large Blocksize Support
patches: I'll send those to you in a moment. Looks like you didn't
try much swapping!
I only managed to get ext2 working with larger blocksizes:
reiserfs -b 8192 wouldn't mount ("reiserfs_fill_super: can not find
reiserfs on /dev/sdb1"); ext3 gave me mysterious errors ("JBD: tar
wants too many credits", even after adding JBD patches that you
turned out to be depending on); and I didn't try ext4 or xfs
(I'm guessing the latter has been quite well tested ;)
Hugh
[email protected] (Mel Gorman) writes:
> On (16/09/07 23:31), Andrea Arcangeli didst pronounce:
>> On Sun, Sep 16, 2007 at 09:54:18PM +0100, Mel Gorman wrote:
>> Allocating ptes from slab is fairly simple but I think it would be
>> better to allocate ptes in PAGE_SIZE (64k) chunks and preallocate the
>> nearby ptes in the per-task local pagetable tree, to reduce the number
>> of locks taken and not to enter the slab at all for that.
>
> It runs the risk of pinning up to 60K of data per task that is unusable for
> any other purpose. On average, it'll be more like 32K but worth keeping
> in mind.
Two things, to each of you respectively.
Why should we try to stay out of the pte slab? Isn't the slab exactly
made for this kind of thing? To efficiently handle a large number of
equal-sized objects for quick allocation and deallocation? If it is a
locking problem then there should be a per-cpu cache of ptes. Say 0-32
ptes. If you run out you allocate 16 from slab. When you overflow you
free 16 (which would give you your 64k allocations, just in multiple
objects).
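The shape of such a per-cpu cache could be roughly this (a minimal
sketch only; the names are invented and the batched 16-at-a-time
refill/flush from slab described above is left out):

#include <linux/gfp.h>
#include <linux/percpu.h>

#define PTE_CACHE_HIGH	32	/* arbitrary: pte pages kept per cpu */

struct pte_cache {
	unsigned int nr;
	struct page *pages[PTE_CACHE_HIGH];
};

static DEFINE_PER_CPU(struct pte_cache, pte_cache);

static struct page *pte_page_alloc(void)
{
	struct pte_cache *pc = &get_cpu_var(pte_cache);
	struct page *page = NULL;

	if (pc->nr)
		page = pc->pages[--pc->nr];	/* lock-free fast path */
	put_cpu_var(pte_cache);

	if (!page)	/* cache empty: fall back to the page allocator */
		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
	return page;
}

static void pte_page_free(struct page *page)
{
	struct pte_cache *pc = &get_cpu_var(pte_cache);

	if (pc->nr < PTE_CACHE_HIGH)
		pc->pages[pc->nr++] = page;	/* keep it for the next fault */
	else
		__free_page(page);		/* cache full: really free it */
	put_cpu_var(pte_cache);
}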
As for the wastage: every page of ptes can map 2MB on amd64, 4MB on
i386, 8MB on sparc (?). A 64k pte chunk would be 32MB, 64MB and 32MB (?)
respectively. For the sbrk() and mmap() usage from glibc malloc() that
would be fine as they grow linearly, and the mmap() call in glibc could
be made to align to those chunks. But for a program like rtorrent
using mmap to bring in chunks of a 4GB file this looks disastrous.
>> Infact we
>> could allocate the 4 levels (or anyway more than one level) in one
>> single alloc_pages(0) and track the leftovers in the mm (or similar).
Personally I would really go with a per-cpu cache. When mapping a page,
reserve 4 tables. Then you walk the tree and add entries as
needed. And finally you release the 0-4 unused entries back to the cache.
MfG
Goswin
[email protected] (Mel Gorman) writes:
> On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
>> But when you already have say 10% of the ram in mixed groups then it
>> is a sign the external fragmentation happens and some time should be
>> spend on moving movable objects.
>>
>
> I'll play around with it on the side and see what sort of results I get.
> I won't be pushing anything any time soon in relation to this though.
> For now, I don't intend to fiddle more with grouping pages by mobility
> for something that may or may not be of benefit to a feature that hasn't
> been widely tested with what exists today.
I watched the videos you posted. A nice and quite clear improvement
with and without your logic. Kudos.
When you play around with it, may I suggest a change to the display of
the memory information. I think it would be valuable to use a Hilbert
curve to arrange the pages into pixels. Like this:
# # 0 3
# #
### 1 2
### ### 0 1 E F
# #
### ### 3 2 D C
# #
# ### # 4 7 8 B
# # # #
### ### 5 6 9 A
+-----------+-----------+
# ##### ##### # |00 03 04 05|3A 3B 3C 3F|
# # # # # # | | |
### ### ### ### |01 02 07 06|39 38 3D 3E|
# # | | |
### ### ### ### |0E 0D 08 09|36 37 32 31|
# # # # # # | | |
# ##### ##### # |0F 0C 0B 0A|35 34 33 30|
# # +-----+-----+ |
### ####### ### |10 11|1E 1F|20 21 2E 2F|
# # # # | | | |
### ### ### ### |13 12|1D 1C|23 22 2D 2C|
# # # # | +-----+ |
# ### # # ### # |14 17|18 1B|24 27 28 2B|
# # # # # # # # | | | |
### ### ### ### |15 16|19 1A|25 26 29 2A|
+-----+-----+-----------+
I've drawn in allocations for 16, 8, 4, 5, 32 pages in that order in
the last one. The idea is to place nearby pages visually near each other
in the output, grouped into an area instead of spread across lines.
Easier on the eye. It also manages to always draw aligned order(x)
blocks as squares or rectangles (for even or odd order respectively).
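For reference, the standard distance-to-coordinates routine for such a
curve is only a few lines; shown here purely to illustrate how page
numbers could be mapped to nearby pixels (n is the grid side length, a
power of two):

/* Map distance d along a Hilbert curve over an n*n grid (n a power of
 * two) to x/y coordinates.  This is the well-known d2xy algorithm. */
static void hilbert_d2xy(unsigned int n, unsigned int d,
			 unsigned int *x, unsigned int *y)
{
	unsigned int rx, ry, s, tmp, t = d;

	*x = *y = 0;
	for (s = 1; s < n; s *= 2) {
		rx = 1 & (t / 2);
		ry = 1 & (t ^ rx);
		if (ry == 0) {		/* rotate/flip the quadrant */
			if (rx == 1) {
				*x = s - 1 - *x;
				*y = s - 1 - *y;
			}
			tmp = *x; *x = *y; *y = tmp;
		}
		*x += s * rx;
		*y += s * ry;
		t /= 4;
	}
}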
>> Maybe instead of reserving one could say that you can have up to 6
>> groups of space
>
> And if the groups are 1GB in size? I tried something like this already.
> It didn't work out well at the time although I could revisit.
You adjust the group size along with the total number of groups. You
would not use 1GB huge pages on a 2GB ram system. You could try 2MB
groups. I think for most current systems we are lucky there: 2MB groups
fit the hardware support and give a large but not too large number of
groups to work with.
But you only need to stick to hardware-suitable group sizes for huge
tlb support, right? For better I/O and such you could have 512KB groups
if that size gives a reasonable total number of groups.
>> not used by unmovable objects before aggressive moving
>> starts. I don't quite see why you NEED reserving as long as there is
>> enough space free alltogether in case something needs moving.
>
> hence, increase min_free_kbytes.
Which is different from reserving a full group as it does not count
fragmented space as lost.
>> 1 group
>> worth of space free might be plenty to move stuff too. Note that all
>> the virtual pages can be stuffed in every little free space there is
>> and reassembled by the MMU. There is no space lost there.
>>
>
> What you suggest sounds similar to having a type MIGRATE_MIXED where you
> allocate from when the preferred lists are full. It became a sizing
> problem that never really worked out. As I said, I can try again.
Not really. I'm saying we should actively defragment mixed groups
during allocation, and always as little as possible, once a certain
level of external fragmentation is reached. A MIGRATE_MIXED sounds
like giving up completely if things get bad enough. Compare it to a
cheap network switch going into hub mode when its arp table runs full.
If you have ever had that happen then you know how bad it is.
>> But until one tries one can't say.
>>
>> MfG
>> Goswin
>>
>> PS: How do allocations pick groups?
>
> Using GFP flags to identify the type.
That is the type of group, not which one.
>> Could one use the oldest group
>> dedicated to each MIGRATE_TYPE?
>
> Age is difficult to determine so probably not.
Put the uptime as a sort key into each group header on creation or type
change. Then sort the partially used groups by that key. A heap will do
fine and be fast.
>> Or lowest address for unmovable and
>> highest address for movable? Something to better keep the two out of
>> each other way.
>
> We bias the location of unmovable and reclaimable allocations already. It's
> not done for movable because it wasn't necessary (as they are easily
> reclaimed or moved anyway).
Except that it is never done, so it doesn't count.
MfG
Goswin
[email protected] (Mel Gorman) writes:
> On (17/09/07 00:38), Goswin von Brederlow didst pronounce:
>> [email protected] (Mel Gorman) writes:
>>
>> > On (15/09/07 02:31), Goswin von Brederlow didst pronounce:
>> >> Mel Gorman <[email protected]> writes:
>> >> Looking at my
>> >> little test program evicting movable objects from a mixed group should
>> >> not be that expensive as it doesn't happen often.
>> >
>> > It happens regularly if the size of the block you need to keep clean is
>> > lower than min_free_kbytes. In the case of hugepages, that was always
>> > the case.
>>
>> That assumes that the number of groups allocated for unmovable objects
>> will continiously grow and shrink.
>
> They do grow and shrink. The number of pagetables in use changes for
> example.
By whole groups' worth? And do full groups get freed, unmixed, and then
filled with movable objects?
>> I'm assuming it will level off at
>> some size for long times (hours) under normal operations.
>
> It doesn't unless you assume the system remains in a steady state for it's
> lifetime. Things like updatedb tend to throw a spanner into the works.
Moved to a weekly cron job here. And even normally it is only once a
day. So what if it starts moving some pages while updatedb runs? If it
isn't too braindead it will reclaim some dentries updatedb has created
and left behind for good. It should just cause the dentry cache to be
smaller, at no cost. I'm not calling that normal operation. That is a
once-a-day special. What I don't want is to spend 1% of cpu time copying
pages. That would be unacceptable. Copying 1000 pages per updatedb run,
on the other hand, would be trivial.
>> There should
>> be some buffering of a few groups to be held back in reserve when it
>> shrinks to prevent the scenario that the size is just at a group
>> boundary and always grows/shrinks by 1 group.
>>
>
> And what size should this group be that all workloads function?
One is enough to prevent jittering. If you don't hold a group back and
you are exactly at a group boundary, then alternately allocating and
freeing one page would result in a group allocation and freeing every
time. With one group in reserve you only get a group allocation or
freeing when a group's worth of change has happened.
This assumes that changing the type and/or state of a group is
expensive: it takes time or locks or some such. Otherwise just let it
jitter.
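A toy model of that hysteresis, purely to make the point concrete (the
names and the 512-page group size are arbitrary, and real pageblock
conversion is of course more involved than this):

#define GROUP_PAGES 512	/* e.g. a 2MB group of 4k pages (arbitrary) */

struct group_pool {
	unsigned long used;	/* pages handed out */
	unsigned long groups;	/* groups currently owned */
};

/* Claim a new group only when completely out of pages. */
static void pool_alloc_page(struct group_pool *p)
{
	if (p->used == p->groups * GROUP_PAGES)
		p->groups++;		/* expensive: claim/convert a group */
	p->used++;
}

/* Release a group only once two whole groups' worth is free, so a
 * workload sitting right on a group boundary never flip-flops. */
static void pool_free_page(struct group_pool *p)
{
	p->used--;
	if (p->groups * GROUP_PAGES - p->used >= 2 * GROUP_PAGES)
		p->groups--;		/* expensive: release a group */
}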
>> >> So if
>> >> you evict movable objects from mixed group when needed all the
>> >> pagetable pages would end up in the same mixed group slowly taking it
>> >> over completly. No fragmentation at all. See how essential that
>> >> feature is. :)
>> >>
>> >
>> > To move pages, there must be enough blocks free. That is where
>> > min_free_kbytes had to come in. If you cared only about keeping 64KB
>> > chunks free, it makes sense but it didn't in the context of hugepages.
>>
>> I'm more concerned with keeping the little unmovable things out of the
>> way. Those are the things that will fragment the memory and prevent
>> any huge pages to be available even with moving other stuff out of the
>> way.
>
> That's fair, just not cheap
That is the price you pay. To allocate 2MB of ram you have to have 2MB
of free ram or make that much free. There is no way around that. Moving
pages means that you can actually get those 2MB even if the price
is high, and that you have more choice in deciding what to throw away or
swap out. I would rather have a 2MB malloc take some time than have it
fail because the kernel doesn't feel like it.
>> Can you tell me how? I would like to do the same.
>>
>
> They were generated using trace_allocmap kernel module in
> http://www.csn.ul.ie/~mel/projects/vmregress/vmregress-0.81-rc2.tar.gz
> in combination with frag-display in the same package. However, in the
> current version against current -mm's, it'll identify some movable pages
> wrong. Specifically, it will appear to be mixing movable pages with slab
> pages and it doesn't identify SLUB pages properly at all (SLUB came after
> the last revision of this tool). I need to bring an annotation patch up to
> date before it can generate the images correctly.
Thanks. I will test that out and see what I get on a few lustre
servers and clients. That is probably quite a different workload from
what you test.
MfG
Goswin
Andrea Arcangeli <[email protected]> writes:
> On Mon, Sep 17, 2007 at 12:56:07AM +0200, Goswin von Brederlow wrote:
>> When has free ever given any usefull "free" number? I can perfectly
>> fine allocate another gigabyte of memory despide free saing 25MB. But
>> that is because I know that the buffer/cached are not locked in.
>
> Well, as you said you know that buffer/cached are not locked in. If
> /proc/meminfo would be rubbish like you seem to imply in the first
> line, why would we ever bother to export that information and even
> waste time writing a binary that parse it for admins?
As a user I know it because I didn't put a kernel source into /tmp. A
program can't reasonably know that.
>> On the other hand 1GB can instantly vanish when I start a xen domain
>> and anything relying on the free value would loose.
>
> Actually you better check meminfo or free before starting a 1G of Xen!!
Xen has its own memory pool and can quite aggressively reclaim memory
from dom0 when needed. I just meant to say that the number in
/proc/meminfo can change within a second, so it is not much use knowing
what it said a minute ago.
>> The only sensible thing for an application concerned with swapping is
>> to whatch the swapping and then reduce itself. Not the amount
>> free. Although I wish there were some kernel interface to get a
>> preasure value of how valuable free pages would be right now. I would
>> like that for fuse so a userspace filesystem can do caching without
>> cripling the kernel.
>
> Repeated drop caches + free can help.
I would kill any program that does that to find out how much free ram
the system has.
MfG
Goswin
On Sun, 16 September 2007 11:44:09 -0700, Linus Torvalds wrote:
> On Sun, 16 Sep 2007, Jörn Engel wrote:
> >
> > My approach is to have one for mount points and ramfs/tmpfs/sysfs/etc.
> > which are pinned for their entire lifetime and another for regular
> > files/inodes. One could take a three-way approach and have
> > always-pinned, often-pinned and rarely-pinned.
> >
> > We won't get never-pinned that way.
>
> That sounds pretty good. The problem, of course, is that most of the time,
> the actual dentry allocation itself is done before you really know which
> case the dentry will be in, and the natural place for actually giving the
> dentry lifetime hint is *not* at "d_alloc()", but when we "instantiate"
> it with d_add() or d_instantiate().
>
> [...]
>
> And yes, you'd end up with the reallocation overhead quite often, but at
> least it would now happen only when filling in a dentry, not in the
> (*much* more critical) cached lookup path.
There may be another approach. We could create a never-pinned cache,
without trying hard to keep it full. Instead of moving a hot dentry at
dput() time, we move a cold one from the end of lru. And if the lru
list is short, we just chicken out.
Our definition of "short lru list" can either be based on a ratio of
pinned to unpinned dentries or on a metric of cache hits vs. cache
misses. I tend to dislike the cache hit metric, because updatedb would
cause tons of misses and result in the same mess we have right now.
With this double cache, we have a source of slabs to cheaply reap under
memory pressure, but still have a performance advantage (memcpy beats
disk io by orders of magnitude).
Jörn
--
The story so far:
In the beginning the Universe was created. This has made a lot
of people very angry and been widely regarded as a bad move.
-- Douglas Adams
On Sep 23, 2007, at 02:22:12, Goswin von Brederlow wrote:
> [email protected] (Mel Gorman) writes:
>> On (16/09/07 23:58), Goswin von Brederlow didst pronounce:
>>> But when you already have say 10% of the ram in mixed groups then
>>> it is a sign the external fragmentation happens and some time
>>> should be spend on moving movable objects.
>>
>> I'll play around with it on the side and see what sort of results
>> I get. I won't be pushing anything any time soon in relation to
>> this though. For now, I don't intend to fiddle more with grouping
>> pages by mobility for something that may or may not be of benefit
>> to a feature that hasn't been widely tested with what exists today.
>
> I watched the videos you posted. A nice and quite clear improvement
> with and without your logic. Cudos.
>
> When you play around with it may I suggest a change to the display
> of the memory information. I think it would be valuable to use a
> Hilbert Curve to arange the pages into pixels. Like this:
>
> # # 0 3
> # #
> ### 1 2
>
> ### ### 0 1 E F
> # #
> ### ### 3 2 D C
> # #
> # ### # 4 7 8 B
> # # # #
> ### ### 5 6 9 A
Here's an excellent example of a 0-255 numbered Hilbert curve used
to enumerate the various top-level allocations of IPv4 space:
http://xkcd.com/195/
Cheers,
Kyle Moffett
On Sun, Sep 23, 2007 at 08:56:39AM +0200, Goswin von Brederlow wrote:
> As a user I know it because I didn't put a kernel source into /tmp. A
> programm can't reasonably know that.
Various apps require you (admin/user) to tune the size of their
caches. Seems like you have never tried to set up a database, oh well.
> Xen has its own memory pool and can quite agressively reclaim memory
> from dom0 when needed. I just ment to say that the number in
The whole point is if there's not enough ram of course... this is why
you should check.
> /proc/meminfo can change in a second so it is not much use knowing
> what it said last minute.
The numbers will change depending on what's running on your
system. It's up to you to know; plus, I normally keep vmstat running
in the background to see how the cache/free levels change over
time. Those numbers are worthless if they could be fragmented...
> I would kill any programm that does that to find out how much free ram
> the system has.
The admin should do that if he's unsure, not a program of course!
On Fri, 21 Sep 2007, Hugh Dickins wrote:
> I've found some fixes needed on top of your Large Blocksize Support
> patches: I'll send those to you in a moment. Looks like you didn't
> try much swapping!
yup. Thanks for looking at it.
>
> I only managed to get ext2 working with larger blocksizes:
> reiserfs -b 8192 wouldn't mount ("reiserfs_fill_super: can not find
> reiserfs on /dev/sdb1"); ext3 gave me mysterious errors ("JBD: tar
> wants too many credits", even after adding JBD patches that you
> turned out to be depending on); and I didn't try ext4 or xfs
> (I'm guessing the latter has been quite well tested ;)
Yes, there were issues with the first releases of the JBD patches. The
current crop in mm is fine but much of that may have bypassed this list.
On Thursday 20 September 2007 11:38, David Chinner wrote:
> On Wed, Sep 19, 2007 at 04:04:30PM +0200, Andrea Arcangeli wrote:
> > Plus of course you don't like fsblock because it requires work to
> > adapt a fs to it, I can't argue about that.
>
> No, I don't like fsblock because it is inherently a "struture
> per filesystem block" construct, just like buggerheads. You
> still need to allocate millions of them when you have millions
> dirty pages around. Rather than type it all out again, read
> the fsblocks thread from here:
I don't think there is anything inherently wrong with a structure
per filesystem block construct, in the places where you want to
have a handle on that information.
In the data path of a lot of filesystems, it's not really useful, no
question.
But the block / buffer head concept is there and is useful for many
things obviously. fsblock, I believe, improves on it; maybe even to
the point where there won't be too much reason for many
filesystems to convert their data paths to something different (eg.
nobh mode, or an extent block mapping).
> http://marc.info/?l=linux-fsdevel&m=118284983925719&w=2
>
> FWIW, with Chris mason's extent-based block mapping (which btrfs
> is using and Christoph Hellwig is porting XFS over to) we completely
> remove buggerheads from XFS and so fsblock would be a pretty
> major step backwards for us if Chris's work goes into mainline.
If you don't need to manipulate or manage the pagecache on a
per-block basis anyway, then you shouldn't need fsblock (or anything
else particularly special) to do higher order block sizes.
If you do sometimes need to, then fsblock *may* be a way you can
remove vmap code from your filesystem and share it with generic
code...
> But, I'm not going to argue endlessly for one solution or another;
> I'm happy to see different solutions being chased, so may the
> best VM win ;)
Agreed :)