2007-05-04 13:32:28

by Eric W. Biederman

Subject: Re: [00/17] Large Blocksize Support V3

David Chinner <[email protected]> writes:

> On Fri, Apr 27, 2007 at 12:04:03AM -0700, Andrew Morton wrote:
>
> I've looked at all this but I'm trying to work out if anyone
> else has looked at the impact of doing this. I have direct experience
> with this form of block aggregation - this is pretty much what is
> done in irix - and it's full of nasty, ugly corner cases.
>
> I've got several year-old Irix bugs assigned that are hit every so
> often where one page in the aggregated set has the wrong state, and
> it's simply not possible to either reproduce the problem or work out
> how it happened. The code has grown too complex and convoluted, and
> by the time the problem is noticed (either by hang, panic or bug
> check) the cause of it is long gone.
>
> I don't want to go back to having to deal with this sort of problem
> - I'd much prefer to have a design that does not make the same
> mistakes that lead to these sorts of problem.

So the practical question is: was it a high-level design problem, or
was it simply an implementation choice?

Until we code review an implementation that does page aggregation for
Linux we can't say how nasty it would be.

Of course it gets confusing when you refer to the previous
implementation as a buffer cache, because that isn't at all what
Linux had for a buffer cache. The Linux buffer cache was the same as
the current page cache except that it was indexed by block number
rather than by offset into a file.
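
To make the distinction concrete, a from-memory sketch of the two
lookup keys (not exact code from either era):

/* Old buffer cache lookup: keyed by (device, block number). */
struct buffer_head *bh = get_hash_table(dev, block, blocksize);

/* Page cache lookup: keyed by (mapping, offset within the file). */
struct page *page = find_get_page(inode->i_mapping,
                                  offset >> PAGE_CACHE_SHIFT);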

>>
>> You're addressing Christoph's straw man here.
>
> No, I'm speaking from years of experience working on a
> page/buffer/chunk cache capable of using both large pages and
> aggregating multiple pages. It has, at times, almost driven me
> insane and I don't want to go back there.

The suggestion seems to be to always aggregate pages (to handle
PAGE_SIZE < block size), and not to worry about whether the pages you
are aggregating happen to be physically contiguous. The memory
allocator and the block layer can worry about that. It isn't
something the page cache or filesystems need to pay attention to.

I suspect the implementation in Linux would be sufficiently different
that it would not be prone to the same problems. Among other things
we already do most things on a range of page addresses, so we would
seem to have most of the infrastructure already.
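
For example, here is the sort of range-of-pages batching we already
have, as a from-memory sketch of the 2.6-era pagevec idiom (details
may differ from any particular tree):

#include <linux/pagevec.h>

static void for_each_page_in_mapping(struct address_space *mapping)
{
        struct pagevec pvec;
        pgoff_t next = 0;
        unsigned i;

        pagevec_init(&pvec, 0);
        while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
                for (i = 0; i < pagevec_count(&pvec); i++) {
                        struct page *page = pvec.pages[i];

                        next = page->index + 1;
                        /* ... operate on this page of the range ... */
                }
                pagevec_release(&pvec);
        }
}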

It looks like extending the current batching a little more, so that
it covers all of the interesting cases (including the read side), is
about what we would have to do to handle multiple pages in one block
in the page cache:

- Set the dirty bit on all pages in the group when we set it on one
  page (sketched below).

- Re-read the group when we dirty it, if we don't already have all of
  it present.

- Round the range we operate on out so that we cleanly hit the
  beginning and end of the group.

- Only issue the mapping operations on the first page in the group.

There are clearly more details, but as a first approximation I don't
see this being fundamentally more complex than what we are currently
doing; it just takes a few more details into account.
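
A minimal sketch of that first item, assuming a hypothetical
pages_per_block parameter and hand-waving the locking:

/*
 * Hypothetical: when one page in a block-sized group is dirtied,
 * dirty the rest of the group as well.
 */
static void set_group_dirty(struct address_space *mapping, pgoff_t index,
                            unsigned int pages_per_block)
{
        pgoff_t start = index & ~((pgoff_t)pages_per_block - 1);
        unsigned int i;

        for (i = 0; i < pages_per_block; i++) {
                struct page *page = find_get_page(mapping, start + i);

                if (page) {
                        set_page_dirty(page);
                        page_cache_release(page);
                }
        }
}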



The whole physical contiguity thing seems to fall cleanly out of a
speculative page allocator, and that would seem to work and provide
improvements for smaller block size filesystems too, so it looks like
a more general improvement.

Likewise Jens's increase of the Linux scatter-gather list size seems
like a more general, independent improvement.

So if we can also handle groups of pages that make up a single block
as an independent change, we get all of the benefits of large block
sizes, with most of them applying to small sector size filesystems as
well.

Small block sizes give us better storage efficiency, which means less
disk bandwidth used, which means less time to get the data off of a
slow disk (especially if you can put multiple files you want
simultaneously in that same space). So I'm not convinced that large
block sizes are a clear disk performance advantage, and we should not
neglect small file sizes.
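
To put rough numbers on the storage efficiency point, a trivial
userspace illustration (the 2K file size is just an example):

#include <stdio.h>

int main(void)
{
        const long file_size = 2048;            /* an example small file */
        const long block_sizes[] = { 4096, 65536 };
        int i;

        for (i = 0; i < 2; i++) {
                long bs = block_sizes[i];
                long used = ((file_size + bs - 1) / bs) * bs;

                printf("%5ld-byte blocks: %5ld bytes on disk, %5ld wasted\n",
                       bs, used, used - file_size);
        }
        return 0;       /* 4K blocks waste 2048 bytes; 64K blocks waste 63488 */
}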

Eric


2007-05-04 16:11:26

by Christoph Lameter

Subject: Re: [00/17] Large Blocksize Support V3

On Fri, 4 May 2007, Eric W. Biederman wrote:

> Small block sizes give us better storage efficiency, which means less
> disk bandwidth used, which means less time to get the data off of a
> slow disk (especially if you can put multiple files you want
> simultaneously in that same space). So I'm not convinced that large
> block sizes are a clear disk performance advantage, and we should not
> neglect small file sizes.

And the 50% gain in the benchmarks means what? That the device
manufacturers have to redesign their chips?

2007-05-07 04:58:54

by David Chinner

Subject: Re: [00/17] Large Blocksize Support V3

On Fri, May 04, 2007 at 07:31:37AM -0600, Eric W. Biederman wrote:
> David Chinner <[email protected]> writes:
>
> > On Fri, Apr 27, 2007 at 12:04:03AM -0700, Andrew Morton wrote:
> > I've got several year-old Irix bugs assigned that are hit every so
> > often where one page in the aggregated set has the wrong state, and
> > it's simply not possible to either reproduce the problem or work out
> > how it happened. The code has grown too complex and convoluted, and
> > by the time the problem is noticed (either by hang, panic or bug
> > check) the cause of it is long gone.
> >
> > I don't want to go back to having to deal with this sort of problem
> > - I'd much prefer to have a design that does not make the same
> > mistakes that lead to these sorts of problem.
>
> So the practical question is. Was it a high level design problem or
> was it simply a choice of implementation issue.

Both. Too many things can happen to a page asynchronously for it to
be practical to predict all the potential race conditions involved.
The complexity arose from trying to fix the races that were uncovered
without breaking everything else...

> Until we code review an implementation that does page aggregation for
> Linux we can't say how nasty it would be.

We already have an implementation - I've pointed it out several times
now: see fs/xfs/linux-2.6/xfs_buf.[ch].

There are a lot of nasties in there....
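
The core pattern in that code, heavily simplified from memory (the
real xfs_buf has many more states and error paths), is roughly:

/* Build the buffer out of individually allocated pages... */
for (i = 0; i < bp->b_page_count; i++)
        bp->b_pages[i] = alloc_page(GFP_KERNEL);

/* ...then, because they are not physically contiguous, vmap() them
 * to get a virtually contiguous view of the buffer: */
bp->b_addr = vmap(bp->b_pages, bp->b_page_count, VM_MAP, PAGE_KERNEL);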

> >> You're addressing Christoph's straw man here.
> >
> > No, I'm speaking from years of experience working on a
> > page/buffer/chunk cache capable of using both large pages and
> > aggregating multiple pages. It has, at times, almost driven me
> > insane and I don't want to go back there.
>
> The suggestion seems to be to always aggregate pages (to handle
> PAGE_SIZE < block size), and not to worry about whether the pages you
> are aggregating happen to be physically contiguous. The memory
> allocator and the block layer can worry about that. It isn't
> something the page cache or filesystems need to pay attention to.

Performance problems in using discontiguous pages and needing to
vmap() them say otherwise....

> I suspect the implementation in Linux would be sufficiently different
> that it would not be prone to the same problems. Among other things
> we already do most things on a range of page addresses, so we would
> seem to have most of the infrastructure already.

Filesystems don't typically do this - they work on blocks and assume
that a block can be directly referenced.

> Small block sizes give us better storage efficiency, which means less
> disk bandwidth used, which means less time to get the data off of a
> slow disk (especially if you can put multiple files you want
> simultaneously in that same space). So I'm not convinced that large
> block sizes are a clear disk performance advantage, and we should not
> neglect small file sizes.

Hmmm - we're not talking about using 64k block size filesystems to
store lots of little files or using them on small, slow disks.
We're looking at optimising for multi-petabyte filesystems with
multi-terabyte sized files sustaining throughput of tens to hundreds
of GB/s to/from hundreds to thousands of disks.

I certainly don't consider 64k block size filesystems as something
suitable for desktop use - maybe PVRs would benefit, but this
is not something you'd use for your kernel build environment on a
single disk in a desktop system....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-05-07 06:57:50

by Eric W. Biederman

Subject: Re: [00/17] Large Blocksize Support V3

David Chinner <[email protected]> writes:

> Both. Too many things can happen to a page asynchronously for it to
> be practical to predict all the potential race conditions involved.
> The complexity arose from trying to fix the races that were uncovered
> without breaking everything else...

Ok.

>> Until we code review an implementation that does page aggregation for
>> Linux we can't say how nasty it would be.
>
> We already have an implementation - I've pointed it out several times
> now: see fs/xfs/linux-2.6/xfs_buf.[ch].
>
> There are a lot of nasties in there....

Yes, but it isn't a generic implementation in mm/filemap.c; it is a
compatibility layer. It lives with the current deficiencies instead
of removing them.

>> >> You're addressing Christoph's straw man here.
>> >
>> > No, I'm speaking from years of experience working on a
>> > page/buffer/chunk cache capable of using both large pages and
>> > aggregating multiple pages. It has, at times, almost driven me
>> > insane and I don't want to go back there.
>>
>> The suggestion seems to be to always aggregate pages (to handle
>> PAGE_SIZE < block size), and not to worry about whether the pages you
>> are aggregating happen to be physically contiguous. The memory
>> allocator and the block layer can worry about that. It isn't
>> something the page cache or filesystems need to pay attention to.
>
> Performance problems in using discontiguous pages
Small scatter lists?

> and needing to vmap() them say otherwise....
Always?

Ugh. I just realized looking at the xfs code that it doesn't
work in the presence of high memory, at least not with 4K pages.

>> I suspect the implementation in Linux would be sufficiently different
>> that it would not be prone to the same problems. Among other things
>> we already do most things on a range of page addresses, so we would
>> seem to have most of the infrastructure already.
>
> Filesystems don't typically do this - they work on blocks and assume
> that a block can be directly referenced.

But that is how mm/filemap.c works. The calls into the filesystem
could be made per multi-page group just as they are currently made
per page. The point is that the existing in-kernel abstraction for
doing the work is already larger than a page.
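
For instance, the read path today makes one call into the filesystem
per page; the suggestion amounts to making the same kind of call once
per group. Note that readpages_group() below is a hypothetical op,
named only for illustration:

/* Today, per page (as in the do_generic_mapping_read() path): */
error = mapping->a_ops->readpage(filp, page);

/* Suggested, per multi-page group (hypothetical interface): */
error = mapping->a_ops->readpages_group(filp, pages, pages_per_block);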

>> Small block sizes give us better storage efficiency, which means less
>> disk bandwidth used, which means less time to get the data off of a
>> slow disk (especially if you can put multiple files you want
>> simultaneously in that same space). So I'm not convinced that large
>> block sizes are a clear disk performance advantage, and we should not
>> neglect small file sizes.
>
> Hmmm - we're not talking about using 64k block size filesystems to
> store lots of little files or using them on small, slow disks.
> We're looking at optimising for multi-petabyte filesystems with
> multi-terabyte sized files sustaining throughput of tens to hundreds
> of GB/s to/from hundreds to thousands of disks.
>
> I certainly don't consider 64k block size filesystems as something
> suitable for desktop use - maybe PVRs would benefit, but this
> is not something you'd use for your kernel build environment on a
> single disk in a desktop system....

Yes. You are talking about fixing the kernel only for your giant 64K
block filesystems, which are only interesting on petabyte arrays.

I am pointing out that the other fixes that have been discussed,
optimistic contiguous page allocation and a larger Linux
scatter-gather list, are interesting at much smaller filesystem and
machine sizes where small files still matter, making them more
generally useful improvements for Linux.

If you only improve the giant petabyte RAID cases, 99% of Linux users
simply don't care, and so the code isn't very interesting.

Eric

2007-05-09 06:27:26

by Weigert, Daniel

Subject: RE: [00/17] Large Blocksize Support V3

Just to stick my two cents in here:

The definition of what is meant by "large" filesystems has to change
with the advances in disk drive technology. In the not too distant
past, a "large" single filesystem was 100 GB. There are now consumer
grade disks on the market with 1 TB available in a single unit. I don't
know about you guys, but that scares the crap out of me in terms of
dealing with that much space on a desktop machine. Efficiently
transferring that much data on a desktop (never mind a server) means
re-thinking the limitations of the I/O subsystems. What was once the
realm of the data center is now the realm of the living room. Large
data sets (HD movies, audio files, etc.) are becoming more commonplace
with each passing day, and there is no end in sight to the progression.
In addition, with the recent release of specs for larger disk drive
sector sizes (2048 bytes or larger), this is going to become a pressing
need for the general case, not just for extremely large servers or HPC
machines and clusters. Already there is no efficient way to back up
that much space in a reasonable time, except to have another disk of a
similar or larger size to back up to. Anything we can do to make disk
I/O *faster* is a win.
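
For a sense of the scale involved, a trivial calculation (the 60 MB/s
sustained transfer rate is an assumption about 2007-era disks):

#include <stdio.h>

int main(void)
{
        const double bytes = 1e12;      /* a 1 TB disk */
        const double rate  = 60e6;      /* ~60 MB/s sustained (assumed) */

        /* Reading the whole disk once takes about 4.6 hours. */
        printf("full read: %.1f hours\n", bytes / rate / 3600.0);
        return 0;
}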

I recognize that there is a huge issue in dealing with sub-block-size
files. The trade-off of small files vs. large blocks is now a
non-trivial problem. Once disk sector sizes increase, the problems
will have to be dealt with in a more intelligent manner, possibly by
dividing sectors into smaller logical blocks for small files? Maybe
filesystems that can understand multiple block sizes?

Well, we do live in interesting times; we just have to make the most of
it.

Dan Weigert
