2002-07-06 05:29:27

by Rik van Riel

[permalink] [raw]
Subject: [PATCH][RFT](2) minimal rmap for 2.5 - akpm tested

Hi,

Almost the same patch as before, except this one has had
a few hours of testing by Andrew Morton and two bugs have
been ironed out, most notably the truncate_complete_page()
race. This patch is probably safe since Andrew got bored
when no new bugs showed up ...

If you have some time left this weekend and feel brave,
please test the patch which can be found at:

http://surriel.com/patches/2.5/2.5.25-rmap-akpmtested

This patch is based on Craig Kulesa's minimal rmap patch
for 2.5.24, with a few changes:
- removed a few unrelated changes
- updated armv/rmap.h for new pagetable layout of linux/arm
- dropped per-zone pte_chain freelists, we want to make per-cpu
ones for SMP scalability
- ported to 2.5.25 (PF_NOWARN instead of PF_RADIX_TREE)
- drop spelling and whitespace fixes (should be merged separately)
- fix truncate_complete_page race condition (akpm)

It should be mostly ready for being integrated into the 2.5 tree,
with the note that pte-highmem support still needs to be implemented
(some IBMers have been volunteered for this task, this functionality
can easily be added afterwards).

Right now this patch needs testing and careful scrutiny. If you can
find anything wrong with it, please let me know.

kind regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/



2002-07-06 06:20:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH][RFT](2) minimal rmap for 2.5 - akpm tested

Rik van Riel wrote:
>
> Hi,
>
> Almost the same patch as before, except this one has had
> a few hours of testing by Andrew Morton and two bugs have
> been ironed out, most notably the truncate_complete_page()
> race. This patch is probably safe since Andrew got bored
> when no new bugs showed up ...
>

The box died, but not due to rmap. We have a lock ranking
bug:

do_exit
->mmput
->exit_mmap page_table_lock
->removed_shared_vm_struct
->lock_vma_mappings i_shared_lock

versus

do_truncate
->notify_change
->inode_setattr
->vmtruncate i_shared_lock
->vmtruncate_list
->zap_page_range page_table_lock

It seems that in 2.5.16, a call to remove_shared_vm_struct() was
added to exit_mmap(), inside mm->page_table_lock.

That ranking conflicts with truncate.

-

2002-07-06 19:03:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH][RFT](2) minimal rmap for 2.5 - akpm tested



On Fri, 5 Jul 2002, Andrew Morton wrote:
>
> The box died, but not due to rmap. We have a lock ranking
> bug:
>
> do_exit
> ->mmput
> ->exit_mmap page_table_lock
> ->removed_shared_vm_struct
> ->lock_vma_mappings i_shared_lock

I _think_ we should just move the remove_shared_vm_struct() down into the
case where we're closing the mapping, ie something like the appended.

That way we _only_ do the actual page table stuff under the page table
lock, and do all the generic VM/FS stuff outside the lock.

Comments?

Linus

-------------------------------
--- 1.34/mm/mmap.c Thu Jun 27 00:35:55 2002
+++ edited/mm/mmap.c Sat Jul 6 12:06:02 2002
@@ -1121,7 +1121,6 @@
unsigned long end = mpnt->vm_end;

mm->map_count--;
- remove_shared_vm_struct(mpnt);
unmap_page_range(tlb, mpnt, start, end);
mpnt = mpnt->vm_next;
}
@@ -1148,6 +1147,7 @@
*/
while (mpnt) {
struct vm_area_struct * next = mpnt->vm_next;
+ remove_shared_vm_struct(mpnt);
if (mpnt->vm_ops) {
if (mpnt->vm_ops->close)
mpnt->vm_ops->close(mpnt);

2002-07-06 19:52:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH][RFT](2) minimal rmap for 2.5 - akpm tested

Linus Torvalds wrote:
>
> On Fri, 5 Jul 2002, Andrew Morton wrote:
> >
> > The box died, but not due to rmap. We have a lock ranking
> > bug:
> >
> > do_exit
> > ->mmput
> > ->exit_mmap page_table_lock
> > ->removed_shared_vm_struct
> > ->lock_vma_mappings i_shared_lock
>
> I _think_ we should just move the remove_shared_vm_struct() down into the
> case where we're closing the mapping, ie something like the appended.
>
> That way we _only_ do the actual page table stuff under the page table
> lock, and do all the generic VM/FS stuff outside the lock.
>
> Comments?

That is basically what do_munmap() does. But I'm quite unfamiliar
with the locking in there.

-

2002-07-06 20:08:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH][RFT](2) minimal rmap for 2.5 - akpm tested



On Sat, 6 Jul 2002, Andrew Morton wrote:
>
> That is basically what do_munmap() does. But I'm quite unfamiliar
> with the locking in there.

The only major user of i_shared is really vmtruncate, I think, and it's
quite ok to unmap the file before removing the mapping from the shared
list - if vmtruncate finds a unmapped area, it just won't be doing
anything (zap_page_range, but that won't do anything without any page
tables).

Together with the fact that unmap() already does it this way anyway, it
looks like the obvious fix..

Linus

2002-07-10 17:33:12

by Sebastian Droege

[permalink] [raw]
Subject: Re: [PATCH][RFT](2) minimal rmap for 2.5 - akpm tested

On Sat, 6 Jul 2002 02:31:38 -0300 (BRT)
Rik van Riel <[email protected]> wrote:

> Hi,
>
> Almost the same patch as before, except this one has had
> a few hours of testing by Andrew Morton and two bugs have
> been ironed out, most notably the truncate_complete_page()
> race. This patch is probably safe since Andrew got bored
> when no new bugs showed up ...
>
> If you have some time left this weekend and feel brave,
> please test the patch which can be found at:
>
> http://surriel.com/patches/2.5/2.5.25-rmap-akpmtested
>
> This patch is based on Craig Kulesa's minimal rmap patch
> for 2.5.24, with a few changes:
> - removed a few unrelated changes
> - updated armv/rmap.h for new pagetable layout of linux/arm
> - dropped per-zone pte_chain freelists, we want to make per-cpu
> ones for SMP scalability
> - ported to 2.5.25 (PF_NOWARN instead of PF_RADIX_TREE)
> - drop spelling and whitespace fixes (should be merged separately)
> - fix truncate_complete_page race condition (akpm)
>
> It should be mostly ready for being integrated into the 2.5 tree,
> with the note that pte-highmem support still needs to be implemented
> (some IBMers have been volunteered for this task, this functionality
> can easily be added afterwards).
>
> Right now this patch needs testing and careful scrutiny. If you can
> find anything wrong with it, please let me know.
>
> kind regards,
>
> Rik
Hi,
after running your patch some time I have to say that the old VM implementation and the full rmap patch (by Craig Kulesa) was better.
The system becomes very slow and has to swap in too much after some uptime (4 hours - 2 days) and memory intensive tasks...
Maybe this happens only to me but it's fully reproducable
If you need some more informations just ask

Bye


Attachments:
(No filename) (189.00 B)

2002-07-10 20:40:37

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH][RFT](2) minimal rmap for 2.5 - akpm tested

On Wed, 10 Jul 2002, Sebastian Droege wrote:
> On Sat, 6 Jul 2002 02:31:38 -0300 (BRT)
> Rik van Riel <[email protected]> wrote:
>
> > If you have some time left this weekend and feel brave,
> > please test the patch which can be found at:
> >
> > http://surriel.com/patches/2.5/2.5.25-rmap-akpmtested

> after running your patch some time I have to say that the old VM
> implementation and the full rmap patch (by Craig Kulesa) was better. The
> system becomes very slow and has to swap in too much after some uptime
> (4 hours - 2 days) and memory intensive tasks...
> Maybe this happens only to me but it's fully reproducable

It's a known problem with use-once. Users of plain 2.4.18
are complaining about it, too.

This is something to touch on after the rmap mechanism
has been merged, Linus has indicated that he wants to merge
the thing in small bits so that's what we'll be doing ;)

kind regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-07-10 21:53:05

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH][RFT](2) minimal rmap for 2.5 - akpm tested

On Wednesday 10 July 2002 22:42, Rik van Riel wrote:
> On Wed, 10 Jul 2002, Sebastian Droege wrote:
> > On Sat, 6 Jul 2002 02:31:38 -0300 (BRT)
> > Rik van Riel <[email protected]> wrote:
> >
> > > If you have some time left this weekend and feel brave,
> > > please test the patch which can be found at:
> > >
> > > http://surriel.com/patches/2.5/2.5.25-rmap-akpmtested
>
> > after running your patch some time I have to say that the old VM
> > implementation and the full rmap patch (by Craig Kulesa) was better. The
> > system becomes very slow and has to swap in too much after some uptime
> > (4 hours - 2 days) and memory intensive tasks...
> > Maybe this happens only to me but it's fully reproducable
>
> It's a known problem with use-once. Users of plain 2.4.18
> are complaining about it, too.

Hey, thanks Rik, I know something about that :-) And I'd be testing right
now to see if you're right, if the DAC960 driver compiled successfully.
But it doesn't, and since my test machine won't boot without it... given a
choice between diving into the driver and going back to work on directory
hashing on 2.4...

The tree that builds wins this time.

> This is something to touch on after the rmap mechanism
> has been merged, Linus has indicated that he wants to merge
> the thing in small bits so that's what we'll be doing ;)

I bet it's something a lot dumber, like a memory leak.

--
Daniel

2002-07-11 06:45:07

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFT](2) minimal rmap for 2.5 - akpm tested

On Wed, Jul 10 2002, Daniel Phillips wrote:
> On Wednesday 10 July 2002 22:42, Rik van Riel wrote:
> > On Wed, 10 Jul 2002, Sebastian Droege wrote:
> > > On Sat, 6 Jul 2002 02:31:38 -0300 (BRT)
> > > Rik van Riel <[email protected]> wrote:
> > >
> > > > If you have some time left this weekend and feel brave,
> > > > please test the patch which can be found at:
> > > >
> > > > http://surriel.com/patches/2.5/2.5.25-rmap-akpmtested
> >
> > > after running your patch some time I have to say that the old VM
> > > implementation and the full rmap patch (by Craig Kulesa) was better. The
> > > system becomes very slow and has to swap in too much after some uptime
> > > (4 hours - 2 days) and memory intensive tasks...
> > > Maybe this happens only to me but it's fully reproducable
> >
> > It's a known problem with use-once. Users of plain 2.4.18
> > are complaining about it, too.
>
> Hey, thanks Rik, I know something about that :-) And I'd be testing right
> now to see if you're right, if the DAC960 driver compiled successfully.
> But it doesn't, and since my test machine won't boot without it... given a
> choice between diving into the driver and going back to work on directory
> hashing on 2.4...

Leonard has promised me to convert DAC960 to the "new" pci dma api for
years (or so it seems, actual date may vary, no purchase necessary). I
do have a Mylex controller here myself these days, so it's not
completely impossible that I may do it on a rainy day.

--
Jens Axboe

2002-07-11 09:54:00

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH][RFT](2) minimal rmap for 2.5 - akpm tested

On Thursday 11 July 2002 08:47, Jens Axboe wrote:
> On Wed, Jul 10 2002, Daniel Phillips wrote:
> > ...I'd be testing right
> > now to see if you're right, if the DAC960 driver compiled successfully.
> > But it doesn't, and since my test machine won't boot without it... given a
> > choice between diving into the driver and going back to work on directory
> > hashing on 2.4...
>
> Leonard has promised me to convert DAC960 to the "new" pci dma api for
> years (or so it seems, actual date may vary, no purchase necessary). I
> do have a Mylex controller here myself these days, so it's not
> completely impossible that I may do it on a rainy day.

Well, tell me what the new api is and I'll dive in there. For the record,
what happened to the old one? Backwards compatibility dropped recently?
Mea culpa for not knowing, but those dma api threads were just so bushy.

I wouldn't be surprised if some other little things have rotted as well.

--
Daniel

2002-07-11 10:05:48

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][RFT](2) minimal rmap for 2.5 - akpm tested

On Thu, Jul 11 2002, Daniel Phillips wrote:
> On Thursday 11 July 2002 08:47, Jens Axboe wrote:
> > On Wed, Jul 10 2002, Daniel Phillips wrote:
> > > ...I'd be testing right
> > > now to see if you're right, if the DAC960 driver compiled successfully.
> > > But it doesn't, and since my test machine won't boot without it... given a
> > > choice between diving into the driver and going back to work on directory
> > > hashing on 2.4...
> >
> > Leonard has promised me to convert DAC960 to the "new" pci dma api for
> > years (or so it seems, actual date may vary, no purchase necessary). I
> > do have a Mylex controller here myself these days, so it's not
> > completely impossible that I may do it on a rainy day.
>
> Well, tell me what the new api is and I'll dive in there. For the record,

Documentation/DMA-mapping.txt. Also, DAC960 initial bio conversion
happened before the interface was finalized, so it may need changes in
that regard as well. Documentation/block/biodoc.txt is your friend there
:-)

a quick make drivers/block/DAC960.o shows the following stuff needs
changing immediately:

1) q->queue_lock is a pointer to a lock, not the lock itself. Probably
add a per-controller spinlock to DAC960_Controller_T, and pass that to
blk_init_queue(). Then change DAC960_AcquireControllerLock and friends
in DAC960.h accordingly.

2) wrt DMA mapping, see DAC960_BA_WriteHardwareMailbox
(Virtual_to_Bus64, anyone?)...

And probably lots more will unearth once you start tackling it...

> I wouldn't be surprised if some other little things have rotted as well.

Heh, not at all :-)

--
Jens Axboe

2002-07-22 22:03:40

by Daniel Phillips

[permalink] [raw]
Subject: DAC960 Bitrot

On Thursday 11 July 2002 12:08, Jens Axboe wrote:
> On Thu, Jul 11 2002, Daniel Phillips wrote:
> > On Thursday 11 July 2002 08:47, Jens Axboe wrote:
> > > Leonard has promised me to convert DAC960 to the "new" pci dma api for
> > > years (or so it seems, actual date may vary, no purchase necessary). I
> > > do have a Mylex controller here myself these days, so it's not
> > > completely impossible that I may do it on a rainy day.
> >
> > Well, tell me what the new api is and I'll dive in there. For the record,
>
> Documentation/DMA-mapping.txt. Also, DAC960 initial bio conversion
> happened before the interface was finalized, so it may need changes in
> that regard as well. Documentation/block/biodoc.txt is your friend there
> :-)
>
> a quick make drivers/block/DAC960.o shows the following stuff needs
> changing immediately:
>
> 1) q->queue_lock is a pointer to a lock, not the lock itself. Probably
> add a per-controller spinlock to DAC960_Controller_T, and pass that to
> blk_init_queue(). Then change DAC960_AcquireControllerLock and friends
> in DAC960.h accordingly.

The big change here appears to be the move to per-device request queues.
Somebody apparently already started to update this driver (you?) but
obviously didn't try to compile it. This is new territory for me, so I'll be
moving gingerly in here for a while.

For those locks, I just removed the &'s, which seems like the right thing to
do. The "Controller" lock really seems to be a request queue lock. Now I
think I need to allocate and initialize a request queue, possibly in
DAC960_CreateAuxiliaryStructures. Am I getting warm?

--
Daniel

2002-07-24 14:36:29

by Jens Axboe

[permalink] [raw]
Subject: Re: DAC960 Bitrot

On Tue, Jul 23 2002, Daniel Phillips wrote:
> On Thursday 11 July 2002 12:08, Jens Axboe wrote:
> > On Thu, Jul 11 2002, Daniel Phillips wrote:
> > > On Thursday 11 July 2002 08:47, Jens Axboe wrote:
> > > > Leonard has promised me to convert DAC960 to the "new" pci dma api for
> > > > years (or so it seems, actual date may vary, no purchase necessary). I
> > > > do have a Mylex controller here myself these days, so it's not
> > > > completely impossible that I may do it on a rainy day.
> > >
> > > Well, tell me what the new api is and I'll dive in there. For the record,
> >
> > Documentation/DMA-mapping.txt. Also, DAC960 initial bio conversion
> > happened before the interface was finalized, so it may need changes in
> > that regard as well. Documentation/block/biodoc.txt is your friend there
> > :-)
> >
> > a quick make drivers/block/DAC960.o shows the following stuff needs
> > changing immediately:
> >
> > 1) q->queue_lock is a pointer to a lock, not the lock itself. Probably
> > add a per-controller spinlock to DAC960_Controller_T, and pass that to
> > blk_init_queue(). Then change DAC960_AcquireControllerLock and friends
> > in DAC960.h accordingly.
>
> The big change here appears to be the move to per-device request queues.
> Somebody apparently already started to update this driver (you?) but
> obviously didn't try to compile it. This is new territory for me, so I'll be
> moving gingerly in here for a while.

Well not really, DAC960 is still using the default per-major queues. But
switching to per-device queue would definitely be a Really Good Idea.
The only changes I did to this driver where trivial conversions in the
2.5.1-pre days, in fact even before multi-page bio's existed. This,
btw, is also something you should keep an eye out for -- multi-page bio
support is currently broken. I would suggest also moving DAC960 to the
pci dma api (this is a must) and then move it to use the generic block
helpers for mapping requests. That way there isn't a lot of nasty
duplication there as well, plus it will automatically get the multi-page
issues right.

> For those locks, I just removed the &'s, which seems like the right thing to
> do. The "Controller" lock really seems to be a request queue lock. Now I
> think I need to allocate and initialize a request queue, possibly in
> DAC960_CreateAuxiliaryStructures. Am I getting warm?

If you move the queues to be per-device, then yes you can use the
Controller lock for the queue lock as well. It was made that way for
easy conversions. So basically embed a request_queue_t inside the
DAC960_Controller_T (god damnit, I'm having a hard time even just
writing these ugly names down here :-) and do something ala

RequestQueue = = &Controller->ThisIsTheQueueYouAdded;
spin_lock_init(&Controller->ThisIsTheLockYouAdded);
blk_init_queue(RequestQueue, DAC960_RequestFunction, &Controller->ThisIsTheLockYouAdded);
RequestQueue->queuedata = Controller;
...

Hmm, is DAC960 using a full major per controller?!

But... This is the least of your worries. The major issue here is
updating it to the pci dma api. Once that is done, the rest is trivial
and will actually consist mainly of removing code, not adding.

--
Jens Axboe

2002-07-24 14:45:15

by Jens Axboe

[permalink] [raw]
Subject: Re: DAC960 Bitrot

On Wed, Jul 24 2002, Jens Axboe wrote:
> Hmm, is DAC960 using a full major per controller?!

Yep it is, ok so it was always using per-controller queues by virtue of
that. So you'll just need to add a lock to Controller, not the queue
itself. On second thought, please add the queue as well though. Maybe we
can kill BLK_DEFAULT_QUEUE() sometime soon then :-)

--
Jens Axboe

2002-07-24 16:07:38

by Wakko Warner

[permalink] [raw]
Subject: Re: DAC960 Bitrot

> Well not really, DAC960 is still using the default per-major queues. But
> switching to per-device queue would definitely be a Really Good Idea.
> The only changes I did to this driver where trivial conversions in the
> 2.5.1-pre days, in fact even before multi-page bio's existed. This,

I'd like to make a suggestion which actually doesn't pertain to the original
thread, but since this is being read, I thought it would be a good place.

Why not add a configure option to disable firmware checking? I have a few
cards here with firmware 2.x x < 73 and they work fine (disabled the check).

--
Lab tests show that use of micro$oft causes cancer in lab animals

2002-07-24 16:53:01

by Daniel Phillips

[permalink] [raw]
Subject: Re: DAC960 Bitrot

On Wednesday 24 July 2002 16:39, Jens Axboe wrote:
> The only changes I did to this driver where trivial conversions in the
> 2.5.1-pre days, in fact even before multi-page bio's existed. This,
> btw, is also something you should keep an eye out for -- multi-page bio
> support is currently broken.

I spotted that. I changed bio_size (which is gone) to bio_sectors(bio) << 9,
is this correct?

> I would suggest also moving DAC960 to the
> pci dma api (this is a must) and then move it to use the generic block
> helpers for mapping requests. That way there isn't a lot of nasty
> duplication there as well, plus it will automatically get the multi-page
> issues right.

My first concern is to get something working any way I can so that I can
start doing regression testing. True/false: the bad old way of doing dma
will still work, it's just deprecated? If true, then I should (trivially)
switch back to the old way of doing things, get the rest working, then
convert to the dma api. Maybe *you* could make all the changes at the same
time and expect to end up with something that works, but I can't.

The alternative is to go back many kernel versions and find the first one
that broke something, but I don't want to do that because too much else was
broken at the time.

> Hmm, is DAC960 using a full major per controller?!

As you saw, it implements the top level block interface instead of being a
scsi device as it should be. So for disk subsystems we have: 1) IDE 2) SCSI
3) DAC960. Eep. At some point it's all going to be SCSI, right?

--
Daniel

2002-07-24 16:59:26

by Jens Axboe

[permalink] [raw]
Subject: Re: DAC960 Bitrot

On Wed, Jul 24 2002, Daniel Phillips wrote:
> On Wednesday 24 July 2002 16:39, Jens Axboe wrote:
> > The only changes I did to this driver where trivial conversions in the
> > 2.5.1-pre days, in fact even before multi-page bio's existed. This,
> > btw, is also something you should keep an eye out for -- multi-page bio
> > support is currently broken.
>
> I spotted that. I changed bio_size (which is gone) to bio_sectors(bio) << 9,
> is this correct?

Probably not. There are all sorts of issues about when to go to the next
bio (bi_next) and when just to grab the next bvec by increasing bi_idx.
It can be pretty hairy for a driver to do.

> > I would suggest also moving DAC960 to the
> > pci dma api (this is a must) and then move it to use the generic block
> > helpers for mapping requests. That way there isn't a lot of nasty
> > duplication there as well, plus it will automatically get the multi-page
> > issues right.
>
> My first concern is to get something working any way I can so that I can
> start doing regression testing. True/false: the bad old way of doing dma

I think that's a really bad idea. Yes it's a lot of work to do the pci
dma conversion (but it's not _hard_), but you'll get the rest for free
once that is done. Really. If you do it your way, then you'll just fix
up the old driver only to rewrite the dma handling (and then convert to
the block helpers) later on. Twice the work. Did I sell the idea or
what?

> will still work, it's just deprecated? If true, then I should (trivially)
> switch back to the old way of doing things, get the rest working, then
> convert to the dma api. Maybe *you* could make all the changes at the same
> time and expect to end up with something that works, but I can't.

See above :)

> The alternative is to go back many kernel versions and find the first one
> that broke something, but I don't want to do that because too much else was
> broken at the time.

Well, you'll go back to 2.5.1-preX and find that the breaking point is
when the bio stuff got merged. And you'll find that the DAC960 driver is
about the same as it is now. So that will buy you nothing, I'm afraid.

> > Hmm, is DAC960 using a full major per controller?!
>
> As you saw, it implements the top level block interface instead of being a
> scsi device as it should be. So for disk subsystems we have: 1) IDE 2) SCSI
> 3) DAC960. Eep. At some point it's all going to be SCSI, right?

Nah not really, at some point it will just be the 'disk' sub system.
BTW, you also forgot at least cpqarray and cciss :-)

--
Jens Axboe

2002-07-24 17:26:50

by Alexander Viro

[permalink] [raw]
Subject: Re: DAC960 Bitrot



On Wed, 24 Jul 2002, Jens Axboe wrote:

> > 3) DAC960. Eep. At some point it's all going to be SCSI, right?
>
> Nah not really, at some point it will just be the 'disk' sub system.
> BTW, you also forgot at least cpqarray and cciss :-)

... along with paride. And assorted floppies. And oddball CDROMs.
And nbd. And...

IOW, Daniel would really benefit from learning to use grep...