2004-10-27 15:21:36

by Jeff Garzik

Subject: Re: news about IDE PIO HIGHMEM bug (was: Re: 2.6.9-mm1)

Bartlomiej Zolnierkiewicz wrote:
> We have struct page of the first page and an offset.
> We need to obtain struct page of the current page and map it.


Opening this question to a wider audience.

struct scatterlist gives us struct page*, and an offset+length pair.
The struct page* is the _starting_ page of a potentially multi-page run
of data.

The question: how does one get struct page* for the second, and
successive pages in a known-contiguous multi-page run, if one only knows
the first page?
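
For reference, a minimal sketch of the 2.6-era scatterlist layout and the
naive walk in question (arch-specific DMA fields omitted; treat the details
as an assumption rather than the exact definition):

	/* sketch, assuming the 2.6-era struct scatterlist layout */
	struct scatterlist {
		struct page	*page;		/* first page of the segment */
		unsigned int	offset;		/* offset into that first page */
		unsigned int	length;		/* total length; may span pages */
	};

	/* naive walk: plain pointer arithmetic on struct page -- this
	 * assumes mem_map is contiguous across the whole run, which is
	 * exactly what is in question under NUMA/discontig */
	static inline struct page *naive_nth(struct scatterlist *sg,
					     unsigned int n)
	{
		return sg->page + n;	/* unsafe across a discontig boundary */
	}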

Jeff



2004-10-27 15:54:45

by Martin J. Bligh

Subject: Re: news about IDE PIO HIGHMEM bug (was: Re: 2.6.9-mm1)

> Bartlomiej Zolnierkiewicz wrote:
>> We have struct page of the first page and an offset.
>> We need to obtain struct page of the current page and map it.
>
>
> Opening this question to a wider audience.
>
> struct scatterlist gives us struct page*, and an offset+length pair. The struct page* is the _starting_ page of a potentially multi-page run of data.
>
> The question: how does one get struct page* for the second, and successive pages in a known-contiguous multi-page run, if one only knows the first page?

If it's a higher order allocation, just page+1 should be safe. If it just
happens to be contig, it might cross a discontig boundary, and not obey
that rule. Very unlikely, but possible.
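
A sketch of the higher-order case (illustrative only; NULL check and error
handling omitted):

	#include <linux/gfp.h>

	/* all 1 << order pages from a single allocation come from one
	 * aligned block, so their struct pages are adjacent in mem_map */
	static void higher_order_example(void)
	{
		struct page *first = alloc_pages(GFP_KERNEL, 2); /* 4 pages */
		struct page *third = first + 2;	/* safe: same allocation */

		(void)third;
		__free_pages(first, 2);
	}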

M.

2004-10-27 16:02:44

by Martin J. Bligh

Subject: Re: news about IDE PIO HIGHMEM bug (was: Re: 2.6.9-mm1)

--"Martin J. Bligh" <[email protected]> wrote (on Wednesday, October 27, 2004 08:52:39 -0700):

>> Bartlomiej Zolnierkiewicz wrote:
>>> We have struct page of the first page and an offset.
>>> We need to obtain struct page of the current page and map it.
>>
>>
>> Opening this question to a wider audience.
>>
>> struct scatterlist gives us struct page*, and an offset+length pair. The struct page* is the _starting_ page of a potentially multi-page run of data.
>>
>> The question: how does one get struct page* for the second, and successive pages in a known-contiguous multi-page run, if one only knows the first page?
>
> If it's a higher order allocation, just page+1 should be safe. If it just
> happens to be contig, it might cross a discontig boundary, and not obey
> that rule. Very unlikely, but possible.

To repeat what I said in IRC ... ;-)

Actually, you could check for this by testing whether the pfns are the same
when shifted right by MAX_ORDER-1. We should be aligned on a MAX_ORDER
boundary, I think.

However, pfn_to_page(page_to_pfn(page) + 1) might be safer. If rather slower.
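
In code, the check described above might look like this (assuming
MAX_ORDER-aligned mem_map chunks, as hedged above):

	#include <linux/mm.h>

	/* two pfns sit in the same MAX_ORDER-aligned chunk iff they agree
	 * once shifted right by MAX_ORDER-1; within such a chunk, page + 1
	 * is plain pointer arithmetic and safe */
	static inline int same_chunk(unsigned long pfn1, unsigned long pfn2)
	{
		return (pfn1 >> (MAX_ORDER - 1)) == (pfn2 >> (MAX_ORDER - 1));
	}

	/* the slower but always-safe alternative from the same mail */
	static inline struct page *next_page_safe(struct page *page)
	{
		return pfn_to_page(page_to_pfn(page) + 1);
	}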

M.

2004-10-27 16:02:11

by Jeff Garzik

Subject: Re: news about IDE PIO HIGHMEM bug (was: Re: 2.6.9-mm1)

Martin J. Bligh wrote:
>>Bartlomiej Zolnierkiewicz wrote:
>>
>>>We have struct page of the first page and an offset.
>>>We need to obtain struct page of the current page and map it.
>>
>>
>>Opening this question to a wider audience.
>>
>>struct scatterlist gives us struct page*, and an offset+length pair. The struct page* is the _starting_ page of a potentially multi-page run of data.
>>
>>The question: how does one get struct page* for the second, and successive pages in a known-contiguous multi-page run, if one only knows the first page?
>
>
> If it's a higher order allocation, just page+1 should be safe. If it just
> happens to be contig, it might cross a discontig boundary, and not obey
> that rule. Very unlikely, but possible.


Unfortunately, it's not.

The block layer just tells us "it's a contiguous run of memory", which
implies nothing really about the allocation size.

Bart and I (and others?) essentially need a "page+1" thing (for 2.4.x
too!), that won't break in the face of NUMA/etc.

Alternatively (or additionally), we may need to make sure the block
layer doesn't merge across zones or NUMA boundaries or whatnot.

Jeff


2004-10-27 16:42:52

by Jeff Garzik

Subject: [PATCH] Re: news about IDE PIO HIGHMEM bug (was: Re: 2.6.9-mm1)

===== include/linux/mm.h 1.193 vs edited =====
--- 1.193/include/linux/mm.h 2004-10-20 04:37:06 -04:00
+++ edited/include/linux/mm.h 2004-10-27 12:33:28 -04:00
@@ -41,6 +41,8 @@
#define MM_VM_SIZE(mm) TASK_SIZE
#endif

+#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + n)
+
/*
* Linux kernel virtual memory manager primitives.
* The idea being to have a "virtual" mm in the same way
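
For context, a sketch of how a PIO loop might walk one sg entry using
this macro (everything here other than nth_page() itself is illustrative,
including the KM_IRQ0 slot; it is not part of the patch):

	#include <linux/highmem.h>
	#include <linux/mm.h>

	static void pio_xfer_sg(struct scatterlist *sg)	/* hypothetical */
	{
		unsigned int offset = sg->offset;
		unsigned int left = sg->length;

		while (left) {
			/* page the current offset falls in; safe across
			 * discontig boundaries, unlike sg->page + n */
			struct page *page = nth_page(sg->page,
						     offset >> PAGE_SHIFT);
			unsigned int poff = offset & ~PAGE_MASK;
			unsigned int count = min_t(unsigned int, left,
						   PAGE_SIZE - poff);
			void *buf = kmap_atomic(page, KM_IRQ0);

			/* ... transfer 'count' bytes at buf + poff ... */

			kunmap_atomic(buf, KM_IRQ0);
			offset += count;
			left -= count;
		}
	}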



2004-10-27 17:44:03

by Martin J. Bligh

Subject: Re: news about IDE PIO HIGHMEM bug (was: Re: 2.6.9-mm1)

> Unfortunately, it's not.
>
> The block layer just tells us "it's a contiguous run of memory", which implies nothing really about the allocation size.
>
> Bart and I (and others?) essentially need a "page+1" thing (for 2.4.x too!), that won't break in the face of NUMA/etc.
>
> Alternatively (or additionally), we may need to make sure the block layer doesn't merge across zones or NUMA boundaries or whatnot.


The latter would be rather more efficient. I don't know how often you
end up doing each operation though ... the page+1 vs the attempted merge.
Depends on the ratio, I guess.

M.

2004-10-27 18:53:53

by William Lee Irwin III

Subject: Re: news about IDE PIO HIGHMEM bug

Christoph Hellwig wrote:
>> I think this is the wrong level of interface to expose. Just add two helpers,
>> kmap_atomic_sg/kunmap_atomic_sg, that guarantee to map/unmap an sg list
>> entry, even if it's bigger than a page.

On Wed, Oct 27, 2004 at 02:33:45PM -0400, Jeff Garzik wrote:
> Why bother mapping anything larger than a page, when none of the users
> need it?
> P.S. In your scheme you would need four helpers; you forgot kmap_sg()
> and kunmap_sg().

This is all a non-issue. In the context of merging, the page structure
represents little more than a physical address to the block layer,
so the pfn_to_page(page_to_pfn(...) + ...) bits calculate this properly.
There is just nothing interesting going on here. Generate the page
structure for the piece of the segment, kmap_atomic() it, and it's done.
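
In code, the sequence described above is roughly (kernel context assumed;
the KM_* slot is illustrative):

	/* map just the page containing byte 'off' of the segment */
	static void *map_piece(struct scatterlist *sg, unsigned int off)
	{
		unsigned long pfn = page_to_pfn(sg->page) +
				    ((sg->offset + off) >> PAGE_SHIFT);

		return kmap_atomic(pfn_to_page(pfn), KM_IRQ0);
	}
	/* ... move at most one page's worth, then kunmap_atomic() ... */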


-- wli

2004-10-27 21:32:23

by Andrew Morton

Subject: Re: [PATCH] Re: news about IDE PIO HIGHMEM bug (was: Re: 2.6.9-mm1)

Jeff Garzik <[email protected]> wrote:
>
> > However, pfn_to_page(page_to_pfn(page) + 1) might be safer. If rather slower.
>
>
> Is this patch acceptable to everyone? Andrew?

spose so. The scatterlist API is being a bit silly there.

It might be worthwhile doing:

#ifdef CONFIG_DISCONTIGMEM
#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + n)
#else
#define nth_page(page,n) ((page)+(n))
#endif
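
The non-DISCONTIGMEM branch collapses to the same arithmetic, since the
flat-memory pfn_to_page() is itself just mem_map indexing, roughly (from
memory; exact definitions vary by arch):

	/* flat-memory definitions, approximately */
	#define page_to_pfn(page)	((unsigned long)((page) - mem_map))
	#define pfn_to_page(pfn)	(mem_map + (pfn))
	/* hence pfn_to_page(page_to_pfn(page) + n) == (page) + (n) */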

2004-10-27 21:42:43

by William Lee Irwin III

Subject: Re: [PATCH] Re: news about IDE PIO HIGHMEM bug (was: Re: 2.6.9-mm1)

Jeff Garzik <[email protected]> wrote:
>> However, pfn_to_page(page_to_pfn(page) + 1) might be safer. If
>> rather slower. Is this patch acceptable to everyone? Andrew?

On Wed, Oct 27, 2004 at 02:29:14PM -0700, Andrew Morton wrote:
> spose so. The scatterlist API is being a bit silly there.
> It might be worthwhile doing:
> #ifdef CONFIG_DISCONTIGMEM
> #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + n)
> #else
> #define nth_page(page,n) ((page)+(n))
> #endif

This is actually not quite good enough. Zones are not guaranteed
to have adjacent mem_map[]'s even with CONFIG_DISCONTIGMEM=n. It may
make sense to prevent merging from spanning zones, but frankly the
overhead of pfn_to_page()/page_to_pfn() is negligible in comparison
to the data movement and (when applicable) virtual windowing. CPU
overhead is a greater concern in the merging code, particularly for
devices that don't require manual data movement.


-- wli

2004-10-27 21:36:53

by Jeff Garzik

Subject: Re: [PATCH] Re: news about IDE PIO HIGHMEM bug (was: Re: 2.6.9-mm1)

On Wed, Oct 27, 2004 at 02:29:14PM -0700, Andrew Morton wrote:
> spose so. The scatterlist API is being a bit silly there.

Well, it depends on your perspective :)

Each scatterlist entry is supposed to map to a physical segment to be
passed to h/w. Hardware S/G tables just want to see an addr/len pair,
and don't care about machine page size. scatterlist follows a similar
model.

dma_map_sg() and other helpers create a favorable situation, where >90%
of the drivers don't have to care about the VM page-size details.
Unfortunately, those drivers that need to do their own data transfer
(like ATA's PIO, instead of DMA) need direct access to each member of an
s/g list.
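
As an illustration of the addr/len model (the descriptor layout and
fill_hw_sg() are hypothetical; dma_map_sg(), sg_dma_address() and
sg_dma_len() are the real interfaces):

	#include <linux/dma-mapping.h>

	/* hypothetical hardware S/G descriptor: just an addr/len pair */
	struct hw_sg_entry {
		u64	addr;	/* bus address of the segment */
		u32	len;	/* segment length in bytes */
	};

	/* build the hardware table after mapping; no page-size knowledge
	 * is needed for this, the >90% case */
	static int fill_hw_sg(struct device *dev, struct scatterlist *sg,
			      int n, struct hw_sg_entry *table)
	{
		int i, nents = dma_map_sg(dev, sg, n, DMA_TO_DEVICE);

		for (i = 0; i < nents; i++) {
			table[i].addr = sg_dma_address(&sg[i]);
			table[i].len  = sg_dma_len(&sg[i]);
		}
		return nents;
	}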

Jeff



2004-10-27 18:45:02

by Jeff Garzik

Subject: Re: news about IDE PIO HIGHMEM bug

Christoph Hellwig wrote:
>>To repeat what I said in IRC ... ;-)
>>
>>Actually, you could check for this by testing whether the pfns are the same
>>when shifted right by MAX_ORDER-1. We should be aligned on a MAX_ORDER
>>boundary, I think.
>>
>>However, pfn_to_page(page_to_pfn(page) + 1) might be safer. If rather slower.
>
>
> I think this is the wrong level of interface to expose. Just add two helpers,
> kmap_atomic_sg/kunmap_atomic_sg, that guarantee to map/unmap an sg list
> entry, even if it's bigger than a page.

Why bother mapping anything larger than a page, when none of the users
need it?

Jeff



P.S. In your scheme you would need four helpers; you forgot kmap_sg()
and kunmap_sg().

2004-10-27 18:22:54

by Christoph Hellwig

Subject: Re: news about IDE PIO HIGHMEM bug (was: Re: 2.6.9-mm1)

> To repeat what I said in IRC ... ;-)
>
> Actually, you could check for this by testing whether the pfns are the same
> when shifted right by MAX_ORDER-1. We should be aligned on a MAX_ORDER
> boundary, I think.
>
> However, pfn_to_page(page_to_pfn(page) + 1) might be safer. If rather slower.

I think this is the wrong level of interface to expose. Just add two helpers,
kmap_atomic_sg/kunmap_atomic_sg, that guarantee to map/unmap an sg list
entry, even if it's bigger than a page.
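
For concreteness, the proposed interface might look like this (the
signatures are entirely hypothetical; nothing like this was merged):

	/* map/unmap an entire sg entry, even one spanning several pages --
	 * this would need a run of adjacent fixmap slots sized to the
	 * maximum segment length */
	void *kmap_atomic_sg(struct scatterlist *sg, enum km_type type);
	void kunmap_atomic_sg(void *vaddr, enum km_type type);

	/* plus the sleeping variants Jeff's P.S. above asks about */
	void *kmap_sg(struct scatterlist *sg);
	void kunmap_sg(void *vaddr);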

2004-10-28 00:26:21

by William Lee Irwin III

Subject: Re: news about IDE PIO HIGHMEM bug

Christoph Hellwig wrote:
>> I think this is the wrong level of interface to expose. Just add two helpers,
>> kmap_atomic_sg/kunmap_atomic_sg, that guarantee to map/unmap an sg list
>> entry, even if it's bigger than a page.

On Wed, Oct 27, 2004 at 02:33:45PM -0400, Jeff Garzik wrote:
> Why bother mapping anything larger than a page, when none of the users
> need it?
> P.S. In your scheme you would need four helpers; you forgot kmap_sg()
> and kunmap_sg().

The scheme hch suggested is highly invasive in the area of architecture-
specific fixmap layout and introduces a dependency of fixmap layout on
maximum segment size, which may make current normal maximum segment
sizes use prohibitive amounts of vmalloc space on 32-bit architectures.

So I'd drop that suggestion, though it's not particularly farfetched.


-- wli