Hello all,
I've discussed adding Page Attribute Table (PAT) support to the kernel w/ a few developers offline. They were very supportive and suggested I bring the discussion to lkml so others could get involved.
PAT support allows the cache attributes that are traditionally set via the MTRRs to be set through the page table entries instead. The specific cache attribute that graphics companies such as ourselves (nvidia), ATI, Matrox, and others are becoming interested in is Write-Combining (WC), for both the AGP and framebuffer apertures. Traditionally, these apertures are marked WC by setting the physical memory ranges to WC in the MTRRs. This has worked very well, but is becoming a problem on workstation systems with 1GB or more of memory.
The problem here is that the system BIOS typically covers physical RAM with Write-Back (WB) MTRRs. On systems with large amounts of physical RAM, especially when physical memory ranges are interspersed with RAM, the BIOSes use multiple MTRRs, with strange results. In some cases, so many MTRRs are used to cover physical RAM that none are left over for the AGP or framebuffer apertures. In other cases, one MTRR is used to mark the ranges that are not physical RAM as Uncached (which covers both apertures). When we then try to mark the appropriate apertures as WC, the kernel refuses to overlap the MTRRs.
Windows works around this MTRR issue by using the PATs.
An example of such a report recently sent to lkml is here:
http://www.ussg.iu.edu/hypermail/linux/kernel/0303.1/0606.html
I discussed this some with Jeff Hartmann, who had some initial development code, integrated into agpgart, for marking AGP pages WC as they were allocated. I think it would be preferable to have PAT support separate from agpgart. That way, other drivers could make use of PAT support for other purposes (such as mapping the framebuffer). Jeff Hartmann sent us a pass at adding PAT support to agpgart. We've modified his code slightly to be more generic (standalone from agpgart) and usable via the traditional __pgprot() macros (and therefore with the change_page_attr() function).
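As a rough illustration of how a page table entry selects a cache type through the PAT: on x86, a 4K PTE picks one of eight PAT slots with its PWT (bit 3), PCD (bit 4) and PAT (bit 7) flags, and each slot is one byte of the IA32_PAT MSR. The sketch below is plain user-space C, not the patch itself. The helper names (pat_index_4k, pat_slot_type, pat_set_slot) are made up for the example, and which slot would hold WC is left open here.

```c
#include <assert.h>
#include <stdint.h>

/* Cache-control flag bits of a 4K x86 page table entry. */
#define PTE_PWT (1u << 3)   /* page write-through */
#define PTE_PCD (1u << 4)   /* page cache disable */
#define PTE_PAT (1u << 7)   /* PAT index bit 2 (4K pages only) */

/* Memory type encodings used in the IA32_PAT MSR. */
enum pat_type {
    PAT_UC = 0, PAT_WC = 1, PAT_WT = 4,
    PAT_WP = 5, PAT_WB = 6, PAT_UC_MINUS = 7
};

/* Power-on value of IA32_PAT: WB, WT, UC-, UC, then the same again. */
#define PAT_RESET_VALUE 0x0007040600070406ULL

/* Slot (0..7) selected by the PAT, PCD and PWT bits of a 4K PTE. */
unsigned pat_index_4k(uint32_t pte_flags)
{
    return (unsigned)((!!(pte_flags & PTE_PAT) << 2) |
                      (!!(pte_flags & PTE_PCD) << 1) |
                       !!(pte_flags & PTE_PWT));
}

/* Memory type stored in one slot of a PAT MSR value. */
unsigned pat_slot_type(uint64_t pat_msr, unsigned slot)
{
    assert(slot < 8);
    return (unsigned)((pat_msr >> (slot * 8)) & 0x7);
}

/* Return a copy of pat_msr with `slot` reprogrammed to `type`. */
uint64_t pat_set_slot(uint64_t pat_msr, unsigned slot, enum pat_type type)
{
    assert(slot < 8);
    pat_msr &= ~(0xffULL << (slot * 8));
    return pat_msr | ((uint64_t)type << (slot * 8));
}
```

For example, pat_set_slot(PAT_RESET_VALUE, 1, PAT_WC) yields a layout in which a PTE carrying only PTE_PWT resolves to WC. A PAGE_KERNEL_WC-style protection value would then just be the normal kernel protection bits plus the flags for that slot.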
Please cc me on any responses, as I'm not on the list.
Thanks,
Terence
Terence Ripperda <[email protected]> writes:
> Hello all,
>
> I've discussed adding Page Attribute Table (PAT) support to the kernel w/ a few developers offline. They were very supportive and suggested I bring the discussion to lkml so others could get involved.
change_page_attr() will already do it for the kernel mappings. Just
define a PAGE_KERNEL_WC. Drawback is that it will convert the mapping
to 4K pages (from 2/4MB), but there is probably no alternative unless
all your mappings are 2MB aligned.
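To make the alignment condition concrete: a range can stay on large pages only if both its start and its length are multiples of the large page size (2MB with PAE, 4MB without). The following is a minimal user-space sketch with made-up helper names. It only does the arithmetic and does not reflect what change_page_attr() actually implements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LARGE_PAGE_2M 0x200000ULL   /* large page size with PAE */
#define LARGE_PAGE_4M 0x400000ULL   /* large page size without PAE */

/* True if [start, start + len) consists only of whole large pages. */
bool range_is_large_aligned(uint64_t start, uint64_t len, uint64_t large)
{
    assert(large != 0);
    return len != 0 && (start % large) == 0 && (len % large) == 0;
}

/*
 * Number of large pages that are only partly covered by
 * [start, start + len) and so would have to be split into 4K pages
 * before the attribute of just that range could be changed.
 */
uint64_t large_pages_to_split(uint64_t start, uint64_t len, uint64_t large)
{
    uint64_t end, touched, first_full, end_full, full;

    assert(large != 0);
    if (len == 0)
        return 0;

    end = start + len;
    /* Large pages that overlap the range at all. */
    touched = (end - 1) / large - start / large + 1;
    /* Large pages that lie entirely inside the range. */
    first_full = (start + large - 1) / large;
    end_full = end / large;
    full = end_full > first_full ? end_full - first_full : 0;

    return touched - full;
}
```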
But the tricky part of it is that you need to make sure all mappings
to that memory have the same caching attribute, otherwise you invoke
undefined x86 behaviour and risk cache corruptions on some CPUs.
For the special case of AGP it's quite simple - when a user process maps
the aperture it can just set the correct bits in its own mmap method
like it already does for uncachable mappings. But for other mappings
it is more difficult.
For normal memory you would need to find a way to synchronize the
attributes in all mappers (e.g. setting a flag in struct page or
similar). For frame buffer you also need to handle it in all mmap'ers
(like fbcon or /dev/mem). I think handling these generic cases will
need a few VM changes.
[actually even the agp aperture can be accessed using /dev/mem,
but that's probably unlikely to happen because there is a better interface
and could be ignored]
-Andi
Terence Ripperda <[email protected]>, <[email protected]> writes:
> Hello all,
>
> I've discussed adding Page Attribute Table (PAT) support to the kernel w/ a few developers offline. They were very supportive and suggested I bring the discussion to lkml so others could get involved.
Not that I disagree with utilising the PAT, but I don't see anything in this code to
deal with the widespread PAT indexing erratum in Intel's processors. I don't have
the errata sheets here, but it definitely affected the PIIIs and I think also some P4s.
(Large pages ignoring PAT index bit 2, or something like that.)
/Mikael
On Tue, May 20, 2003 at 09:10:18PM +0200, [email protected] wrote:
> change_page_attr() will already do it for the kernel mappings. Just
> define a PAGE_KERNEL_WC.
correct. that was the intent.
> But the tricky part of it is that you need to make sure all mappings
> to that memory have the same caching attribute, otherwise you invoke
> undefined x86 behaviour and risk cache corruptions on some CPUs.
yes. implementing the basic PAT support is pretty trivial. it's dealing with these cache attribute issues that is the hard part.
> For normal memory you would need to find a way to synchronize the
> attributes in all mappers (e.g. setting a flag in struct page or
> similar).
are you referring to generic memory that might have shared mappings between multiple processes? I had really only thought of memory explicitly allocated by a driver and mapped to a process, in which this wouldn't be an issue (or is an issue isolated to the specific driver).
it seems it would be easy enough to add a flag to struct page indicating that any future mappings of this memory must be marked with the given attribute. But you'd need to also worry about previous mappings of that page. wouldn't that require a fairly exhaustive scan of who has the physical memory mapped?
would it make sense to limit this functionality to memory mmapped with MAP_PRIVATE rather than MAP_SHARED?
what if process 1 mapped a region WC, forcing process 2 to later map it the same way even though process 2 doesn't care. then process 1 exits and process 3 decides to map the memory. does the caching attribute remain sticky with process 2 (causing process 3 to also need the memory WC), or revert to cached/whatever when the requestor's mapping is removed?
what if 2 processes ask for conflicting mappings? process 1 wants the framebuffer mapped WC, but process 2 asks for it cacheable. or process 1 maps 1/2 of the framebuffer WC and process 2 asks for the full framebuffer uncached.
a lot of these are corner cases that are unlikely to be desirable, but probably should be protected against.
> For frame buffer you also need to handle it in all mmap'ers
> (like fbcon or /dev/mem). I think handling these generic cases will
> need a few VM changes.
yes, this was the case I was more worried about, but it looks like the case above will have the same issues.
I don't think there's any way currently to determine if anyone already has a mapping to a given address range. And it seems that scanning for pre-existing mappings would be pretty ugly. are there any other suggestions for how to handle this?
Thanks,
Terence
Thanks Mikael,
I was unaware of the errata, I'll check into that.
Terence
On Tue, May 20, 2003 at 09:30:35PM +0200, [email protected] wrote:
> Terence Ripperda <[email protected]>, <[email protected]> writes:
> > Hello all,
> >
> > I've discussed adding Page Attribute Table (PAT) support to the kernel w/ a few developers offline. They were very supportive and suggested I bring the discussion to lkml so others could get involved.
>
> Not that I disagree with utilising the PAT, but I don't see anything in this code to
> deal with the widespread PAT indexing erratum in Intel's processors. I don't have
> the errata sheets here, but it definitely affected the PIIIs and I think also some P4s.
> (Large pages ignoring PAT index bit 2, or something like that.)
>
> /Mikael
On Tue, May 20, 2003 at 10:18:55PM +0200, Terence Ripperda wrote:
> > For normal memory you would need to find a way to synchronize the
> > attributes in all mappers (e.g. setting a flag in struct page or
> > similar).
>
> are you referring to generic memory that might have shared mappings between multiple processes? I had really only thought of memory explicitly allocated by a driver and mapped to a process, in which this wouldn't be an issue (or is an issue isolated to the specific driver).
Yes.
>
> it seems it would be easy enough to add a flag to struct page indicating that any future mappings of this memory must be marked with the given attribute. But you'd need to also worry about previous mappings of that page. wouldn't that require a fairly exhaustive scan of who has the physical memory mapped?
Not in 2.5 - it has the new RMAP vm with backlinks from struct page to ptes.
I already used that in a new machine check handler that has similar requirements.
>
> would it make sense to limit this functionality to memory mmapped with MAP_PRIVATE rather than MAP_SHARED?
That would be a bit ugly, but possible if there is no better way.
>
> what if process 1 mapped a region WC, forcing process 2 to later map it the same way even though process 2 doesn't care. then process 1 exits and process 3 decides to map the memory. does the caching attribute remain sticky with process 2 (causing process 3 to also need the memory WC), or revert to cached/whatever when the requestor's mapping is removed?
You have to walk the rmap chains and change the ptes, or return -EINVAL.
(e.g. you could define the API so that only a transition from writeback to
another type, or the reversal, is allowed)
The locking of this is a bit tricky however.
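One way to picture that rule, as a user-space sketch only: keep a cache type and a mapper count per page, allow a change only between write-back and one other type, and fail any conflicting request with -EINVAL. The struct and function names are invented for illustration. A real version would keep this state in struct page, walk the rmap chains to rewrite existing PTEs, and need locking, none of which is shown. Reverting to write-back when the last non-WB mapper goes away is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum cache_type { CT_WB = 0, CT_WC, CT_UC };

/* Hypothetical per-page state; stands in for a field in struct page. */
struct page_cache_state {
    enum cache_type type;   /* attribute shared by all mappings */
    unsigned int mappers;   /* mappings that asked for a non-WB type */
};

/*
 * Request cache type `want` for a new mapping of the page.
 * Returns 0 on success or -EINVAL if it conflicts with the
 * attribute the page currently has.
 */
int page_request_type(struct page_cache_state *pg, enum cache_type want)
{
    assert(pg != NULL);

    if (want == CT_WB)
        return pg->type == CT_WB ? 0 : -EINVAL;

    if (pg->type == CT_WB) {
        /* WB -> other: existing PTEs would be rewritten here. */
        pg->type = want;
        pg->mappers = 1;
        return 0;
    }

    if (pg->type != want)
        return -EINVAL;     /* e.g. WC already set, UC requested */

    pg->mappers++;
    return 0;
}

/* Drop one non-WB mapping; the page reverts to WB with the last one. */
void page_release_type(struct page_cache_state *pg)
{
    assert(pg != NULL);

    if (pg->mappers > 0 && --pg->mappers == 0)
        pg->type = CT_WB;   /* other -> WB: the reversal */
}
```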
>
> what if 2 processes ask for conflicting mappings? process 1 wants the framebuffer mapped WC, but process 2 asks for it cacheable. or process 1 maps 1/2 of the framebuffer WC and process 2 asks for the full framebuffer uncached.
One of them has to lose. Or use the EINVAL method above.
>
> a lot of these are corner cases that are unlikely to be desirable, but probably should be protected against.
Yes, definitely.
> > For frame buffer you also need to handle it in all mmap'ers
> > (like fbcon or /dev/mem). I think handling these generic cases will
> > need a few VM changes.
>
> yes, this was the case I was more worried about, but it looks like the case above will have the same issues.
The problem is that the frame buffer and the agp aperture normally have no struct page,
so you need to find a different way to store the shared state for them (e.g. a new
rbtree)
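A rough user-space sketch of such shared state: a small registry of physical address ranges and the cache type each was first mapped with, where a later request that overlaps a range recorded with a different type is refused. It uses a fixed array and a linear scan to stay short; the rbtree mentioned above would replace that in real code. All names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum cache_type { CT_WB = 0, CT_WC, CT_UC };

struct attr_range {
    uint64_t start, end;        /* physical range [start, end) */
    enum cache_type type;       /* type it was mapped with */
};

#define MAX_RANGES 16

struct attr_registry {
    struct attr_range r[MAX_RANGES];
    unsigned int count;
};

/*
 * Record that [start, end) is being mapped with `type`.
 * Returns 0 on success, -EINVAL if the range overlaps one that was
 * recorded with a different type, -ENOMEM if the table is full.
 */
int attr_reserve(struct attr_registry *reg, uint64_t start, uint64_t end,
                 enum cache_type type)
{
    unsigned int i;

    assert(reg != NULL && start < end);

    for (i = 0; i < reg->count; i++) {
        const struct attr_range *cur = &reg->r[i];

        if (start < cur->end && cur->start < end && cur->type != type)
            return -EINVAL;     /* conflicting alias: requester loses */
    }

    if (reg->count == MAX_RANGES)
        return -ENOMEM;

    reg->r[reg->count].start = start;
    reg->r[reg->count].end = end;
    reg->r[reg->count].type = type;
    reg->count++;
    return 0;
}
```

With this, the case of one process mapping half of the framebuffer WC and another asking for the whole framebuffer uncached ends with the second request getting -EINVAL.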
>
> I don't think there's any way currently to determine if anyone already has a mapping to a given address range. And it seems that scanning for pre-existing mappings would be pretty ugly. are there any other suggestions for how to handle this?
In 2.4 there isn't, unless you run a RMAP kernel.
In 2.5 it's easy.
-Andi
[email protected] writes:
> (Large pages ignoring PAT index bit 2, or something like that.)
change_page_attr will force 4K pages for these anyways, so for the kernel
direct mapping it should not be an issue.
For the hugetlbfs user mapping you may need to check the case, but
it's probably reasonable to EINVAL there.
Other than that everything should be 4K mapped.
-Andi
Andi Kleen writes:
> [email protected] writes:
>
> > (Large pages ignoring PAT index bit 2, or something like that.)
>
> change_page_attr will force 4K pages for these anyways, so for the kernel
> direct mapping it should not be an issue.
>
> For the hugetlbfs user mapping you may need to check the case, but
> it's probably reasonable to EINVAL there.
>
> Other than that everything should be 4K mapped.
The bug is that 4K pages get the wrong PAT index; the large page
thing is the trigger but the large pages themselves aren't affected.
So 4K pages need to be restricted to the low 4 PAT types.
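If 4K pages have to stay within the low four PAT entries, the PAT bit (bit 7) of a 4K PTE must never be set, and any type that is needed, such as WC, has to live in slots 0-3 of the PAT MSR. A minimal user-space sketch of such a check follows; the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Cache-control flag bits of a 4K x86 page table entry. */
#define PTE_PWT (1u << 3)
#define PTE_PCD (1u << 4)
#define PTE_PAT (1u << 7)

#define PAT_TYPE_WC 1u  /* WC encoding in the IA32_PAT MSR */

/* Reject 4K cache flags that would select PAT slots 4..7. */
int check_4k_cache_flags(uint32_t flags)
{
    return (flags & PTE_PAT) ? -EINVAL : 0;
}

/* Find a slot in 0..3 of pat_msr that holds `type`; -1 if none does. */
int pat_find_low_slot(uint64_t pat_msr, unsigned type)
{
    int slot;

    for (slot = 0; slot < 4; slot++)
        if (((pat_msr >> (slot * 8)) & 0x7) == type)
            return slot;
    return -1;
}

/* PTE flags (PWT/PCD only, never PAT) that select a low slot. */
uint32_t low_slot_to_flags(int slot)
{
    assert(slot >= 0 && slot < 4);
    return ((slot & 1) ? PTE_PWT : 0) | ((slot & 2) ? PTE_PCD : 0);
}
```

A caller would look up the slot for the type it wants with pat_find_low_slot(), turn it into flags with low_slot_to_flags(), and refuse the mapping if no low slot carries that type.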
On Wed, 21 May 2003 12:12:43 +0200
[email protected] wrote:
> Andi Kleen writes:
> > [email protected] writes:
> >
> > > (Large pages ignoring PAT index bit 2, or something like that.)
> >
> > change_page_attr will force 4K pages for these anyways, so for the kernel
> > direct mapping it should not be an issue.
> >
> > For the hugetlbfs user mapping you may need to check the case, but
> > it's probably reasonable to EINVAL there.
> >
> > Other than that everything should be 4K mapped.
>
> The bug is that 4K pages get the wrong PAT index; the large page
> thing is the trigger but the large pages themselves aren't affected.
>
> So 4K pages need to be restricted to the low 4 PAT types.
Should be no issue. cache disabled and write combining seem to be the
only really useful caching types anyways. You don't even need to mess
with the PAT registers for them, the default 486/586 compatible WC and CD PTE
bits should work.
-Andi
Thanks for the tips Andi,
The rmap lookups sound like a good route to go. I'll work on that and post another patch when I have something working.
And I agree that a "first come, first served" approach that fails any conflicting mapping attempts is a good route to go.
Terence