2001-03-31 04:15:16

by James Simmons

Subject: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]


>The console driver does not actually use 2.5MB. Does it make sense to
>use an MTRR for the smaller power-of-two region?

If we implement a font cache in the future it could. Also, that extra
memory is used to allow scrollback. We could break the region up into
power-of-two pieces so that a*2^n + b*2^(n-1) + c*2^(n-2) + ... = 2.5 MB.
Isn't math grand :-)
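
The decomposition above can be sketched as follows. This is a minimal illustration, assuming only that each region must be a power of two in size and aligned to its own size; the function name `split_pow2` and the base address `FB_BASE` are hypothetical and not taken from any driver.

```python
def split_pow2(base, size):
    """Split [base, base + size) into power-of-two chunks, each aligned to its own size."""
    chunks = []
    while size > 0:
        # Largest power of two that fits in the remaining size.
        chunk = 1 << (size.bit_length() - 1)
        # Shrink the chunk until the current base is aligned to it.
        while base % chunk:
            chunk >>= 1
        chunks.append((base, chunk))
        base += chunk
        size -= chunk
    return chunks


MB = 1 << 20
FB_BASE = 0xE0000000  # hypothetical framebuffer physical address
regions = split_pow2(FB_BASE, 5 * MB // 2)  # 2.5 MB
for start, length in regions:
    print(f"base=0x{start:08x} size={length // 1024} KB")
```

For 2.5 MB at an aligned base this yields one 2 MB region plus one 512 KB region, i.e. a = 1, b = 0, c = 1 with n = 21, at the cost of two MTRRs instead of one.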

MS: (n) 1. A debilitating and surprisingly widespread affliction that
renders the sufferer barely able to perform the simplest task. 2. A disease.

James Simmons [[email protected]] ____/|
fbdev/console/gfx developer \ o.O|
http://www.linux-fbdev.org =(_)=
http://linuxgfx.sourceforge.net U
http://linuxconsole.sourceforge.net


2001-04-01 15:54:39

by James Simmons

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]


>No, it's the Trident Cyber9525

Sorry. I only have an early driver for the Trident 9750 and 9850. There is a
group working on Trident framebuffers.

MS: (n) 1. A debilitating and surprisingly widespread affliction that
renders the sufferer barely able to perform the simplest task. 2. A disease.

James Simmons [[email protected]] ____/|
fbdev/console/gfx developer \ o.O|
http://www.linux-fbdev.org =(_)=
http://linuxgfx.sourceforge.net U
http://linuxconsole.sourceforge.net

2001-04-01 21:37:37

by Jamie Lokier

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

James Simmons wrote:
> >No, it's the Trident Cyber9525
>
> Sorry. I only have an early driver for the Trident 9750 and 9850. There is a
> group working on Trident framebuffers.

Is it possible that "jump scroll" would provide more performance benefit
than an accelerated driver anyway?

Since you bring up the topic of writing a 9525 driver: it seems to
me rather wasteful that you (collectively, the Linux framebuffer authors),
XFree86 and Berlin are all writing drivers for the same, hugely diverse
class of hardware, to support more or less the same ops on the hardware.

Isn't it possible to pool the development effort of video drivers? Doesn't
X require basically the same set of operations as the kernel? I.e.,
initialise the card and video mode (usually the very complex part); do
some rendering ops (usually fairly simple). Sure, X provides a few more
kinds of rendering op, but that part of the code is usually much simpler
and smaller than the initialisation code.

Sorry if this sounds insulting -- it isn't intended that way. I don't
really know what is involved in writing video drivers. All I am seeing
is an _apparent_ reinventing of a rather complex wheel, when it's hard
enough as it is to keep up with all the different cards.

thanks,
-- Jamie

2001-04-03 03:07:46

by James Simmons

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]


>Is it possible that "jump scroll" would provide more performance benefit
>than an accelerated driver anyway?

I wouldn't rule it out. If someone wants to whip up some code, I would have
no problem testing it to see if it is worth it.
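
As a rough way to reason about the question, here is a toy model of jump scroll. It assumes that "jump scroll" means scrolling several lines with one copy instead of one line per copy; the `Console` class and its `jump` parameter are hypothetical, and the model only counts copy operations rather than measuring real hardware.

```python
from collections import deque


class Console:
    """Toy text console that counts how often the visible area must be copied."""

    def __init__(self, rows, jump=1):
        self.rows = rows
        self.jump = jump        # lines scrolled per copy; 1 = scroll every line
        self.screen = deque()   # visible lines, oldest first
        self.copies = 0         # number of copy-area operations issued

    def write_line(self, text):
        if len(self.screen) == self.rows:
            # Screen is full: scroll up by `jump` lines with a single copy.
            for _ in range(min(self.jump, self.rows)):
                self.screen.popleft()
            self.copies += 1
        self.screen.append(text)


smooth = Console(rows=25, jump=1)
jumpy = Console(rows=25, jump=5)
for i in range(1000):
    smooth.write_line(f"line {i}")
    jumpy.write_line(f"line {i}")
print(smooth.copies, jumpy.copies)
```

In this model the number of full-screen copies drops in proportion to the jump size; whether that beats an accelerated copy on a given card still has to be measured.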

>Since you bring up the topic of writing a 9525 driver: it seems to
>me rather wasteful that you (collectively, the Linux framebuffer authors),
>XFree86 and Berlin are all writing drivers for the same, hugely diverse
>class of hardware, to support more or less the same ops on the hardware.
>
>Isn't it possible to pool the development effort of video drivers? Doesn't
>X require basically the same set of operations as the kernel? I.e.,
>initialise the card and video mode (usually the very complex part); do
>some rendering ops (usually fairly simple). Sure, X provides a few more
>kinds of rendering op, but that part of the code is usually much simpler
>and smaller than the initialisation code.

Well, the goals of each are very different. Fbcon was developed to deal
with the fact that most modern video hardware doesn't support text modes,
only graphical ones. VGA text is slowly going away. Since our goal is to
emulate a text console, we only have to provide the basic support for
that. We need to:

1) Draw basic text -> Glyph operations.

2) scrolling -> hardware panning or a copy area operation.

3) scroll a region of the screen -> copy area operation.

4) Clear the display or region of display -> fillrect

5) Set color palette.

6) Manage a hardware cursor.

7) Manage the current resolution for VC switching or a mode change via
VT_RESIZE or TIOCSWINSZ.
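
To illustrate how items 2 to 4 reduce to two primitives, here is a minimal software sketch. The `SoftFB` class and its method names are hypothetical and work on character cells rather than pixels; it is not the fbcon code, only a model of scrolling as a copy-area followed by a fill.

```python
class SoftFB:
    """Toy framebuffer: a grid of character cells driven by two primitives."""

    def __init__(self, cols, rows):
        self.cols, self.rows = cols, rows
        self.cells = [[" "] * cols for _ in range(rows)]

    def fillrect(self, x, y, w, h, value=" "):
        # Item 4: clear (or fill) a rectangular region.
        for row in range(y, y + h):
            for col in range(x, x + w):
                self.cells[row][col] = value

    def copyarea(self, sx, sy, dx, dy, w, h):
        # Snapshot the source first so overlapping copies are safe.
        block = [self.cells[sy + r][sx:sx + w] for r in range(h)]
        for r in range(h):
            self.cells[dy + r][dx:dx + w] = block[r]

    def putc(self, x, y, ch):
        # Item 1: draw a single glyph (here just a character).
        self.cells[y][x] = ch

    def scroll_up(self, lines=1):
        # Items 2 and 3: scrolling is a copy of everything below the top lines,
        # followed by clearing the region that the copy exposed.
        self.copyarea(0, lines, 0, 0, self.cols, self.rows - lines)
        self.fillrect(0, self.rows - lines, self.cols, lines)

    def text(self):
        return ["".join(row) for row in self.cells]


fb = SoftFB(4, 3)
for y, word in enumerate(["aaaa", "bbbb", "cccc"]):
    for x, ch in enumerate(word):
        fb.putc(x, y, ch)
fb.scroll_up()
print(fb.text())
```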

So fbcon exists out of necessity. By X you mean XFree86, which is really an
OS in itself. Its goal is to do everything itself so it can run everywhere
known to mankind. As for Berlin, I don't know the code so I can't say.
As people are finding out, XFree86 doing everything itself causes
problems. A good example is the classic case of X dying and you having to
reboot the machine. Also, when you exit X to the console under heavy load,
you don't get the text mode back. Right now it's tough luck and
you just reboot your machine. A M$ solution, but people have been doing it
so long they don't mind it. I hope to fix those problems for 2.5.X.
As you can see, I think the OS should handle the transfer from console mode
to text mode and vice versa. As for programming the accel engine to do
graphics in userland: there is nothing wrong with each doing their own
thing. What does matter is that there is a GUI-independent kernel manager of
the graphics engine state. DRI attempts to handle this.

MS: (n) 1. A debilitating and surprisingly widespread affliction that
renders the sufferer barely able to perform the simplest task. 2. A disease.

James Simmons [[email protected]] ____/|
fbdev/console/gfx developer \ o.O|
http://www.linux-fbdev.org =(_)=
http://linuxgfx.sourceforge.net U
http://linuxconsole.sourceforge.net

2001-04-05 03:04:25

by James Simmons

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]


>>> As long as you are copying in real memory. So the PCI bus or the host bridge
>>> implementation may be the actual limit.
>>
>> The CyrixIII sits on the same host bridges as the intel processors
>
>I don't know if it applies to this case but one thing I have seen make
>a noticeable difference is whether or not write-combining is enabled.
>If we have only been enabling MTRRs for Intel, this could account
>for it.

I think what Geert was trying to point out is: does MTRR perform as well
for normal memory to video memory transfers over the bus as it does for
normal memory to normal memory transfers? MTRRs might not be optimized for
these kinds of transfers. I honestly can't say since I haven't tried it. I
brought the MMX book home from work, so I'm going to be experimenting
with it this weekend to find out. I'd really like to compare the MMX
performance to the word-aligned transfers over the bus I have going. I had
a bug in my soft accel code that prevented word alignment. Once I fixed
that bug I saw a tenfold improvement in rendering on the framebuffer.
I'm not kidding about that improvement either :-)

Enabling MTRRs always makes a difference. I'd like to try it with and
without. I will do some benchmarking.

MS: (n) 1. A debilitating and surprisingly widespread affliction that
renders the sufferer barely able to perform the simplest task. 2. A disease.

James Simmons [[email protected]] ____/|
fbdev/console/gfx developer \ o.O|
http://www.linux-fbdev.org =(_)=
http://linuxgfx.sourceforge.net U
http://linuxconsole.sourceforge.net

2001-04-05 12:07:14

by Eric W. Biederman

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

James Simmons <[email protected]> writes:

> >>> As long as you are copying in real memory. So the PCI bus or the host bridge
> >>> implementation may be the actual limit.
> >>
> >> The CyrixIII sits on the same host bridges as the intel processors
> >
> >I don't know if it applies to this case but one thing I have seen make
> >a noticeable difference is whether or not write-combining is enabled.
> >If we have only been enabling MTRRs for Intel, this could account
> >for it.
>
> I think what Geert was trying to point out is: does MTRR perform as well
> for normal memory to video memory transfers over the bus as it does for
> normal memory to normal memory transfers? MTRRs might not be optimized for
> these kinds of transfers. I honestly can't say since I haven't tried it. I
> brought the MMX book home from work, so I'm going to be experimenting
> with it this weekend to find out. I'd really like to compare the MMX
> performance to the word-aligned transfers over the bus I have going. I had
> a bug in my soft accel code that prevented word alignment. Once I fixed
> that bug I saw a tenfold improvement in rendering on the framebuffer.
> I'm not kidding about that improvement either :-)
>
> Enabling MTRRs always makes a difference. I'd like to try it with and
> without. I will do some benchmarking.

While I'm thinking about it what we really should be using is the PAT
extension and not MTRR's. The PAT extension allows you to set the
attributes per page so you don't have the resource contention you do
with MTRR's. I can just imagine the performance challenges right now
if you try to do a multi-head where multi > number of free MTRR's.

What happens when write-combining is active is that close adjacent
writes are batched together. Without write-combining you tend to get
32bit writes on a bus with a word size of 64 or more bits. By the way
does anyone know who didn't implement MTRR's or the equivalent on
alpha so we can shoot them?
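
The batching effect can be sketched with a small counting model. It assumes a single combining buffer covering an aligned 32-byte block; that block size and the function name `bus_transactions` are assumptions chosen for illustration, not figures for any particular CPU.

```python
def bus_transactions(addresses, line_size=None):
    """Count bus transactions for a stream of 4-byte writes.

    line_size=None models uncombined writes: one transaction per write.
    Otherwise consecutive writes that fall in the same aligned block of
    line_size bytes are merged into a single burst.
    """
    if line_size is None:
        return len(addresses)
    count = 0
    current = None  # aligned block currently being filled
    for addr in addresses:
        block = addr // line_size
        if block != current:
            # A write to a different block starts a new burst.
            count += 1
            current = block
    return count


writes = list(range(0, 1024, 4))      # 256 sequential 32-bit writes
uncombined = bus_transactions(writes)
combined = bus_transactions(writes, 32)
print(uncombined, combined)
```

For a sequential fill the model needs one burst per block instead of one transaction per 32-bit write, which is the kind of saving write-combining is meant to provide on a wide bus.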

Eric

2001-04-05 12:14:05

by Geert Uytterhoeven

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

On 5 Apr 2001, Eric W. Biederman wrote:
> 32bit writes on a bus with a word size of 64 or more bits. By the way
> does anyone know who didn't implement MTRR's or the equivalent on
> alpha so we can shoot them?

People never get shot in Open Source projects. Not when they write buggy code,
not when they don't implement some features.

Gr{oetje,eeting}s,

Geert

P.S. Perhaps ESR tends to disagree? ;-)
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2001-04-05 13:19:20

by Maciej W. Rozycki

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

On Thu, 5 Apr 2001, Geert Uytterhoeven wrote:

> > 32bit writes on a bus with a word size of 64 or more bits. By the way
> > does anyone know who didn't implement MTRR's or the equivalent on
> > alpha so we can shoot them?
>
> People never get shot in Open Source projects. Not when they write buggy code,
> not when they don't implement some features.

Was DEC Alpha an Open Source project? ;-)

Memory barriers are more RISC-styled and more flexible anyway (e.g. you
can't run out of them ;-) ), though they require greater care when
writing code. MTRRs are the Intel style of complicating designs. Still
they are probably a reasonable solution to preserve DOS compatibility.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2001-04-05 18:24:34

by Eric W. Biederman

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

"Maciej W. Rozycki" <[email protected]> writes:

> On Thu, 5 Apr 2001, Geert Uytterhoeven wrote:
>
> > > 32bit writes on a bus with a word size of 64 or more bits. By the way
> > > does anyone know who didn't implement MTRR's or the equivalent on
> > > alpha so we can shoot them?
> >
> > People never get shot in Open Source projects. Not when they write buggy code,
> > not when they don't implement some features.
>
> Was DEC Alpha an Open Source project? ;-)
>
> Memory barriers are more RISC-styled and more flexible anyway (e.g. you
> can't run out of them ;-) ), though they require greater care when
> writing code. MTRRs are the Intel style of complicating designs. Still
> they are probably a reasonable solution to preserve DOS compatibility.

The point is on the Alpha all ram is always cached, and i/o space is
completely uncached. You cannot do write-combining for video card
memory. Memory barriers are a separate issue. On the alpha the
natural way to implement it would be in the page table fill code.
Memory barriers are o.k. but they really don't help the case when what
you want to do is read the latest value out of a pci register.

Eric



2001-04-06 10:24:47

by Ivan Kokshaysky

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

On Thu, Apr 05, 2001 at 12:20:22PM -0600, Eric W. Biederman wrote:
> The point is on the Alpha all ram is always cached, and i/o space is
> completely uncached. You cannot do write-combining for video card
> memory.

Incorrect. Alphas have write buffers - 6x32 bytes on ev5 and
4x64 on ev6, IIRC. So alphas do write up to 32 or 64 bytes
in a single pci transaction.

> Memory barriers are a separate issue. On the alpha the
> natural way to implement it would be in the page table fill code.
> Memory barriers are o.k. but they really don't help the case when what
> you want to do is read the latest value out of a pci register.

You don't need memory barrier for that. "Write memory barriers" are
used to ensure correct write order, and "memory barriers" are used
to ensure that all pending reads/writes will complete before next read
or write.

Ivan.

2001-04-06 13:24:54

by Eric W. Biederman

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

Ivan Kokshaysky <[email protected]> writes:

> On Thu, Apr 05, 2001 at 12:20:22PM -0600, Eric W. Biederman wrote:
> > The point is on the Alpha all ram is always cached, and i/o space is
> > completely uncached. You cannot do write-combining for video card
> > memory.
>
> Incorrect. Alphas have write buffers - 6x32 bytes on ev5 and
> 4x64 on ev6, IIRC. So alphas do write up to 32 or 64 bytes
> in a single pci transaction.

Sorry, I was thinking of the current alpha, the ev6. So what I'm saying
doesn't apply to the alpha architecture in general, just its current
specific implementation.

Yes, the ev6 has write buffers, but you can't tell it to just use the
write buffers on an arbitrary area of memory.

> > Memory barriers are a separate issue. On the alpha the
> > natural way to implement it would be in the page table fill code.
> > Memory barriers are o.k. but they really don't help the case when what
> > you want to do is read the latest value out of a pci register.
>
> You don't need memory barrier for that. "Write memory barriers" are
> used to ensure correct write order, and "memory barriers" are used
> to ensure that all pending reads/writes will complete before next read
> or write.

100% Agreed. That is what I was saying. What the ev6 doesn't have
is the ability to say this: I am using this area of the memory address
space in a particular way: don't cache it but do write combining on it.

Theoretically you could use memory barrier instructions for this but
it would require an I/O bus that supported a cache coherency
protocol. At which point the problem moves down to your PCI bus
controller.

I recall that on the ev6 all memory accesses to locations with bit 40 set
always go to IO space, are never cached and are never write buffered.
Accesses to memory locations with bit 40 clear always go to RAM, are
always cached and always write buffered.

With the high I/O bus speeds, unless you are trying to push things to
the absolute limit, you are unlikely to see the IO accesses being the
bottleneck in or out of a PCI device. At which point DMA probably
already compensates for most devices.

IIRC For PCI card IO regions where you need maximum IO speed through
the memory address space (like frame buffers) the ev6 falls down.

I really like the alpha, which is why this galls me so much about the
ev6. I hope they have it fixed for the ev7 or ev8, if those chips
ever actually arrive. But as the ev7 is just supposed to be the ev6
core with an on-chip cache, I don't have much hope.

Eric

2001-04-06 17:31:27

by Maciej W. Rozycki

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

On 5 Apr 2001, Eric W. Biederman wrote:

> The point is on the Alpha all ram is always cached, and i/o space is
> completely uncached. You cannot do write-combining for video card

You don't want to cache fb memory, do you? All you want is write
combining and you achieve it with memory barriers. You write to fb memory
space whatever you need to and write buffers actually deliver data to fb
memory whenever the bus is idle or they get filled up. When you finally
decide you wrote all data and you want to ensure it actually reaches the fb
memory before you perform an operation (say you send a command to fb's
support circuitry) you issue a write memory barrier. Or a memory barrier,
if you want to ensure the data reaches the fb memory ASAP.

In other words, you have write-combining by default and request
write-through explicitly.

> memory. Memory barriers are a separate issue. On the alpha the
> natural way to implement it would be in the page table fill code.

Please forgive me -- I can't see how this is related to write combining.

> Memory barriers are o.k. but they really don't help the case when what
> you want to do is read the latest value out of a pci register.

They do -- you issue an mb and you are sure all pending writes reached
the involved PCI hw.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2001-04-06 17:47:57

by Maciej W. Rozycki

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

On Fri, 6 Apr 2001, Ivan Kokshaysky wrote:

> > Memory barriers are a separate issue. On the alpha the
> > natural way to implement it would be in the page table fill code.
> > Memory barriers are o.k. but they really don't help the case when what
> > you want to do is read the latest value out of a pci register.
>
> You don't need memory barrier for that. "Write memory barriers" are
> used to ensure correct write order, and "memory barriers" are used
> to ensure that all pending reads/writes will complete before next read
> or write.

You do. PCI-space registers are volatile and they may change depending
on what was written (or read) previously. A memory barrier before a PCI
read will ensure you get a value that is relevant to previous code
actions. Without a barrier you may get pretty much anything, depending on
which of the previous writes managed to complete before.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2001-04-06 17:49:09

by Maciej W. Rozycki

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

On 6 Apr 2001, Eric W. Biederman wrote:

> I recall that on the ev6 all memory accesses to locations with bit 40 set
> always go to IO space, are never cached and are never write buffered.

If that is the case then EV6 is seriously flawed. You normally have
non-cached locations buffered (since you don't always need peripheral
device accesses to be posted immediately) and can force a writeback with a
memory barrier. I don't have my 21264 handbook handy, so I can't check
EV6 details at the moment, especially why it is different.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2001-04-06 18:17:26

by Andrea Arcangeli

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

On Fri, Apr 06, 2001 at 07:27:24PM +0200, Maciej W. Rozycki wrote:
> [..] You normally have
> non-cached locations buffered (since you don't always need peripheral
> device accesses to be posted immediately) and can force a writeback with a
> memory barrier. [..]

ev6 works the way you described AFAIK (to flush the write buffer you can use
wmb(); note that wmb() semantics don't require the cpu to really "flush" but
just to keep writes ordered across other mb or wmb, but it's basically the
same from a software point of view, and flushing the write buffer synchronously
obviously provides those semantics). I didn't closely follow the
previous part of the thread so I'm not sure what the issue is.

Andrea

2001-04-06 20:15:22

by Maciej W. Rozycki

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

On Fri, 6 Apr 2001, Andrea Arcangeli wrote:

> ev6 works the way you described AFAIK (to flush the write buffer you can use

Thanks for the clarification -- you made me calm down.

> wmb(); note that wmb() semantics don't require the cpu to really "flush" but
> just to keep writes ordered across other mb or wmb, but it's basically the
> same from a software point of view, and flushing the write buffer synchronously
> obviously provides those semantics). I didn't closely follow the

Of course -- you only want to do mb (and not wmb) if you need to meet
hw's specific timing or you want to perform a read from a volatile
register of a peripheral device.

> previous part of the thread so I'm not sure what the issue is.

Someone complained of Alpha not having Intel-style MTRRs to set write
combining for fb memory...

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2001-04-08 18:15:16

by Ivan Kokshaysky

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

On Fri, Apr 06, 2001 at 07:13:21PM +0200, Maciej W. Rozycki wrote:
> You do. PCI-space registers are volatile and they may change depending
> on what was written (or read) previously. A memory barrier before a PCI
> read will ensure you get a value that is relevant to previous code
> actions. Without a barrier you may get pretty much anything, depending on
> which of the previous writes managed to complete before.

Of course. I meant that if you are reading, for example, some status register
in a loop waiting for "ready bit" set, the memory barrier won't help you
to notice this event any faster. Actually you'll notice that *later*, as
"mb" is expensive.

Well, here is some info on ev6 IO write buffers - they are a bit different
than ev4/ev5 ones.
Merging rules:
 - byte/word stores aren't allowed to merge into a write buffer;
 - different size stores (32- and 64-bit) aren't allowed to merge;
 - addresses must be in ascending order and non-overlapping,
   but not necessarily consecutive.
The I/O register merge window close (i.e. write-buffer flushing) occurs after
 - mb and wmb instructions;
 - an IO-space load instruction (!);
 - 1024 cycles if there were no IO-space stores.
Store requests are sent offchip in program order (!).
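
The rules above can be modelled with a small simulation. This is a sketch under stated assumptions: a single open merge window, a 64-byte aligned buffer (taken from the "4x64 on ev6" figure quoted earlier in the thread), and no model of the 1024-cycle timeout. The function name `count_bursts` and the tuple format for operations are hypothetical.

```python
BLOCK = 64  # assumed merge-buffer size in bytes


def count_bursts(ops):
    """Return how many write bursts leave the CPU for a list of I/O operations.

    ops is a list of tuples: ("store", addr, size), ("load", addr),
    ("mb",) or ("wmb",). The 1024-cycle timeout is not modelled.
    """
    bursts = 0
    window = None  # (block, size, end) of the open merge window, or None

    for op in ops:
        if op[0] == "store":
            _, addr, size = op
            mergeable = (
                window is not None
                and size in (4, 8)              # byte/word stores never merge
                and size == window[1]           # store sizes must match
                and addr // BLOCK == window[0]  # same buffer (assumption)
                and addr >= window[2]           # ascending, non-overlapping
            )
            if mergeable:
                window = (window[0], size, addr + size)
            else:
                if window is not None:
                    bursts += 1                 # close the previous window
                window = (addr // BLOCK, size, addr + size)
        else:
            # mb, wmb and I/O-space loads all close the merge window.
            if window is not None:
                bursts += 1
                window = None
    if window is not None:
        bursts += 1                             # whatever is left drains eventually
    return bursts


fill = [("store", a, 4) for a in range(0, 64, 4)]
print(count_bursts(fill))                       # sixteen 32-bit stores, one burst
```

Inserting a `("wmb",)` after every store in the same sequence turns it into sixteen separate bursts, which is why barriers should be placed only where ordering actually matters.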

All this explains, in particular, why XFree86-4.0 worked on ev6 without
memory barriers of any kind, while it crashed badly on ev4/ev5.

Ivan.

2001-04-09 10:25:24

by Maciej W. Rozycki

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

On Sun, 8 Apr 2001, Ivan Kokshaysky wrote:

> Of course. I meant that if you are reading, for example, some status register
> in a loop waiting for "ready bit" set, the memory barrier won't help you
> to notice this event any faster. Actually you'll notice that *later*, as
> "mb" is expensive.

I think you need an mb here. To force synchronization with other CPUs.
Unless you know you are UP or there is no possibility another CPU may
access the relevant device.

Of course mbs hit performance but it's a trade off for coherency.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2001-04-09 11:41:58

by Ivan Kokshaysky

Subject: Re: [Linux-fbdev-devel] Re: fbcon slowness [was NTP on 2.4.2?]

On Mon, Apr 09, 2001 at 12:02:54PM +0200, Maciej W. Rozycki wrote:
> I think you need an mb here. To force synchronization with other CPUs.
> Unless you know you are UP or there is no possibility another CPU may
> access the relevant device.

Yes - in most cases you need synchronization at a higher level.
For instance, you don't want other CPUs accessing the device while
you are sending command sequences to it.

Ivan.