LinuxLists.cc - simple framebuffer slower by factor of 20, on socfpga (arm) platform

2015-04-07 12:12:52

Subject: simple framebuffer slower by factor of 20, on socfpga (arm) platform

Hi!

I have an socfpga board, which uses has simple framebuffer implemented
in the FPGA. On 3.15, framebuffer is fast:

root@wagabuibui:~# time cat /dev/fb0 > /dev/null
real 0m 0.00s
user 0m 0.00s
sys 0m 0.00s

on 3.18, this takes 220msec. Similar slowdown exists for
writes. Simple framebuffer did not change at all between 3.15 and
3.18; resource flags of the framebuffer are still same (0x200).

If I enable caching on 3.18, it speeds up a bit, to 70msec or
so... Which means problem is not only in caching.

Any ideas?

Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-04-07 12:20:08

by Geert Uytterhoeven

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

Hi Pavel,

On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <[email protected]> wrote:
> I have an socfpga board, which uses has simple framebuffer implemented
> in the FPGA. On 3.15, framebuffer is fast:
>
> root@wagabuibui:~# time cat /dev/fb0 > /dev/null
> real 0m 0.00s
> user 0m 0.00s
> sys 0m 0.00s
>
> on 3.18, this takes 220msec. Similar slowdown exists for
> writes. Simple framebuffer did not change at all between 3.15 and
> 3.18; resource flags of the framebuffer are still same (0x200).
>
> If I enable caching on 3.18, it speeds up a bit, to 70msec or
> so... Which means problem is not only in caching.
>
> Any ideas?

My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer
copying optimisations"), but this was introduced only in v4.0-rc1.

Just in case you encounter another performance regression after upgrading
to a more modern kernel ;-)

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2015-04-07 14:25:18

by Marek Vasut

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On Tuesday, April 07, 2015 at 02:19:33 PM, Geert Uytterhoeven wrote:
> Hi Pavel,
>
> On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <[email protected]> wrote:
> > I have an socfpga board, which uses has simple framebuffer implemented
> > in the FPGA. On 3.15, framebuffer is fast:
> >
> > root@wagabuibui:~# time cat /dev/fb0 > /dev/null
> > real 0m 0.00s
> > user 0m 0.00s
> > sys 0m 0.00s
> >
> > on 3.18, this takes 220msec. Similar slowdown exists for
> > writes. Simple framebuffer did not change at all between 3.15 and
> > 3.18; resource flags of the framebuffer are still same (0x200).
> >
> > If I enable caching on 3.18, it speeds up a bit, to 70msec or
> > so... Which means problem is not only in caching.
> >
> > Any ideas?
>
> My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer
> copying optimisations"), but this was introduced only in v4.0-rc1.
>
> Just in case you encounter another performance regression after upgrading
> to a more modern kernel ;-)

Why don't you use the Altera VIP FB on SoCFPGA ?

Best regards,
Marek Vasut

2015-04-09 11:06:41

by Pavel Machek

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On Tue 2015-04-07 14:19:33, Geert Uytterhoeven wrote:
> Hi Pavel,
>
> On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <[email protected]> wrote:
> > I have an socfpga board, which uses has simple framebuffer implemented
> > in the FPGA. On 3.15, framebuffer is fast:
> >
> > root@wagabuibui:~# time cat /dev/fb0 > /dev/null
> > real 0m 0.00s
> > user 0m 0.00s
> > sys 0m 0.00s
> >
> > on 3.18, this takes 220msec. Similar slowdown exists for
> > writes. Simple framebuffer did not change at all between 3.15 and
> > 3.18; resource flags of the framebuffer are still same (0x200).
> >
> > If I enable caching on 3.18, it speeds up a bit, to 70msec or
> > so... Which means problem is not only in caching.
> >
> > Any ideas?
>
> My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer
> copying optimisations"), but this was introduced only in v4.0-rc1.
>
> Just in case you encounter another performance regression after upgrading
> to a more modern kernel ;-)

:-). I did a git bisect, and it pointed to this. And reverting it
indeed fixes the problem in 3.18. Problem is still there in 4.0.

Archit do you know what is going on there? Should the revert be filled
for 4.0?

Pavel

commit 981409b25e2a99409b26daa67293ca1cfd5ea0a0
Author: Archit Taneja <[email protected]>
Date: Fri Nov 16 14:46:04 2012 +0530

fbdev: arm has __raw I/O accessors, use them in fb.h

This removes the sparse warnings on arm platforms:

warning: cast removes address space of expression

Signed-off-by: Archit Taneja <[email protected]>
Signed-off-by: Tomi Valkeinen <[email protected]>
Cc: H Hartley Sweeten <hsweeten at visionengravers.com>
Cc: Alexander Shiyan <[email protected]>
Cc: Russell King <[email protected]>

> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
> -- Linus Torvalds

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-04-09 11:22:30

by Tomi Valkeinen

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On 09/04/15 14:06, Pavel Machek wrote:
> On Tue 2015-04-07 14:19:33, Geert Uytterhoeven wrote:
>> Hi Pavel,
>>
>> On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <[email protected]> wrote:
>>> I have an socfpga board, which uses has simple framebuffer implemented
>>> in the FPGA. On 3.15, framebuffer is fast:
>>>
>>> root@wagabuibui:~# time cat /dev/fb0 > /dev/null
>>> real 0m 0.00s
>>> user 0m 0.00s
>>> sys 0m 0.00s
>>>
>>> on 3.18, this takes 220msec. Similar slowdown exists for
>>> writes. Simple framebuffer did not change at all between 3.15 and
>>> 3.18; resource flags of the framebuffer are still same (0x200).
>>>
>>> If I enable caching on 3.18, it speeds up a bit, to 70msec or
>>> so... Which means problem is not only in caching.
>>>
>>> Any ideas?
>>
>> My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer
>> copying optimisations"), but this was introduced only in v4.0-rc1.
>>
>> Just in case you encounter another performance regression after upgrading
>> to a more modern kernel ;-)
>
> :-). I did a git bisect, and it pointed to this. And reverting it
> indeed fixes the problem in 3.18. Problem is still there in 4.0.

Interesting. I can reproduce this also on TI's AM437x board, on 3.14
kernel. Without the patch:

# time cat /dev/fb0 > /dev/null
real 0m 0.01s
user 0m 0.00s
sys 0m 0.01s

With the patch:

# time cat /dev/fb0 > /dev/null
real 0m 0.19s
user 0m 0.01s
sys 0m 0.17s

> Archit do you know what is going on there? Should the revert be filled
> for 4.0?

(Cc'ing Archit's new email)

>
> Pavel
>
> commit 981409b25e2a99409b26daa67293ca1cfd5ea0a0
> Author: Archit Taneja <[email protected]>
> Date: Fri Nov 16 14:46:04 2012 +0530
>
> fbdev: arm has __raw I/O accessors, use them in fb.h
>
> This removes the sparse warnings on arm platforms:
>
> warning: cast removes address space of expression
>
> Signed-off-by: Archit Taneja <[email protected]>
> Signed-off-by: Tomi Valkeinen <[email protected]>
> Cc: H Hartley Sweeten <hsweeten at visionengravers.com>
> Cc: Alexander Shiyan <[email protected]>
> Cc: Russell King <[email protected]>

Attachments:

signature.asc (836.00 B)
OpenPGP digital signature

2015-04-09 11:35:09

by Tomi Valkeinen

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On 09/04/15 14:21, Tomi Valkeinen wrote:
> On 09/04/15 14:06, Pavel Machek wrote:
>> On Tue 2015-04-07 14:19:33, Geert Uytterhoeven wrote:
>>> Hi Pavel,
>>>
>>> On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <[email protected]> wrote:
>>>> I have an socfpga board, which uses has simple framebuffer implemented
>>>> in the FPGA. On 3.15, framebuffer is fast:
>>>>
>>>> root@wagabuibui:~# time cat /dev/fb0 > /dev/null
>>>> real 0m 0.00s
>>>> user 0m 0.00s
>>>> sys 0m 0.00s
>>>>
>>>> on 3.18, this takes 220msec. Similar slowdown exists for
>>>> writes. Simple framebuffer did not change at all between 3.15 and
>>>> 3.18; resource flags of the framebuffer are still same (0x200).
>>>>
>>>> If I enable caching on 3.18, it speeds up a bit, to 70msec or
>>>> so... Which means problem is not only in caching.
>>>>
>>>> Any ideas?
>>>
>>> My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer
>>> copying optimisations"), but this was introduced only in v4.0-rc1.
>>>
>>> Just in case you encounter another performance regression after upgrading
>>> to a more modern kernel ;-)
>>
>> :-). I did a git bisect, and it pointed to this. And reverting it
>> indeed fixes the problem in 3.18. Problem is still there in 4.0.

The difference is probably caused by memcpy() vs memcpy_fromio(). The
comment above memcpy_fromio() says "This needs to be optimized". I think
generally speaking memcpy_fromio() is correct for a framebuffer.

That said, if the fb is in RAM, and is only written by the CPU, I think
a normal memcpy() for fb_memcpy_fromfb() should be fine...

Tomi

Attachments:

signature.asc (836.00 B)
OpenPGP digital signature

2015-04-09 19:51:55

by Arnd Bergmann

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On Thursday 09 April 2015 14:34:26 Tomi Valkeinen wrote:
> On 09/04/15 14:21, Tomi Valkeinen wrote:
> > On 09/04/15 14:06, Pavel Machek wrote:
> >> On Tue 2015-04-07 14:19:33, Geert Uytterhoeven wrote:
> >>> Hi Pavel,
> >>>
> >>> On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <[email protected]> wrote:
> >>>> I have an socfpga board, which uses has simple framebuffer implemented
> >>>> in the FPGA. On 3.15, framebuffer is fast:
> >>>>
> >>>> root@wagabuibui:~# time cat /dev/fb0 > /dev/null
> >>>> real 0m 0.00s
> >>>> user 0m 0.00s
> >>>> sys 0m 0.00s
> >>>>
> >>>> on 3.18, this takes 220msec. Similar slowdown exists for
> >>>> writes. Simple framebuffer did not change at all between 3.15 and
> >>>> 3.18; resource flags of the framebuffer are still same (0x200).
> >>>>
> >>>> If I enable caching on 3.18, it speeds up a bit, to 70msec or
> >>>> so... Which means problem is not only in caching.
> >>>>
> >>>> Any ideas?
> >>>
> >>> My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer
> >>> copying optimisations"), but this was introduced only in v4.0-rc1.
> >>>
> >>> Just in case you encounter another performance regression after upgrading
> >>> to a more modern kernel
> >>
> >> :-). I did a git bisect, and it pointed to this. And reverting it
> >> indeed fixes the problem in 3.18. Problem is still there in 4.0.
>
> The difference is probably caused by memcpy() vs memcpy_fromio(). The
> comment above memcpy_fromio() says "This needs to be optimized". I think
> generally speaking memcpy_fromio() is correct for a framebuffer.
>
> That said, if the fb is in RAM, and is only written by the CPU, I think
> a normal memcpy() for fb_memcpy_fromfb() should be fine...

Could memcpy() cause alignment traps here if the fb pointer is unaligned
and uncached?

Arnd

2015-04-10 07:06:05

by Archit Taneja

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On 04/09/2015 05:04 PM, Tomi Valkeinen wrote:
> On 09/04/15 14:21, Tomi Valkeinen wrote:
>> On 09/04/15 14:06, Pavel Machek wrote:
>>> On Tue 2015-04-07 14:19:33, Geert Uytterhoeven wrote:
>>>> Hi Pavel,
>>>>
>>>> On Tue, Apr 7, 2015 at 2:12 PM, Pavel Machek <[email protected]> wrote:
>>>>> I have an socfpga board, which uses has simple framebuffer implemented
>>>>> in the FPGA. On 3.15, framebuffer is fast:
>>>>>
>>>>> root@wagabuibui:~# time cat /dev/fb0 > /dev/null
>>>>> real 0m 0.00s
>>>>> user 0m 0.00s
>>>>> sys 0m 0.00s
>>>>>
>>>>> on 3.18, this takes 220msec. Similar slowdown exists for
>>>>> writes. Simple framebuffer did not change at all between 3.15 and
>>>>> 3.18; resource flags of the framebuffer are still same (0x200).
>>>>>
>>>>> If I enable caching on 3.18, it speeds up a bit, to 70msec or
>>>>> so... Which means problem is not only in caching.
>>>>>
>>>>> Any ideas?
>>>>
>>>> My first guess was commit 67dc0d4758e5 ("vt_buffer: drop console buffer
>>>> copying optimisations"), but this was introduced only in v4.0-rc1.
>>>>
>>>> Just in case you encounter another performance regression after upgrading
>>>> to a more modern kernel ;-)
>>>
>>> :-). I did a git bisect, and it pointed to this. And reverting it
>>> indeed fixes the problem in 3.18. Problem is still there in 4.0.
>
> The difference is probably caused by memcpy() vs memcpy_fromio(). The
> comment above memcpy_fromio() says "This needs to be optimized". I think
> generally speaking memcpy_fromio() is correct for a framebuffer.
>
> That said, if the fb is in RAM, and is only written by the CPU, I think
> a normal memcpy() for fb_memcpy_fromfb() should be fine...

I didn't test for performance regressions when I posted this patch.

A look at _memcpy_fromio in arch/arm/kernel/io.c shows that readb() is
used all the time, even when the source and destination addresses are
aligned for larger reads to be possible. Other archs seem to use readl()
or readq() when they can. Maybe that makes memcpy_fromio slower than the
implementation of memcpy on arm?

Thanks,
Archit

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2015-04-24 13:29:31

by Pavel Machek

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

Hi!

On Fri 2015-04-10 12:35:52, Archit Taneja wrote:
> >That said, if the fb is in RAM, and is only written by the CPU, I think
> >a normal memcpy() for fb_memcpy_fromfb() should be fine...
>
> I didn't test for performance regressions when I posted this patch.
>
> A look at _memcpy_fromio in arch/arm/kernel/io.c shows that readb() is used
> all the time, even when the source and destination addresses are aligned for
> larger reads to be possible. Other archs seem to use readl() or readq() when
> they can. Maybe that makes memcpy_fromio slower than the implementation of
> memcpy on arm?

Ok, can you prepare a patch for me to try? Or should we just revert
the original commit?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-04-24 13:31:33

by Pavel Machek

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

> > The difference is probably caused by memcpy() vs memcpy_fromio(). The
> > comment above memcpy_fromio() says "This needs to be optimized". I think
> > generally speaking memcpy_fromio() is correct for a framebuffer.
> >
> > That said, if the fb is in RAM, and is only written by the CPU, I think
> > a normal memcpy() for fb_memcpy_fromfb() should be fine...
>
> Could memcpy() cause alignment traps here if the fb pointer is unaligned
> and uncached?

Original commit did not comment on any failure, so I expect that is
not a problem here...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-04-24 13:41:25

by Tomi Valkeinen

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On 24/04/15 16:29, Pavel Machek wrote:
> Hi!
>
> On Fri 2015-04-10 12:35:52, Archit Taneja wrote:
>>> That said, if the fb is in RAM, and is only written by the CPU, I think
>>> a normal memcpy() for fb_memcpy_fromfb() should be fine...
>>
>> I didn't test for performance regressions when I posted this patch.
>>
>> A look at _memcpy_fromio in arch/arm/kernel/io.c shows that readb() is used
>> all the time, even when the source and destination addresses are aligned for
>> larger reads to be possible. Other archs seem to use readl() or readq() when
>> they can. Maybe that makes memcpy_fromio slower than the implementation of
>> memcpy on arm?
>
> Ok, can you prepare a patch for me to try? Or should we just revert
> the original commit?

The old way worked fine, afaik, so maybe we can revert. But still, isn't
it more correct to use memcpy_fromio? It's (possibly) io memory we have
here.

Tomi

Attachments:

signature.asc (836.00 B)
OpenPGP digital signature

2015-04-24 13:46:59

by Geert Uytterhoeven

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On Fri, Apr 24, 2015 at 3:40 PM, Tomi Valkeinen <[email protected]> wrote:
> On 24/04/15 16:29, Pavel Machek wrote:
>> On Fri 2015-04-10 12:35:52, Archit Taneja wrote:
>>>> That said, if the fb is in RAM, and is only written by the CPU, I think
>>>> a normal memcpy() for fb_memcpy_fromfb() should be fine...
>>>
>>> I didn't test for performance regressions when I posted this patch.
>>>
>>> A look at _memcpy_fromio in arch/arm/kernel/io.c shows that readb() is used
>>> all the time, even when the source and destination addresses are aligned for
>>> larger reads to be possible. Other archs seem to use readl() or readq() when
>>> they can. Maybe that makes memcpy_fromio slower than the implementation of
>>> memcpy on arm?
>>
>> Ok, can you prepare a patch for me to try? Or should we just revert
>> the original commit?
>
> The old way worked fine, afaik, so maybe we can revert. But still, isn't
> it more correct to use memcpy_fromio? It's (possibly) io memory we have
> here.

Yes it is.

So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io().
That will benefit other drivers on ARM, too.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2015-04-26 19:31:51

by Pavel Machek

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On Fri 2015-04-24 15:46:56, Geert Uytterhoeven wrote:
> On Fri, Apr 24, 2015 at 3:40 PM, Tomi Valkeinen <[email protected]> wrote:
> > On 24/04/15 16:29, Pavel Machek wrote:
> >> On Fri 2015-04-10 12:35:52, Archit Taneja wrote:
> >>>> That said, if the fb is in RAM, and is only written by the CPU, I think
> >>>> a normal memcpy() for fb_memcpy_fromfb() should be fine...
> >>>
> >>> I didn't test for performance regressions when I posted this patch.
> >>>
> >>> A look at _memcpy_fromio in arch/arm/kernel/io.c shows that readb() is used
> >>> all the time, even when the source and destination addresses are aligned for
> >>> larger reads to be possible. Other archs seem to use readl() or readq() when
> >>> they can. Maybe that makes memcpy_fromio slower than the implementation of
> >>> memcpy on arm?
> >>
> >> Ok, can you prepare a patch for me to try? Or should we just revert
> >> the original commit?
> >
> > The old way worked fine, afaik, so maybe we can revert. But still, isn't
> > it more correct to use memcpy_fromio? It's (possibly) io memory we have
> > here.
>
> Yes it is.
>
> So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io().
> That will benefit other drivers on ARM, too.

No, sorry. Yes, I could "optimize" memcpy_toio... just by sticking
memcpy there, as for example asm-generic/io.h suggests.

Maybe it would break something. Maybe not, but potential for that
clearly is there... since this is very seldom used function. Or do you
know drivers that would benefit from this?

void _memcpy_toio(volatile void __iomem *to, const void *from, size_t count)
{
const unsigned char *f = from;
while (count) {
count--;
writeb(*f, to);
f++;
to++;
}
}

We have a regression, we have a patch that causes the
regression. Right fix at this point is to revert a "cleanup" that
causes this, not try to "optimize" otherwise unused piece of code.

commit 981409b25e2a99409b26daa67293ca1cfd5ea0a0
Author: Archit Taneja <[email protected]>
Date: Fri Nov 16 14:46:04 2012 +0530

fbdev: arm has __raw I/O accessors, use them in fb.h

This removes the sparse warnings on arm platforms:

warning: cast removes address space of expression

Signed-off-by: Archit Taneja <[email protected]>
Signed-off-by: Tomi Valkeinen <[email protected]>
Cc: H Hartley Sweeten <hsweeten at visionengravers.com>
Cc: Alexander Shiyan <[email protected]>
Cc: Russell King <[email protected]>

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-04-28 13:49:17

by Russell King - ARM Linux

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On Fri, Apr 24, 2015 at 03:46:56PM +0200, Geert Uytterhoeven wrote:
> On Fri, Apr 24, 2015 at 3:40 PM, Tomi Valkeinen <[email protected]> wrote:
> > On 24/04/15 16:29, Pavel Machek wrote:
> >> On Fri 2015-04-10 12:35:52, Archit Taneja wrote:
> >>>> That said, if the fb is in RAM, and is only written by the CPU, I think
> >>>> a normal memcpy() for fb_memcpy_fromfb() should be fine...
> >>>
> >>> I didn't test for performance regressions when I posted this patch.
> >>>
> >>> A look at _memcpy_fromio in arch/arm/kernel/io.c shows that readb() is used
> >>> all the time, even when the source and destination addresses are aligned for
> >>> larger reads to be possible. Other archs seem to use readl() or readq() when
> >>> they can. Maybe that makes memcpy_fromio slower than the implementation of
> >>> memcpy on arm?
> >>
> >> Ok, can you prepare a patch for me to try? Or should we just revert
> >> the original commit?
> >
> > The old way worked fine, afaik, so maybe we can revert. But still, isn't
> > it more correct to use memcpy_fromio? It's (possibly) io memory we have
> > here.
>
> Yes it is.
>
> So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io().
> That will benefit other drivers on ARM, too.

That's not going to happen.

I've had a patch which does that, but people are concerned that it changes
the behaviour of the functions by changing the access size, which could
cause regressions. It seems people are far too worried about that to even
consider trying. :(

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

2015-04-28 15:28:59

by Nicolas Pitre

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On Tue, 28 Apr 2015, Russell King - ARM Linux wrote:

> On Fri, Apr 24, 2015 at 03:46:56PM +0200, Geert Uytterhoeven wrote:
> > So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io().
> > That will benefit other drivers on ARM, too.
>
> That's not going to happen.
>
> I've had a patch which does that, but people are concerned that it changes
> the behaviour of the functions by changing the access size, which could
> cause regressions. It seems people are far too worried about that to even
> consider trying. :(

What about making the optimized implementation available via kconfig?

Nicolas

2015-05-06 10:45:35

by Russell King - ARM Linux

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On Tue, Apr 28, 2015 at 11:28:53AM -0400, Nicolas Pitre wrote:
> On Tue, 28 Apr 2015, Russell King - ARM Linux wrote:
>
> > On Fri, Apr 24, 2015 at 03:46:56PM +0200, Geert Uytterhoeven wrote:
> > > So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io().
> > > That will benefit other drivers on ARM, too.
> >
> > That's not going to happen.
> >
> > I've had a patch which does that, but people are concerned that it changes
> > the behaviour of the functions by changing the access size, which could
> > cause regressions. It seems people are far too worried about that to even
> > consider trying. :(
>
> What about making the optimized implementation available via kconfig?

I'd prefer not to. My personal feeling is to put the patch in and just be
done with it - these functions are supposed to be used on IO areas which
don't care about access size (in other words, are memory-like rather than
being register-like.) Here's the rather old patch:

From: Russell King <[email protected]>
Subject: [PATCH] ARM: optimize memset_io()/memcpy_fromio()/memcpy_toio()

If we are building for a LE platform, and we haven't overriden the
MMIO ops, then we can optimize the mem*io operations using the
standard string functions.

Signed-off-by: Russell King <[email protected]>
---
arch/arm/include/asm/io.h | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/arch/arm/include/asm/io.h b/arch/arm/include/asm/io.h
index d070741b2b37..358c8206419b 100644
--- a/arch/arm/include/asm/io.h
+++ b/arch/arm/include/asm/io.h
@@ -23,6 +23,7 @@

#ifdef __KERNEL__

+#include <linux/string.h>
#include <linux/types.h>
#include <asm/byteorder.h>
#include <asm/memory.h>
@@ -312,9 +313,33 @@ extern void _memset_io(volatile void __iomem *, int, size_t);
#define writesw(p,d,l) __raw_writesw(p,d,l)
#define writesl(p,d,l) __raw_writesl(p,d,l)

+#ifndef __ARMBE__
+static inline void memset_io(volatile void __iomem *dst, unsigned c,
+ size_t count)
+{
+ memset((void __force *)dst, c, count);
+}
+#define memset_io(dst,c,count) memset_io(dst,c,count)
+
+static inline void memcpy_fromio(void *to, const volatile void __iomem *from,
+ size_t count)
+{
+ memcpy(to, (const void __force *)from, count);
+}
+#define memcpy_fromio(to,from,count) memcpy_fromio(to,from,count)
+
+static inline void memcpy_toio(volatile void __iomem *to, const void *from,
+ size_t count)
+{
+ memcpy((void __force *)to, from, count);
+}
+#define memcpy_toio(to,from,count) memcpy_toio(to,from,count)
+
+#else
#define memset_io(c,v,l) _memset_io(c,(v),(l))
#define memcpy_fromio(a,c,l) _memcpy_fromio((a),c,(l))
#define memcpy_toio(c,a,l) _memcpy_toio(c,(a),(l))
+#endif

#endif /* readl */

--
1.8.3.1

--
FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
according to speedtest.net.

2015-05-06 20:32:34

by Nicolas Pitre

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On Wed, 6 May 2015, Russell King - ARM Linux wrote:

> On Tue, Apr 28, 2015 at 11:28:53AM -0400, Nicolas Pitre wrote:
> > On Tue, 28 Apr 2015, Russell King - ARM Linux wrote:
> >
> > > On Fri, Apr 24, 2015 at 03:46:56PM +0200, Geert Uytterhoeven wrote:
> > > > So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io().
> > > > That will benefit other drivers on ARM, too.
> > >
> > > That's not going to happen.
> > >
> > > I've had a patch which does that, but people are concerned that it changes
> > > the behaviour of the functions by changing the access size, which could
> > > cause regressions. It seems people are far too worried about that to even
> > > consider trying. :(
> >
> > What about making the optimized implementation available via kconfig?
>
> I'd prefer not to. My personal feeling is to put the patch in and just be
> done with it - these functions are supposed to be used on IO areas which
> don't care about access size (in other words, are memory-like rather than
> being register-like.) Here's the rather old patch:
>
> From: Russell King <[email protected]>
> Subject: [PATCH] ARM: optimize memset_io()/memcpy_fromio()/memcpy_toio()
>
> If we are building for a LE platform, and we haven't overriden the
> MMIO ops, then we can optimize the mem*io operations using the
> standard string functions.
>
> Signed-off-by: Russell King <[email protected]>

Acked-by: Nicolas Pitre <[email protected]>

> ---
> arch/arm/include/asm/io.h | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/arch/arm/include/asm/io.h b/arch/arm/include/asm/io.h
> index d070741b2b37..358c8206419b 100644
> --- a/arch/arm/include/asm/io.h
> +++ b/arch/arm/include/asm/io.h
> @@ -23,6 +23,7 @@
>
> #ifdef __KERNEL__
>
> +#include <linux/string.h>
> #include <linux/types.h>
> #include <asm/byteorder.h>
> #include <asm/memory.h>
> @@ -312,9 +313,33 @@ extern void _memset_io(volatile void __iomem *, int, size_t);
> #define writesw(p,d,l) __raw_writesw(p,d,l)
> #define writesl(p,d,l) __raw_writesl(p,d,l)
>
> +#ifndef __ARMBE__
> +static inline void memset_io(volatile void __iomem *dst, unsigned c,
> + size_t count)
> +{
> + memset((void __force *)dst, c, count);
> +}
> +#define memset_io(dst,c,count) memset_io(dst,c,count)
> +
> +static inline void memcpy_fromio(void *to, const volatile void __iomem *from,
> + size_t count)
> +{
> + memcpy(to, (const void __force *)from, count);
> +}
> +#define memcpy_fromio(to,from,count) memcpy_fromio(to,from,count)
> +
> +static inline void memcpy_toio(volatile void __iomem *to, const void *from,
> + size_t count)
> +{
> + memcpy((void __force *)to, from, count);
> +}
> +#define memcpy_toio(to,from,count) memcpy_toio(to,from,count)
> +
> +#else
> #define memset_io(c,v,l) _memset_io(c,(v),(l))
> #define memcpy_fromio(a,c,l) _memcpy_fromio((a),c,(l))
> #define memcpy_toio(c,a,l) _memcpy_toio(c,(a),(l))
> +#endif
>
> #endif /* readl */
>
> --
> 1.8.3.1
>
>
>
> --
> FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up
> according to speedtest.net.
>
>

2015-05-12 08:52:59

by Pavel Machek

[permalink] [raw]

Subject: Re: simple framebuffer slower by factor of 20, on socfpga (arm) platform

On Wed 2015-05-06 11:45:04, Russell King - ARM Linux wrote:
> On Tue, Apr 28, 2015 at 11:28:53AM -0400, Nicolas Pitre wrote:
> > On Tue, 28 Apr 2015, Russell King - ARM Linux wrote:
> >
> > > On Fri, Apr 24, 2015 at 03:46:56PM +0200, Geert Uytterhoeven wrote:
> > > > So please optimize ARM's _memcpy_fromio(), _memcpy_toio(), and _memset_io().
> > > > That will benefit other drivers on ARM, too.
> > >
> > > That's not going to happen.
> > >
> > > I've had a patch which does that, but people are concerned that it changes
> > > the behaviour of the functions by changing the access size, which could
> > > cause regressions. It seems people are far too worried about that to even
> > > consider trying. :(
> >
> > What about making the optimized implementation available via kconfig?
>
> I'd prefer not to. My personal feeling is to put the patch in and just be
> done with it - these functions are supposed to be used on IO areas which
> don't care about access size (in other words, are memory-like rather than
> being register-like.) Here's the rather old patch:
>
> From: Russell King <[email protected]>
> Subject: [PATCH] ARM: optimize memset_io()/memcpy_fromio()/memcpy_toio()
>
> If we are building for a LE platform, and we haven't overriden the
> MMIO ops, then we can optimize the mem*io operations using the
> standard string functions.
>
> Signed-off-by: Russell King <[email protected]>

Tested-by: Pavel Machek <[email protected]>
Acked-by: Pavel Machek <[email protected]>

Works for me, framebuffer performance is back in "too fast to measure"
range.

When this is merged, should 981409b25e2a99409b26daa67293ca1cfd5ea0a0
be reverted in -stable?

Thanks,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html