2008-11-16 19:37:23

by Rafael J. Wysocki

Subject: 2.6.28-rc5: Reported regressions 2.6.26 -> 2.6.27

[NOTE:
I closed a number of Bugzilla entries dedicated to regressions introduced
between 2.6.26 and 2.6.27 that appeared to me to have been fixed, or where
the reporters had been totally unresponsive for extended periods of time
(given that they are notified every week ...).]

This message contains a list of some regressions introduced between 2.6.26 and
2.6.27 for which there are no fixes in the mainline that I know of. If any of
them have been fixed already, please let me know.

If you know of any other unresolved regressions introduced between 2.6.26
and 2.6.27, please let me know as well and I'll add them to the list.
Also, please let me know if any of the entries below are invalid.

Each entry from the list will be sent additionally in an automatic reply to
this message with CCs to the people involved in reporting and handling the
issue.


Listed regressions statistics:

Date        Total  Pending  Unresolved
--------------------------------------
2008-11-16    199       18          14
2008-11-09    196       28          23
2008-11-02    195       34          28
2008-10-26    190       34          29
2008-10-04    181       41          33
2008-09-27    173       35          28
2008-09-21    169       45          36
2008-09-15    163       46          32
2008-09-12    163       51          38
2008-09-07    150       43          33
2008-08-30    135       48          36
2008-08-23    122       48          40
2008-08-16    103       47          37
2008-08-10     80       52          31
2008-08-02     47       31          20


Unresolved regressions
----------------------

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=12048
Subject : Regression in bonding between 2.6.26.8 and 2.6.27.6
Submitter : Jesper Krogh <[email protected]>
Date : 2008-11-16 9:41 (1 day old)
References : http://marc.info/?l=linux-kernel&m=122682977001048&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=12039
Subject : Regression: USB/DVB 2.6.26.8 --> 2.6.27.6
Submitter : David <[email protected]>
Date : 2008-11-14 20:20 (3 days old)
References : http://marc.info/?l=linux-kernel&m=122669568022274&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11983
Subject : iwlagn: wrong command queue 31, command id 0x0
Submitter : Matt Mackall <[email protected]>
Date : 2008-11-06 4:16 (11 days old)
References : http://marc.info/?l=linux-kernel&m=122598672815803&w=4
http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=1703
Handled-By : reinette chatre <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11886
Subject : without serial console system doesn't poweroff
Submitter : Daniel Smolik <[email protected]>
Date : 2008-10-29 04:06 (19 days old)


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11876
Subject : RCU hang on cpu re-hotplug with 2.6.27rc8
Submitter : Andi Kleen <[email protected]>
Date : 2008-10-06 23:28 (42 days old)
References : http://marc.info/?l=linux-kernel&m=122333610602399&w=2
Handled-By : Paul E. McKenney <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11836
Subject : Scheduler on C2D CPU and latest 2.6.27 kernel
Submitter : Zdenek Kabelac <[email protected]>
Date : 2008-10-21 9:59 (27 days old)
References : http://marc.info/?l=linux-kernel&m=122458320502371&w=4
Handled-By : Chris Snook <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11698
Subject : 2.6.27-rc7, freezes with > 1 s2ram cycle
Submitter : Soeren Sonnenburg <[email protected]>
Date : 2008-09-29 11:29 (49 days old)
References : http://marc.info/?l=linux-kernel&m=122268780926859&w=4
Handled-By : Rafael J. Wysocki <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11664
Subject : acpi errors and random freeze on sony vaio sr
Submitter : Giovanni Pellerano <[email protected]>
Date : 2008-09-28 03:48 (50 days old)


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11569
Subject : Panic stop CPUs regression
Submitter : Andi Kleen <[email protected]>
Date : 2008-09-02 13:49 (76 days old)
References : http://marc.info/?l=linux-kernel&m=122036356127282&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11543
Subject : kernel panic: softlockup in tick_periodic() ???
Submitter : Joshua Hoblitt <[email protected]>
Date : 2008-09-11 16:46 (67 days old)
References : http://marc.info/?l=linux-kernel&m=122117786124326&w=4
Handled-By : Thomas Gleixner <[email protected]>
Cyrill Gorcunov <[email protected]>
Ingo Molnar <[email protected]>
Cyrill Gorcunov <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11404
Subject : BUG: in 2.6.23-rc3-git7 in do_cciss_intr
Submitter : rdunlap <[email protected]>
Date : 2008-08-21 5:52 (88 days old)
References : http://marc.info/?l=linux-kernel&m=121929819616273&w=4
http://marc.info/?l=linux-kernel&m=121932889105368&w=4
Handled-By : Miller, Mike (OS Dev) <[email protected]>
James Bottomley <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308
Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28
Submitter : Christoph Lameter <[email protected]>
Date : 2008-08-11 18:36 (98 days old)
References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
http://marc.info/?l=linux-kernel&m=122125737421332&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11215
Subject : INFO: possible recursive locking detected ps2_command
Submitter : Zdenek Kabelac <[email protected]>
Date : 2008-07-31 9:41 (109 days old)
References : http://marc.info/?l=linux-kernel&m=121749737011637&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11207
Subject : VolanoMark regression with 2.6.27-rc1
Submitter : Zhang, Yanmin <[email protected]>
Date : 2008-07-31 3:20 (109 days old)
References : http://marc.info/?l=linux-kernel&m=121747464114335&w=4
Handled-By : Zhang, Yanmin <[email protected]>
Peter Zijlstra <[email protected]>
Dhaval Giani <[email protected]>
Miao Xie <[email protected]>


Regressions with patches
------------------------

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11865
Subject : WOL for E100 Doesn't Work Anymore
Submitter : roger <[email protected]>
Date : 2008-10-26 21:56 (22 days old)
Handled-By : Rafael J. Wysocki <[email protected]>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=18646&action=view


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11843
Subject : usb hdd problems with 2.6.27.2
Submitter : Luciano Rocha <[email protected]>
Date : 2008-10-22 16:22 (26 days old)
References : http://marc.info/?l=linux-kernel&m=122469318102679&w=4
Handled-By : Luciano Rocha <[email protected]>
Patch : http://bugzilla.kernel.org/show_bug.cgi?id=11843#c26


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11805
Subject : mounting XFS produces a segfault
Submitter : Tiago Maluta <[email protected]>
Date : 2008-10-21 18:00 (27 days old)
Handled-By : Dave Chinner <[email protected]>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=18397&action=view


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11795
Subject : ks959-sir dongle no longer works under 2.6.27 (REGRESSION)
Submitter : Alex Villacis Lasso <[email protected]>
Date : 2008-10-20 10:49 (28 days old)
Handled-By : Samuel Ortiz <[email protected]>
Patch : http://bugzilla.kernel.org/show_bug.cgi?id=11795#c22


For details, please visit the bug entries and follow the links given in
references.

As you can see, there is a Bugzilla entry for each of the listed regressions.
There is also a Bugzilla entry used for tracking the regressions introduced
between 2.6.26 and 2.6.27, both unresolved and resolved, at:

http://bugzilla.kernel.org/show_bug.cgi?id=11167

Please let me know if there are any Bugzilla entries that should be added to
that list.

Thanks,
Rafael


2008-11-16 19:36:59

by Rafael J. Wysocki

Subject: [Bug #11207] VolanoMark regression with 2.6.27-rc1

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11207
Subject : VolanoMark regression with 2.6.27-rc1
Submitter : Zhang, Yanmin <[email protected]>
Date : 2008-07-31 3:20 (109 days old)
References : http://marc.info/?l=linux-kernel&m=121747464114335&w=4
Handled-By : Zhang, Yanmin <[email protected]>
Peter Zijlstra <[email protected]>
Dhaval Giani <[email protected]>
Miao Xie <[email protected]>

2008-11-16 19:39:38

by Rafael J. Wysocki

Subject: [Bug #11215] INFO: possible recursive locking detected ps2_command

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11215
Subject : INFO: possible recursive locking detected ps2_command
Submitter : Zdenek Kabelac <[email protected]>
Date : 2008-07-31 9:41 (109 days old)
References : http://marc.info/?l=linux-kernel&m=121749737011637&w=4

2008-11-16 19:39:53

by Rafael J. Wysocki

Subject: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308
Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28
Submitter : Christoph Lameter <[email protected]>
Date : 2008-08-11 18:36 (98 days old)
References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
http://marc.info/?l=linux-kernel&m=122125737421332&w=4

2008-11-16 19:40:25

by Rafael J. Wysocki

Subject: [Bug #11404] BUG: in 2.6.23-rc3-git7 in do_cciss_intr

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11404
Subject : BUG: in 2.6.23-rc3-git7 in do_cciss_intr
Submitter : rdunlap <[email protected]>
Date : 2008-08-21 5:52 (88 days old)
References : http://marc.info/?l=linux-kernel&m=121929819616273&w=4
http://marc.info/?l=linux-kernel&m=121932889105368&w=4
Handled-By : Miller, Mike (OS Dev) <[email protected]>
James Bottomley <[email protected]>

2008-11-16 19:40:50

by Rafael J. Wysocki

Subject: [Bug #11543] kernel panic: softlockup in tick_periodic() ???

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11543
Subject : kernel panic: softlockup in tick_periodic() ???
Submitter : Joshua Hoblitt <[email protected]>
Date : 2008-09-11 16:46 (67 days old)
References : http://marc.info/?l=linux-kernel&m=122117786124326&w=4
Handled-By : Thomas Gleixner <[email protected]>
Cyrill Gorcunov <[email protected]>
Ingo Molnar <[email protected]>
Cyrill Gorcunov <[email protected]>

2008-11-16 19:41:13

by Rafael J. Wysocki

Subject: [Bug #11569] Panic stop CPUs regression

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11569
Subject : Panic stop CPUs regression
Submitter : Andi Kleen <[email protected]>
Date : 2008-09-02 13:49 (76 days old)
References : http://marc.info/?l=linux-kernel&m=122036356127282&w=4

2008-11-16 19:41:28

by Rafael J. Wysocki

Subject: [Bug #11664] acpi errors and random freeze on sony vaio sr

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11664
Subject : acpi errors and random freeze on sony vaio sr
Submitter : Giovanni Pellerano <[email protected]>
Date : 2008-09-28 03:48 (50 days old)
Patch : http://marc.info/?l=linux-acpi&m=122514341319748&w=4

2008-11-16 19:41:46

by Rafael J. Wysocki

Subject: [Bug #11886] without serial console system doesn't poweroff

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11886
Subject : without serial console system doesn't poweroff
Submitter : Daniel Smolik <[email protected]>
Date : 2008-10-29 04:06 (19 days old)

2008-11-16 19:42:01

by Rafael J. Wysocki

Subject: [Bug #11876] RCU hang on cpu re-hotplug with 2.6.27rc8

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11876
Subject : RCU hang on cpu re-hotplug with 2.6.27rc8
Submitter : Andi Kleen <[email protected]>
Date : 2008-10-06 23:28 (42 days old)
References : http://marc.info/?l=linux-kernel&m=122333610602399&w=2
Handled-By : Paul E. McKenney <[email protected]>

2008-11-16 19:42:29

by Rafael J. Wysocki

Subject: [Bug #11698] 2.6.27-rc7, freezes with > 1 s2ram cycle

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11698
Subject : 2.6.27-rc7, freezes with > 1 s2ram cycle
Submitter : Soeren Sonnenburg <[email protected]>
Date : 2008-09-29 11:29 (49 days old)
References : http://marc.info/?l=linux-kernel&m=122268780926859&w=4
Handled-By : Rafael J. Wysocki <[email protected]>

2008-11-16 19:42:46

by Rafael J. Wysocki

Subject: [Bug #11836] Scheduler on C2D CPU and latest 2.6.27 kernel

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11836
Subject : Scheduler on C2D CPU and latest 2.6.27 kernel
Submitter : Zdenek Kabelac <[email protected]>
Date : 2008-10-21 9:59 (27 days old)
References : http://marc.info/?l=linux-kernel&m=122458320502371&w=4
Handled-By : Chris Snook <[email protected]>

2008-11-16 19:43:06

by Rafael J. Wysocki

Subject: [Bug #11795] ks959-sir dongle no longer works under 2.6.27 (REGRESSION)

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11795
Subject : ks959-sir dongle no longer works under 2.6.27 (REGRESSION)
Submitter : Alex Villacis Lasso <[email protected]>
Date : 2008-10-20 10:49 (28 days old)
Handled-By : Samuel Ortiz <[email protected]>

2008-11-16 19:43:27

by Rafael J. Wysocki

Subject: [Bug #11865] WOL for E100 Doesn't Work Anymore

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11865
Subject : WOL for E100 Doesn't Work Anymore
Submitter : roger <[email protected]>
Date : 2008-10-26 21:56 (22 days old)
Handled-By : Rafael J. Wysocki <[email protected]>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=18646&action=view

2008-11-16 19:43:43

by Rafael J. Wysocki

Subject: [Bug #11843] usb hdd problems with 2.6.27.2

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11843
Subject : usb hdd problems with 2.6.27.2
Submitter : Luciano Rocha <[email protected]>
Date : 2008-10-22 16:22 (26 days old)
References : http://marc.info/?l=linux-kernel&m=122469318102679&w=4
Handled-By : Luciano Rocha <[email protected]>
Patch : http://bugzilla.kernel.org/show_bug.cgi?id=11843#c26

2008-11-16 19:44:00

by Rafael J. Wysocki

Subject: [Bug #11805] mounting XFS produces a segfault

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11805
Subject : mounting XFS produces a segfault
Submitter : Tiago Maluta <[email protected]>
Date : 2008-10-21 18:00 (27 days old)
Handled-By : Dave Chinner <[email protected]>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=18397&action=view

2008-11-16 19:44:26

by Rafael J. Wysocki

Subject: [Bug #12039] Regression: USB/DVB 2.6.26.8 --> 2.6.27.6

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=12039
Subject : Regression: USB/DVB 2.6.26.8 --> 2.6.27.6
Submitter : David <[email protected]>
Date : 2008-11-14 20:20 (3 days old)
References : http://marc.info/?l=linux-kernel&m=122669568022274&w=4

2008-11-16 19:44:42

by Rafael J. Wysocki

Subject: [Bug #11983] iwlagn: wrong command queue 31, command id 0x0

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11983
Subject : iwlagn: wrong command queue 31, command id 0x0
Submitter : Matt Mackall <[email protected]>
Date : 2008-11-06 4:16 (11 days old)
References : http://marc.info/?l=linux-kernel&m=122598672815803&w=4
http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=1703
Handled-By : reinette chatre <[email protected]>

2008-11-16 19:44:58

by Rafael J. Wysocki

Subject: [Bug #12048] Regression in bonding between 2.6.26.8 and 2.6.27.6

This message has been generated automatically as a part of a report
of regressions introduced between 2.6.26 and 2.6.27.

The following bug entry is on the current list of known regressions
introduced between 2.6.26 and 2.6.27. Please verify if it still should
be listed and let me know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=12048
Subject : Regression in bonding between 2.6.26.8 and 2.6.27.6
Submitter : Jesper Krogh <[email protected]>
Date : 2008-11-16 9:41 (1 day old)
References : http://marc.info/?l=linux-kernel&m=122682977001048&w=4

2008-11-16 21:37:52

by Luciano Rocha

Subject: Re: [Bug #11843] usb hdd problems with 2.6.27.2

On Sun, Nov 16, 2008 at 06:40:59PM +0100, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.26 and 2.6.27.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.26 and 2.6.27. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11843
> Subject : usb hdd problems with 2.6.27.2
> Submitter : Luciano Rocha <[email protected]>
> Date : 2008-10-22 16:22 (26 days old)
> References : http://marc.info/?l=linux-kernel&m=122469318102679&w=4
> Handled-By : Luciano Rocha <[email protected]>
> Patch : http://bugzilla.kernel.org/show_bug.cgi?id=11843#c26

What does "Handled-By" mean? The patches were created by Alan Stern
<[email protected]>; I just tested them.

Regards,
Luciano Rocha

--
Luciano Rocha <[email protected]>
Eurotux Informática, S.A. <http://www.eurotux.com/>

2008-11-17 09:07:32

by Ingo Molnar

Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Rafael J. Wysocki <[email protected]> wrote:

> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.26 and 2.6.27.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.26 and 2.6.27. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308
> Subject : tbench regression on each kernel release from 2.6.22 -&gt; 2.6.28
> Submitter : Christoph Lameter <[email protected]>
> Date : 2008-08-11 18:36 (98 days old)
> References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
> http://marc.info/?l=linux-kernel&m=122125737421332&w=4

Christoph, as per the recent analysis of Mike:

http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html

all scheduler components of this regression have been eliminated.

In fact his numbers show that scheduler speedups since 2.6.22 have
offset and hidden most other sources of tbench regression. (i.e. the
scheduler portion got 5% faster, hence it was able to offset a
slowdown of 5% in other areas of the kernel that tbench triggers)

Ingo

2008-11-17 09:14:21

by David Miller

Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Ingo Molnar <[email protected]>
Date: Mon, 17 Nov 2008 10:06:48 +0100

>
> * Rafael J. Wysocki <[email protected]> wrote:
>
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.26 and 2.6.27.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.26 and 2.6.27. Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308
> > Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28
> > Submitter : Christoph Lameter <[email protected]>
> > Date : 2008-08-11 18:36 (98 days old)
> > References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
> > http://marc.info/?l=linux-kernel&m=122125737421332&w=4
>
> Christoph, as per the recent analysis of Mike:
>
> http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
>
> all scheduler components of this regression have been eliminated.
>
> In fact his numbers show that scheduler speedups since 2.6.22 have
> offset and hidden most other sources of tbench regression. (i.e. the
> scheduler portion got 5% faster, hence it was able to offset a
> slowdown of 5% in other areas of the kernel that tbench triggers)

Although I respect the improvements, wake_up() is still several orders
of magnitude slower than it was in 2.6.22 and wake_up() is at the top
of the profiles in tbench runs.

It really is premature to close this regression at this time.

I am working with every spare moment I have to try and nail this
stuff, but unless someone else helps me people need to be patient.

2008-11-17 11:02:04

by Ingo Molnar

Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Mon, 17 Nov 2008 10:06:48 +0100
>
> >
> > * Rafael J. Wysocki <[email protected]> wrote:
> >
> > > This message has been generated automatically as a part of a report
> > > of regressions introduced between 2.6.26 and 2.6.27.
> > >
> > > The following bug entry is on the current list of known regressions
> > > introduced between 2.6.26 and 2.6.27. Please verify if it still should
> > > be listed and let me know (either way).
> > >
> > >
> > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308
> > > Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28
> > > Submitter : Christoph Lameter <[email protected]>
> > > Date : 2008-08-11 18:36 (98 days old)
> > > References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
> > > http://marc.info/?l=linux-kernel&m=122125737421332&w=4
> >
> > Christoph, as per the recent analysis of Mike:
> >
> > http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
> >
> > all scheduler components of this regression have been eliminated.
> >
> > In fact his numbers show that scheduler speedups since 2.6.22 have
> > offset and hidden most other sources of tbench regression. (i.e. the
> > scheduler portion got 5% faster, hence it was able to offset a
> > slowdown of 5% in other areas of the kernel that tbench triggers)
>
> Although I respect the improvements, wake_up() is still several
> orders of magnitude slower than it was in 2.6.22 and wake_up() is at
> the top of the profiles in tbench runs.

hm, several orders of magnitude slower? That contradicts Mike's
numbers and my own numbers and profiles as well: see below.

The scheduler's overhead barely even registers on a 16-way x86 system
i'm running tbench on. Here's the NMI profile during 64 threads tbench
on a 16-way x86 box with a v2.6.28-rc5 kernel [config attached]:

Throughput 3437.65 MB/sec 64 procs
==================================
21570252 total
........
1494803 copy_user_generic_string
998232 sock_rfree
491471 tcp_ack
482405 ip_dont_fragment
470685 ip_local_deliver
436325 constant_test_bit [ called by napi_disable_pending() ]
375469 avc_has_perm_noaudit
347663 tcp_sendmsg
310383 tcp_recvmsg
300412 __inet_lookup_established
294377 system_call
286603 tcp_transmit_skb
251782 selinux_ip_postroute
236028 tcp_current_mss
235631 schedule
234013 netif_rx
229854 _local_bh_enable_ip
219501 tcp_v4_rcv

[ etc. - see full profile attached further below ]

Note that the scheduler does not even show up in the profile up to
entry #15!

I've also summarized NMI profiler output by major subsystems:

NET overhead (12603450/21570252): 58.43%
security overhead ( 1903598/21570252): 8.83%
usercopy overhead ( 1753617/21570252): 8.13%
sched overhead ( 1599406/21570252): 7.41%
syscall overhead ( 560487/21570252): 2.60%
IRQ overhead ( 555439/21570252): 2.58%
slab overhead ( 492421/21570252): 2.28%
timer overhead ( 226573/21570252): 1.05%
pagealloc overhead ( 192681/21570252): 0.89%
PID overhead ( 115123/21570252): 0.53%
VFS overhead ( 107926/21570252): 0.50%
pagecache overhead ( 62552/21570252): 0.29%
gtod overhead ( 38651/21570252): 0.18%
IDLE overhead ( 0/21570252): 0.00%
---------------------------------------------------------
left ( 1349494/21570252): 6.26%

The scheduler's functions are absolutely flat, and consistent with an
extreme context-switching rate of 1.35 million per second. The
scheduler can go up to about 20 million context switches per second on
this system:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
32 0 0 32229696 29308 649880 0 0 0 0 164135 20026853 24 76 0 0 0
32 0 0 32229752 29308 649880 0 0 0 0 164203 20032770 24 76 0 0 0
32 0 0 32229752 29308 649880 0 0 0 0 164201 20036492 25 75 0 0 0

... and 7% scheduling overhead is roughly consistent with 1.35/20.0 (about 6.75%).

Wake-up affinities and data flow caching are just fine in this workload
- we've got scheduler statistics for that and they look good too.

It all looks like pure old-fashioned straight overhead in the
networking layer to me. Do we still touch the same global cacheline
for every localhost packet we process? Anything like that would show
up big time.

Anyway, in terms of scheduling there's absolutely nothing anomalous i
can see about this workload. Scheduling looks healthy throughout - and
the few things we noticed causing unnecessary overhead are now fixed
in -rc5. (but it's all in the <5% range of impact of total scheduling
overhead - i.e. in the 0.4% absolute range in this workload)

And the thing is, the scheduler's task in this workload is by far the
most difficult one conceptually: it has to manage and optimize
concurrency of _future_ processing, with an event frequency that is
_WAY_ out of the normal patterns: more than 1.3 million context
switches per second (!). It also switches to/from completely
independent contexts of computing, with all the implications that
this brings.

Networking and VFS "just" have to shuffle around bits in memory along a
very specific plan given to it by user-space. That plan is
well-specified and goes along the lines of: "copy this (already
cached) file content to that socket" and back.

By the raw throughput figures the system is pushing a couple of
million data packets per second.

Still we spend 7 times more CPU time in the networking code than in
the scheduler or in the user-copy code. Why?

Ingo

------------------------->
21570252 total
........
1494803 copy_user_generic_string
998232 sock_rfree
491471 tcp_ack
482405 ip_dont_fragment
470685 ip_local_deliver
436325 constant_test_bit
375469 avc_has_perm_noaudit
347663 tcp_sendmsg
310383 tcp_recvmsg
300412 __inet_lookup_established
294377 system_call
286603 tcp_transmit_skb
251782 selinux_ip_postroute
236028 tcp_current_mss
235631 schedule
234013 netif_rx
229854 _local_bh_enable_ip
219501 tcp_v4_rcv
210046 netlbl_enabled
205022 constant_test_bit
199598 skb_release_head_state
187952 ip_queue_xmit
178779 tcp_established_options
175955 dev_queue_xmit
169904 netif_receive_skb
166629 ip_finish_output2
162291 sysret_check
151262 __switch_to
143355 audit_syscall_entry
142694 load_cr3
136571 memset_c
136115 nf_hook_slow
130825 ip_local_deliver_finish
128795 ip_rcv
125995 selinux_socket_sock_rcv_skb
123944 net_rx_action
123100 __copy_skb_header
122052 __inet_lookup
121744 constant_test_bit
119444 get_page_from_freelist
116486 avc_has_perm
115643 audit_syscall_exit
115123 find_pid_ns
114483 tcp_cleanup_rbuf
111350 tcp_rcv_established
109853 __mod_timer
107891 lock_sock_nested
107316 napi_disable_pending
106581 release_sock
104402 skb_copy_datagram_iovec
101591 __tcp_push_pending_frames
101206 tcp_event_data_recv
98046 kmem_cache_alloc_node
97982 tcp_v4_do_rcv
92714 sys_recvfrom
91551 rb_erase
89730 kfree
87979 ip_rcv_finish
87166 compare_ether_addr
86982 selinux_parse_skb
86731 nf_iterate
79690 selinux_ipv4_output
79347 __cache_free
78992 audit_free_names
78127 skb_release_data
77501 mod_timer
77241 __sock_recvmsg
77228 sock_recvmsg
77211 ____cache_alloc
76495 tcp_rcv_space_adjust
75283 sk_wait_data
71772 sys_sendto
71594 sched_clock
70880 eth_type_trans
70238 memcpy_toiovec
69193 do_softirq
68341 __update_sched_clock
67597 tcp_v4_md5_lookup
67424 try_to_wake_up
64465 sock_common_recvmsg
64116 put_prev_task_fair
63964 process_backlog
62216 __do_softirq
62093 tcp_cwnd_validate
61128 __alloc_skb
60588 put_page
59536 dput
58411 __ip_local_out
56349 avc_audit
55626 __napi_schedule
55525 selinux_ipv4_postroute
54499 __enqueue_entity
53599 local_bh_disable
53418 unroll_tree_refs
53162 __unlazy_fpu
53084 cfs_rq_of
52475 set_next_entity
51108 thread_return
50458 ip_output
50268 sched_clock_cpu
49974 tcp_send_delayed_ack
49736 ip_finish_output
49670 finish_task_switch
49070 ___swab16
48499 audit_get_context
48347 raw_local_deliver
47824 tcp_rtt_estimator
46707 tcp_push
46405 constant_test_bit
45859 select_task_rq_fair
45188 math_state_restore
44889 check_preempt_wakeup
44449 task_rq_lock
43704 sel_netif_sid
43377 sock_sendmsg
42612 sk_reset_timer
42606 __skb_clone
42223 __find_general_cachep
41950 selinux_socket_sendmsg
41716 constant_test_bit
41097 skb_push
40723 lock_sock
40715 system_call_after_swapgs
40399 selinux_netlbl_inode_permission
40179 rb_insert_color
40021 __kfree_skb
40015 sockfd_lookup_light
39216 internal_add_timer
39024 skb_can_coalesce
38838 __tcp_select_window
38651 current_kernel_time
38533 tcp_v4_md5_do_lookup
38372 __sock_sendmsg
38162 selinux_socket_recvmsg
37812 sel_netport_sid
37727 account_group_exec_runtime
37695 switch_mm
36247 nf_hook_thresh
36057 auditsys
35266 pick_next_task_fair
35064 __tcp_ack_snd_check
35052 sock_def_readable
34826 sysret_careful
34578 _local_bh_enable
34498 free_hot_cold_page
34338 kmap
34028 loopback_xmit
33320 sk_stream_alloc_skb
33269 test_ti_thread_flag
33219 skb_fill_page_desc
33049 tcp_is_cwnd_limited
33012 update_min_vruntime
32431 native_read_tsc
32398 dst_release
31661 get_pageblock_flags_group
31652 path_put
31516 tcp_push_pending_frames
31265 netif_needs_gso
31175 constant_test_bit
31077 __cycles_2_ns
30971 socket_has_perm
30893 __phys_addr
30867 lock_timer_base
30585 __wake_up
30456 ret_from_sys_call
30147 skb_release_all
29356 local_bh_enable
29334 __skb_insert
28681 tcp_cwnd_test
28652 __skb_dequeue
28612 prepare_to_wait
28268 kmem_cache_free
28193 set_bit
28149 dequeue_task_fair
27906 skb_header_pointer
27861 sys_kill
27803 selinux_task_kill
27627 audit_free_aux
27600 selinux_netlbl_sock_rcv_skb
26794 update_curr
26777 __alloc_pages_internal
26469 skb_entail
26458 pskb_may_pull
26216 inet_ehashfn
26075 call_softirq
26033 copy_from_user
25933 __local_bh_disable
25666 fget_light
25270 inet_csk_reset_xmit_timer
25071 signal_pending_state
24117 tcp_init_tso_segs
24109 TCP_ECN_check_ce
23702 nf_hook_thresh
23558 copy_to_user
23426 sysret_audit
23267 sk_wake_async
22627 tcp_options_write
22174 netif_tx_queue_stopped
21795 tcp_prequeue_process
21757 tcp_set_skb_tso_segs
21579 avc_hash
21565 ___swab16
21560 ip_local_out
21445 sk_wmem_schedule
21234 get_page
21200 __wake_up_common
21042 sel_netnode_find
20772 sock_put
20625 schedule_timeout
20613 __napi_complete
20563 fput_light
20532 tcp_bound_to_half_wnd
19912 cap_task_kill
19773 sysret_signal
19374 compound_head
19121 get_seconds
19048 PageLRU
18893 zone_watermark_ok
18635 tcp_snd_wnd_test
18634 enqueue_task_fair
18603 rb_next
18598 next_zones_zonelist
18534 resched_task
17820 hash_64
17801 autoremove_wake_function
17451 __skb_queue_before
17283 native_load_tls
17227 __skb_dequeue
17149 xfrm4_policy_check
16942 zone_statistics
16886 skb_reset_network_header
16824 ___swab16
16725 pskb_may_pull
16645 dev_hard_start_xmit
16580 sk_filter
16523 tcp_ca_event
16479 tcp_win_from_space
16408 tcp_parse_aligned_timestamp
16204 finish_wait
16124 virt_to_slab
15965 tcp_v4_send_check
15920 skb_reset_transport_header
15867 tcp_data_snd_check
15819 security_sock_rcv_skb
15665 tcp_ack_saw_tstamp
15621 skb_network_offset
15568 virt_to_head_page
15553 dst_confirm
15320 skb_pull
15277 clear_bit
15179 alloc_pages_current
14991 bictcp_acked
14743 tcp_store_ts_recent
14660 sel_netnode_sid
14650 __xchg
14573 task_has_perm
14561 tcp_v4_check
14492 net_invalid_timestamp
14485 security_socket_recvmsg
14363 __dequeue_entity
14318 pid_nr_ns
14311 device_not_available
14212 local_bh_enable_ip
14092 virt_to_cache
13804 netpoll_rx
13781 fcheck_files
13724 tcp_adjust_fackets_out
13717 net_timestamp
13638 ___swab16
13576 sel_netport_find
13563 __kmalloc_node
13530 __inc_zone_state
13215 pid_vnr
13208 free_pages_check
13008 security_socket_sendmsg
12971 ip_skb_dst_mtu
12827 __cpu_set
12782 bictcp_cong_avoid
12779 test_tsk_thread_flag
12734 wakeup_preempt_entity
12651 sel_netif_find
12545 skb_set_owner_r
12534 skb_headroom
12348 tcp_event_new_data_sent
12251 place_entity
12047 set_bit
11805 update_rq_clock
11788 detach_timer
11659 policy_zonelist
11423 skb_clone
11380 __skb_queue_tail
11249 dequeue_task
10823 init_rootdomain
10690 __cpu_clear
10558 default_wake_function
10556 tcp_rcv_rtt_measure_ts
10451 PageSlab
10427 sock_wfree
10277 calc_delta_fair
10237 tcp_validate_incoming
10218 task_rq_unlock
10023 page_get_cache



2008-11-17 11:22:25

by Eric Dumazet

Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Ingo Molnar wrote:
> * David Miller <[email protected]> wrote:
>
>> From: Ingo Molnar <[email protected]>
>> Date: Mon, 17 Nov 2008 10:06:48 +0100
>>
>>> * Rafael J. Wysocki <[email protected]> wrote:
>>>
>>>> This message has been generated automatically as a part of a report
>>>> of regressions introduced between 2.6.26 and 2.6.27.
>>>>
>>>> The following bug entry is on the current list of known regressions
>>>> introduced between 2.6.26 and 2.6.27. Please verify if it still should
>>>> be listed and let me know (either way).
>>>>
>>>>
>>>> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308
>>>> Subject : tbench regression on each kernel release from 2.6.22 -> 2.6.28
>>>> Submitter : Christoph Lameter <[email protected]>
>>>> Date : 2008-08-11 18:36 (98 days old)
>>>> References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4
>>>> http://marc.info/?l=linux-kernel&m=122125737421332&w=4
>>> Christoph, as per the recent analysis of Mike:
>>>
>>> http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
>>>
>>> all scheduler components of this regression have been eliminated.
>>>
>>> In fact his numbers show that scheduler speedups since 2.6.22 have
>>> offset and hidden most other sources of tbench regression. (i.e. the
>>> scheduler portion got 5% faster, hence it was able to offset a
>>> slowdown of 5% in other areas of the kernel that tbench triggers)
>> Although I respect the improvements, wake_up() is still several
>> orders of magnitude slower than it was in 2.6.22 and wake_up() is at
>> the top of the profiles in tbench runs.
>
> hm, several orders of magnitude slower? That contradicts Mike's
> numbers and my own numbers and profiles as well: see below.
>
> The scheduler's overhead barely even registers on a 16-way x86 system
> i'm running tbench on. Here's the NMI profile during 64 threads tbench
> on a 16-way x86 box with an v2.6.28-rc5 kernel [config attached]:
>
> Throughput 3437.65 MB/sec 64 procs
> ==================================
> 21570252 total
> ........
> 1494803 copy_user_generic_string
> 998232 sock_rfree
> 491471 tcp_ack
> 482405 ip_dont_fragment
> 470685 ip_local_deliver
> 436325 constant_test_bit [ called by napi_disable_pending() ]
> 375469 avc_has_perm_noaudit
> 347663 tcp_sendmsg
> 310383 tcp_recvmsg
> 300412 __inet_lookup_established
> 294377 system_call
> 286603 tcp_transmit_skb
> 251782 selinux_ip_postroute
> 236028 tcp_current_mss
> 235631 schedule
> 234013 netif_rx
> 229854 _local_bh_enable_ip
> 219501 tcp_v4_rcv
>
> [ etc. - see full profile attached further below ]
>
> Note that the scheduler does not even show up in the profile up to
> entry #15!
>
> I've also summarized NMI profiler output by major subsystems:
>
> NET overhead (12603450/21570252): 58.43%
> security overhead ( 1903598/21570252): 8.83%
> usercopy overhead ( 1753617/21570252): 8.13%
> sched overhead ( 1599406/21570252): 7.41%
> syscall overhead ( 560487/21570252): 2.60%
> IRQ overhead ( 555439/21570252): 2.58%
> slab overhead ( 492421/21570252): 2.28%
> timer overhead ( 226573/21570252): 1.05%
> pagealloc overhead ( 192681/21570252): 0.89%
> PID overhead ( 115123/21570252): 0.53%
> VFS overhead ( 107926/21570252): 0.50%
> pagecache overhead ( 62552/21570252): 0.29%
> gtod overhead ( 38651/21570252): 0.18%
> IDLE overhead ( 0/21570252): 0.00%
> ---------------------------------------------------------
> left ( 1349494/21570252): 6.26%
>
> The scheduler's functions are absolutely flat, and consistent with an
> extreme context-switching rate of 1.35 million per second. The
> scheduler can go up to about 20 million context switches per second on
> this system:
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 32 0 0 32229696 29308 649880 0 0 0 0 164135 20026853 24 76 0 0 0
> 32 0 0 32229752 29308 649880 0 0 0 0 164203 20032770 24 76 0 0 0
> 32 0 0 32229752 29308 649880 0 0 0 0 164201 20036492 25 75 0 0 0
>
> ... and 7% scheduling overhead is roughly consistent with 1.35/20.0.
>
> Wake up affinities and data flow caching is just fine in this workload
> - we've got scheduler statistics for that and they look good too.
>
> It all looks like pure old-fashioned straight overhead in the
> networking layer to me. Do we still touch the same global cacheline
> for every localhost packet we process? Anything like that would show
> up big time.

Yes we do; I find it strange we don't see dst_release() in your NMI profile.

I posted a patch (commit 5635c10d976716ef47ae441998aeae144c7e7387,
"net: make sure struct dst_entry refcount is aligned on 64 bytes",
in the net-next-2.6 tree)
to properly align the struct dst_entry refcounter, and got a 4% speedup on tbench
on my machine.

There are small speedups too with commit ef711cf1d156428d4c2911b8c86c6ce90519dc45
(net: speedup dst_release()).

Also in net-next-2.6, there are patches that avoid dirtying last_rx on netdevices
(loopback for example); they help tbench a lot too.
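(A minimal user-space sketch of the cache-line ping-pong described above -
illustrative only, not kernel code; it assumes a 64-byte line, as the alignment
patch later in this thread does, and all names here are made up:)

/*
 * A refcount that shares a cache line with read-mostly fields forces
 * the whole line to bounce between CPUs. Build with:
 *   gcc -O2 -pthread sketch.c          (refcnt shares the line)
 *   gcc -O2 -pthread -DPADDED sketch.c (refcnt on its own line)
 * and compare the timings of the two binaries.
 */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000UL

struct dst_like {
	void * volatile input;			/* read-mostly, like dst->input */
	void * volatile output;			/* read-mostly, like dst->output */
#ifdef PADDED
	char pad[64 - 2 * sizeof(void *)];	/* push refcnt to the next line */
#endif
	volatile unsigned long refcnt;		/* written for every "packet" */
} __attribute__((aligned(64)));

static struct dst_like dst;

static void *writer(void *arg)
{
	unsigned long i;

	for (i = 0; i < ITERS; i++)
		dst.refcnt++;			/* dirties the line holding refcnt */
	return NULL;
}

int main(void)
{
	pthread_t t;
	void *sink;
	unsigned long i;

	pthread_create(&t, NULL, writer, NULL);
	for (i = 0; i < ITERS; i++)
		sink = dst.input;		/* reader hits the same line unless PADDED */
	pthread_join(t, NULL);

	(void)sink;
	printf("refcnt = %lu\n", dst.refcnt);
	return 0;
}

(With -DPADDED the reader stops sharing the writer's line, which is the same
effect the __refcnt alignment patch is after.)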

2008-11-17 14:45:10

by Christoph Hellwig

Subject: Re: [Bug #11805] mounting XFS produces a segfault

On Sun, Nov 16, 2008 at 06:40:58PM +0100, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.26 and 2.6.27.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.26 and 2.6.27. Please verify if it still should
> be listed and let me know (either way).

The patch for this is in both mainline and -stable.

> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11805
> Subject : mounting XFS produces a segfault
> Submitter : Tiago Maluta <[email protected]>
> Date : 2008-10-21 18:00 (27 days old)
> Handled-By : Dave Chinner <[email protected]>

And that email address for Dave is severely outdated.

2008-11-17 16:12:15

by Ingo Molnar

Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Eric Dumazet <[email protected]> wrote:

>> It all looks like pure old-fashioned straight overhead in the
>> networking layer to me. Do we still touch the same global cacheline
>> for every localhost packet we process? Anything like that would
>> show up big time.
>
> Yes we do, I find strange we dont see dst_release() in your NMI
> profile
>
> I posted a patch ( commit 5635c10d976716ef47ae441998aeae144c7e7387
> net: make sure struct dst_entry refcount is aligned on 64 bytes) (in
> net-next-2.6 tree) to properly align struct dst_entry refcounter and
> got 4% speedup on tbench on my machine.

Ouch, +4% from a oneliner networking change? That's a _huge_ speedup
compared to the things we were after in scheduler land. A lot of
scheduler folks worked hard to squeeze the last 1-2% out of the
scheduler fastpath (which was not trivial at all). The _full_
scheduler accounts for only about 7% of the total system overhead here
on a 16-way box...

So why should we be handling this as anything but a plain networking
performance regression/weakness? The localhost scalability bottleneck
has been reported a _long_ time ago.

Ingo

2008-11-17 16:20:33

by Randy Dunlap

Subject: Re: [Bug #11404] BUG: in 2.6.23-rc3-git7 in do_cciss_intr

Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.26 and 2.6.27.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.26 and 2.6.27. Please verify if it still should
> be listed and let me know (either way).
>

Nothing has changed. IMO that means leave the bug as is (alive).

>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11404
> Subject : BUG: in 2.6.23-rc3-git7 in do_cciss_intr
> Submitter : rdunlap <[email protected]>
> Date : 2008-08-21 5:52 (88 days old)
> References : http://marc.info/?l=linux-kernel&m=121929819616273&w=4
> http://marc.info/?l=linux-kernel&m=121932889105368&w=4
> Handled-By : Miller, Mike (OS Dev) <[email protected]>
> James Bottomley <[email protected]>

--
~Randy

2008-11-17 16:35:31

by Eric Dumazet

Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Ingo Molnar wrote:
> * Eric Dumazet <[email protected]> wrote:
>
>>> It all looks like pure old-fashioned straight overhead in the
>>> networking layer to me. Do we still touch the same global cacheline
>>> for every localhost packet we process? Anything like that would
>>> show up big time.
>> Yes we do, I find strange we dont see dst_release() in your NMI
>> profile
>>
>> I posted a patch ( commit 5635c10d976716ef47ae441998aeae144c7e7387
>> net: make sure struct dst_entry refcount is aligned on 64 bytes) (in
>> net-next-2.6 tree) to properly align struct dst_entry refcounter and
>> got 4% speedup on tbench on my machine.
>
> Ouch, +4% from a oneliner networking change? That's a _huge_ speedup
> compared to the things we were after in scheduler land. A lot of
> scheduler folks worked hard to squeeze the last 1-2% out of the
> scheduler fastpath (which was not trivial at all). The _full_
> scheduler accounts for only about 7% of the total system overhead here
> on a 16-way box...

4% on my machine, but apparently my machine is sooooo special (see oprofile thread),
so maybe its cpus have a hard time playing with a contended cache line.

It definitely needs more testing on other machines.

Maybe you'll discover the patch is bad on your machines; this is why it's in
net-next-2.6.

>
> So why should we be handling this anything but a plain networking
> performance regression/weakness? The localhost scalability bottleneck
> has been reported a _long_ time ago.
>

The struct dst_entry problem was already discovered a _long_ time ago
and probably solved at that time.

(commit f1dd9c379cac7d5a76259e7dffcd5f8edc697d17
Thu, 13 Mar 2008 05:52:37 +0000 (22:52 -0700)
[NET]: Fix tbench regression in 2.6.25-rc1)

Then, a gremlin came and broke the thing.

There are many contended cache lines in the system; we can do our
best to try to make them disappear. That's not always possible.

Another contended cache line is the rwlock in iptables.
I remember Stephen had a patch to make the thing use RCU.

2008-11-17 17:09:30

by Ingo Molnar

Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Eric Dumazet <[email protected]> wrote:

> Ingo Molnar wrote:
>> * Eric Dumazet <[email protected]> wrote:
>>
>>>> It all looks like pure old-fashioned straight overhead in the
>>>> networking layer to me. Do we still touch the same global cacheline
>>>> for every localhost packet we process? Anything like that would
>>>> show up big time.
>>> Yes we do, I find strange we dont see dst_release() in your NMI
>>> profile
>>>
>>> I posted a patch ( commit 5635c10d976716ef47ae441998aeae144c7e7387
>>> net: make sure struct dst_entry refcount is aligned on 64 bytes) (in
>>> net-next-2.6 tree) to properly align struct dst_entry refcounter and
>>> got 4% speedup on tbench on my machine.
>>
>> Ouch, +4% from a oneliner networking change? That's a _huge_ speedup
>> compared to the things we were after in scheduler land. A lot of
>> scheduler folks worked hard to squeeze the last 1-2% out of the
>> scheduler fastpath (which was not trivial at all). The _full_
>> scheduler accounts for only about 7% of the total system overhead here
>> on a 16-way box...
>
> 4% on my machine, but apparently my machine is sooooo special (see
> oprofile thread), so maybe its cpus have a hard time playing with a
> contended cache line.
>
> It definitly needs more testing on other machines.
>
> Maybe you'll discover patch is bad on your machines, this is why
> it's in net-next-2.6

ok, i'll try it on my testbox too, to check whether it has any effect
- find below the port to -git.

tbench _is_ very sensitive to seemingly small details - it seems to be
hovering around some sort of CPU cache boundary and penalizing
random alignment changes, as we drop in and out of the sweet spot.

Mike Galbraith has been spending months trying to pin down all the
issues.

Ingo

------------->
From 8fbd307d402647b07c3c2662fdac589494d16e5e Mon Sep 17 00:00:00 2001
From: Eric Dumazet <[email protected]>
Date: Sun, 16 Nov 2008 19:46:36 -0800
Subject: [PATCH] net: make sure struct dst_entry refcount is aligned on 64 bytes

As found in the past (commit f1dd9c379cac7d5a76259e7dffcd5f8edc697d17
[NET]: Fix tbench regression in 2.6.25-rc1), it is really
important that struct dst_entry refcount is aligned on a cache line.

We cannot use __attribute__((aligned)), so manually pad the structure
for 32 and 64 bit arches.

for 32bit : offsetof(struct dst_entry, __refcnt) is 0x80
for 64bit : offsetof(struct dst_entry, __refcnt) is 0xc0

As it is not possible to guess the cache line size at compile time,
we use a generic value of 64 bytes, which satisfies many current arches.
(Using 128 bytes alignment on 64bit arches would waste 64 bytes)

Add a BUILD_BUG_ON to make sure future updates to "struct dst_entry"
don't break this alignment.

"tbench 8" is 4.4 % faster on a dual quad core (HP BL460c G1), Intel E5450 @3.00GHz
(2350 MB/s instead of 2250 MB/s)

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
---
include/net/dst.h | 21 +++++++++++++++++++++
1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 8a8b71e..1b4de18 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -59,7 +59,11 @@ struct dst_entry

struct neighbour *neighbour;
struct hh_cache *hh;
+#ifdef CONFIG_XFRM
struct xfrm_state *xfrm;
+#else
+ void *__pad1;
+#endif

int (*input)(struct sk_buff*);
int (*output)(struct sk_buff*);
@@ -70,8 +74,20 @@ struct dst_entry

#ifdef CONFIG_NET_CLS_ROUTE
__u32 tclassid;
+#else
+ __u32 __pad2;
#endif

+
+ /*
+ * Align __refcnt to a 64 bytes alignment
+ * (L1_CACHE_SIZE would be too much)
+ */
+#ifdef CONFIG_64BIT
+ long __pad_to_align_refcnt[2];
+#else
+ long __pad_to_align_refcnt[1];
+#endif
/*
* __refcnt wants to be on a different cache line from
* input/output/ops or performance tanks badly
@@ -157,6 +173,11 @@ dst_metric_locked(struct dst_entry *dst, int metric)

static inline void dst_hold(struct dst_entry * dst)
{
+ /*
+ * If your kernel compilation stops here, please check
+ * __pad_to_align_refcnt declaration in struct dst_entry
+ */
+ BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
atomic_inc(&dst->__refcnt);
}
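(The BUILD_BUG_ON idea above can be tried out in plain user-space C; a hedged
sketch with an illustrative struct, not the real dst_entry:)

/* User-space sketch of the compile-time alignment check used in the
 * patch. Only the offsetof/BUILD_BUG_ON trick matches the real code;
 * the struct and its field names are made up. */
#include <stddef.h>
#include <stdio.h>

/* negative array size if cond is true -> compile error */
#define BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

struct example {
	char other_fields[64];	/* everything that precedes the refcount */
	int refcnt;		/* intended to start on a 64-byte boundary */
};

int main(void)
{
	/* refuses to compile if refcnt ever moves off the boundary */
	BUILD_BUG_ON(offsetof(struct example, refcnt) & 63);
	printf("refcnt offset = %zu\n", offsetof(struct example, refcnt));
	return 0;
}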

2008-11-17 17:26:33

by Ingo Molnar

Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> > 4% on my machine, but apparently my machine is sooooo special (see
> > oprofile thread), so maybe its cpus have a hard time playing with
> > a contended cache line.
> >
> > It definitly needs more testing on other machines.
> >
> > Maybe you'll discover patch is bad on your machines, this is why
> > it's in net-next-2.6
>
> ok, i'll try it on my testbox too, to check whether it has any effect
> - find below the port to -git.

it gives a small speedup of ~1% on my box:

before: Throughput 3437.65 MB/sec 64 procs
after: Throughput 3473.99 MB/sec 64 procs

... although that's still a bit close to the natural tbench noise
range so it's not conclusive and not like a smoking gun IMO.

But i think this change might just be papering over the real
scalability problem that this workload has in my opinion: that there's
a single localhost route/dst/device that millions of packets are
squeezed through every second:

phoenix:~> ifconfig lo
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:258001524 errors:0 dropped:0 overruns:0 frame:0
TX packets:258001524 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:679809512144 (633.1 GiB) TX bytes:679809512144 (633.1 GiB)

There does not seem to be any per-CPU-ness in localhost networking -
it has a globally single-threaded rx/tx queue AFAICS, even if both the
client and server tasks are on the same CPU - how is that supposed to
perform well? (but i might be missing something)
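(For illustration, a hedged user-space sketch of what per-CPU accounting looks
like - invented names, not the actual loopback code:)

/*
 * Per-CPU counters: each CPU gets its own 64-byte slot, so the hot
 * path never dirties another CPU's cache line; totals are only
 * folded together on the (rare) read side.
 */
#include <stdio.h>

#define NR_CPUS 16

struct cpu_slot {
	unsigned long count;
	char pad[64 - sizeof(unsigned long)];	/* one line per CPU */
};

static struct cpu_slot rx_packets[NR_CPUS] __attribute__((aligned(64)));

/* hot path: touches only this CPU's line, no cross-CPU ping-pong */
static void rx_account(int cpu)
{
	rx_packets[cpu].count++;
}

/* slow path: fold the per-CPU slots into one total */
static unsigned long rx_total(void)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < NR_CPUS; i++)
		sum += rx_packets[i].count;
	return sum;
}

int main(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		rx_account(cpu);
	printf("total rx packets: %lu\n", rx_total());
	return 0;
}

(A globally shared rx/tx path, by contrast, makes every packet dirty the same
lines regardless of which CPU handles it.)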

What kind of test-system do you have - one with P4 style Xeon CPUs
perhaps where dirty-cacheline cachemisses to DRAM were particularly
expensive?

Ingo

2008-11-17 17:34:18

by Eric Dumazet

Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>>> 4% on my machine, but apparently my machine is sooooo special (see
>>> oprofile thread), so maybe its cpus have a hard time playing with
>>> a contended cache line.
>>>
>>> It definitly needs more testing on other machines.
>>>
>>> Maybe you'll discover patch is bad on your machines, this is why
>>> it's in net-next-2.6
>> ok, i'll try it on my testbox too, to check whether it has any effect
>> - find below the port to -git.
>
> it gives a small speedup of ~1% on my box:
>
> before: Throughput 3437.65 MB/sec 64 procs
> after: Throughput 3473.99 MB/sec 64 procs

Strange, I get 2350 MB/sec on my 8 cpus box. "tbench 8"

>
> ... although that's still a bit close to the natural tbench noise
> range so it's not conclusive and not like a smoking gun IMO.
>
> But i think this change might just be papering over the real
> scalability problem that this workload has in my opinion: that there's
> a single localhost route/dst/device that millions of packets are
> squeezed through every second:

Yes, this point was mentioned on netdev a while back.

>
> phoenix:~> ifconfig lo
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:258001524 errors:0 dropped:0 overruns:0 frame:0
> TX packets:258001524 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:679809512144 (633.1 GiB) TX bytes:679809512144 (633.1 GiB)
>
> There does not seem to be any per CPU ness in localhost networking -
> it has a globally single-threaded rx/tx queue AFAICS even if both the
> client and server task is on the same CPU - how is that supposed to
> perform well? (but i might be missing something)

Stephen had a patch for this one too, but we got tbench noise with that patch as well:

http://kerneltrap.org/mailarchive/linux-netdev/2008/11/5/3926034


>
> What kind of test-system do you have - one with P4 style Xeon CPUs
> perhaps where dirty-cacheline cachemisses to DRAM were particularly
> expensive?

It's an HP BL460c G1

Dual quad-core cpus Intel E5450 @3.00GHz

So 8 logical cpus. My bench was "tbench 8"

2008-11-17 17:40:44

by Linus Torvalds

Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28



On Mon, 17 Nov 2008, Eric Dumazet wrote:

> Ingo Molnar a écrit :

> > it gives a small speedup of ~1% on my box:
> >
> > before: Throughput 3437.65 MB/sec 64 procs
> > after: Throughput 3473.99 MB/sec 64 procs
>
> Strange, I get 2350 MB/sec on my 8-cpu box. "tbench 8"

I think Ingo may have a Nehalem. Let's just say that those things rock,
and have rather good memory throughput.

Linus

2008-11-17 17:44:00

by Eric Dumazet

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Linus Torvalds a écrit :
>
> On Mon, 17 Nov 2008, Eric Dumazet wrote:
>
>> Ingo Molnar a écrit :
>
>>> it gives a small speedup of ~1% on my box:
>>>
>>> before: Throughput 3437.65 MB/sec 64 procs
>>> after: Throughput 3473.99 MB/sec 64 procs
>> Strange, I get 2350 MB/sec on my 8-cpu box. "tbench 8"
>
> I think Ingo may have a Nehalem. Let's just say that those things rock,
> and have rather good memory throughput.
>

I want one :)

Or even two of them :)

2008-11-17 18:24:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Linus Torvalds <[email protected]> wrote:

> On Mon, 17 Nov 2008, Eric Dumazet wrote:
>
> > Ingo Molnar a écrit :
>
> > > it gives a small speedup of ~1% on my box:
> > >
> > > before: Throughput 3437.65 MB/sec 64 procs
> > > after: Throughput 3473.99 MB/sec 64 procs
> >
> > Strange, I get 2350 MB/sec on my 8-cpu box. "tbench 8"
>
> I think Ingo may have a Nehalem. Let's just say that those things
> rock, and have rather good memory throughput.

hm, i'm not sure whether i can post benchmarks from the Nehalem box -
but i can confirm in general terms that it's rather nice ;-)

This was run on another testbox (4x4 Barcelona) that rocks similarly
well in terms of memory subsystem latencies - which seem to be
tbench's main current critical path.

For the tbench bragging rights i'd probably turn off CONFIG_SECURITY
and a few other options. Plus i'd run with 16 threads only - in this
test i ran with 4x overload (64 tbench threads, not 16) to stress the
scheduler harder.

We degrade very gently with overload though, so the numbers aren't
all that much different:

16 threads: Throughput 3463.14 MB/sec 16 procs
64 threads: Throughput 3473.99 MB/sec 64 procs
256 threads: Throughput 3457.67 MB/sec 256 procs
1024 threads: Throughput 3448.85 MB/sec 1024 procs

[ so it's the same within noise range. ]

1024 threads is already a massive 64x overload, well beyond any
reasonable limit of workload sanity.

Which suggests that the main limiting factor is cacheline ping-pong
that is already in full effect at 16 threads.

Which is supported by the "most expensive instructions" top-10 sorted
list:

RIP #hits
..........................

[ usercopy ]
ffffffff80350fcd: 1373300 f3 48 a5 rep movsq %ds:(%rsi),%es:(%rdi)

ffffffff804a2f33: <sock_rfree>:
ffffffff804a2f34: 985253 48 89 e5 mov %rsp,%rbp


ffffffff804d2eb7: <ip_local_deliver>:
ffffffff804d2eb8: 432659 48 89 e5 mov %rsp,%rbp

ffffffff804aa23c: <constant_test_bit>: [ => napi_disable_pending() ]
ffffffff804aa24c: 374052 89 d1 mov %edx,%ecx

ffffffff804d5076: <ip_dont_fragment>:
ffffffff804d5076: 310051 8a 97 56 02 00 00 mov 0x256(%rdi),%dl

ffffffff804d9b17: <__inet_lookup_established>:
ffffffff804d9bdf: 247224 eb ba jmp ffffffff804d9b9b <__inet_lookup_established+0x84>

ffffffff80321529: <selinux_ip_postroute>:
ffffffff8032152a: 183700 48 89 e5 mov %rsp,%rbp

ffffffff8020c020: <system_call>:
ffffffff8020c020: 183600 0f 01 f8 swapgs

ffffffff8051884a: <netlbl_enabled>:
ffffffff8051884a: 179538 55 push %rbp

The usual profiling caveat applies: it's not _these_ instructions that
matter, but the surrounding code that calls them. Profiling overhead
is delayed by a couple of instructions - the more out-of-order a CPU
is, the larger this delay can be. But even a quick look at the list
above shows that all of the heavy cachemisses are generated by
networking.

Beyond the usual suspects of syscall entry and memcpy, it's only
networking. We don't even have the mov %cr3 TLB flush overhead in this
list - load_cr3() is a distant #30:

ffffffff8023049f: 0 0f 22 d8 mov %rax,%cr3
ffffffff802304a2: 126303 c9 leaveq

The place for the sock_rfree() hit looks a bit weird, and i'll
investigate it now a bit more to place the real overhead point
properly. (i already mapped the test-bit overhead: that comes from
napi_disable_pending())

The first entry is 10x the cost of the last entry in the list, so
clearly we've got 1-2 brutal cacheline ping-pongs that dominate the
overhead of this workload.
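
( as a quick standalone illustration of how brutal such a ping-pong
is - a hypothetical user-space micro-benchmark, two threads hammering
either two counters in the same cacheline or two padded counters: )

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

/* same cacheline -> ping-pong; padded -> each CPU keeps its own line */
static struct { volatile unsigned long a, b; } same_line;
static struct { volatile unsigned long v; char pad[120]; } padded[2];

static void *inc_same_a(void *p) { for (unsigned long i = 0; i < ITERS; i++) same_line.a++; return p; }
static void *inc_same_b(void *p) { for (unsigned long i = 0; i < ITERS; i++) same_line.b++; return p; }
static void *inc_pad_0(void *p) { for (unsigned long i = 0; i < ITERS; i++) padded[0].v++; return p; }
static void *inc_pad_1(void *p) { for (unsigned long i = 0; i < ITERS; i++) padded[1].v++; return p; }

static double run(void *(*f1)(void *), void *(*f2)(void *))
{
	pthread_t t1, t2;
	struct timespec s, e;

	clock_gettime(CLOCK_MONOTONIC, &s);
	pthread_create(&t1, NULL, f1, NULL);
	pthread_create(&t2, NULL, f2, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	clock_gettime(CLOCK_MONOTONIC, &e);
	return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
}

int main(void)
{
	/* build with: gcc -O2 -std=gnu99 pingpong.c -lpthread -lrt */
	printf("shared cacheline: %.2fs\n", run(inc_same_a, inc_same_b));
	printf("padded counters:  %.2fs\n", run(inc_pad_0, inc_pad_1));
	return 0;
}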

Ingo

2008-11-17 18:33:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28



On Mon, 17 Nov 2008, Ingo Molnar wrote:
>
> hm, i'm not sure whether i can post benchmarks from the Nehalem box -
> but i can confirm in general terms that it's rather nice ;-)

Intel lifted the NDA for various web sites a week or two ago, and Intel
is now selling it in the US (I think today was in fact the official
launch), so I think benchmarks are safe - you can buy the dang things on
the street.

I don't know what availability is, of course. But I doubt that Intel would
mind Nehalem benchmarks even if it were a paper launch - at least from my
personal experience, I've not seen any bad behavior (and plenty of good).

> This was run on another testbox (4x4 Barcelona) that rocks similarly
> well in terms of memory subsystem latencies - which seem to be
> tbench's main current critical path.

Ahh, ok.

Linus

2008-11-17 18:50:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> The place for the sock_rfree() hit looks a bit weird, and i'll
> investigate it now a bit more to place the real overhead point
> properly. (i already mapped the test-bit overhead: that comes from
> napi_disable_pending())

ok, here's a new set of profiles. (again for tbench 64-thread on a
16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i
posted before.)

Here are the per major subsystem percentages:

NET overhead ( 5786945/10096751): 57.31%
security overhead ( 925933/10096751): 9.17%
usercopy overhead ( 837887/10096751): 8.30%
sched overhead ( 753662/10096751): 7.46%
syscall overhead ( 268809/10096751): 2.66%
IRQ overhead ( 266500/10096751): 2.64%
slab overhead ( 180258/10096751): 1.79%
timer overhead ( 92986/10096751): 0.92%
pagealloc overhead ( 87381/10096751): 0.87%
VFS overhead ( 53295/10096751): 0.53%
PID overhead ( 44469/10096751): 0.44%
pagecache overhead ( 33452/10096751): 0.33%
gtod overhead ( 11064/10096751): 0.11%
IDLE overhead ( 0/10096751): 0.00%
---------------------------------------------------------
left ( 753878/10096751): 7.47%

The breakdown is very similar to what i sent before, within noise.

[ 'left' is random overhead from all around the place - i categorized
the 500 most expensive functions in the profile per subsystem.
I stopped short of doing it for all 1300+ functions: it's rather
laborious manual work even with hefty use of regex patterns.
It's also less meaningful in practice: the trend in the first 500
functions is present in the remaining 800 functions as well. I
watched the breakdown evolve as i increased the coverage - in
practice it is the first 100 functions that matter - it just doesn't
change after that. ]
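
( the bucketing itself is scriptable - a rough, hypothetical C helper
for the idea, matching symbol names against per-subsystem regex
patterns and summing their shares; the patterns here are made up and
still need the manual refinement mentioned above: )

#include <regex.h>
#include <stdio.h>

struct bucket {
	const char *name;
	const char *pattern;
	double sum;
	regex_t re;
};

static struct bucket buckets[] = {
	{ "NET",   "^(tcp_|ip_|skb_|sock_|netif_|dev_|inet_|nf_)" },
	{ "sched", "^(schedule|sched_|enqueue_|dequeue_|pick_next_)" },
	/* ... more subsystems, refined by hand ... */
};

int main(void)
{
	char line[256], sym[128];
	double share, left = 0.0;
	size_t i, n = sizeof(buckets) / sizeof(buckets[0]);

	for (i = 0; i < n; i++)
		regcomp(&buckets[i].re, buckets[i].pattern, REG_EXTENDED);

	/* expects "share symbol" lines, like the normalized list below */
	while (fgets(line, sizeof(line), stdin)) {
		if (sscanf(line, "%lf %127s", &share, sym) != 2)
			continue;
		for (i = 0; i < n; i++)
			if (!regexec(&buckets[i].re, sym, 0, NULL, 0))
				break;
		if (i < n)
			buckets[i].sum += share;
		else
			left += share;
	}

	for (i = 0; i < n; i++)
		printf("%-8s %7.2f%%\n", buckets[i].name, buckets[i].sum);
	printf("%-8s %7.2f%%\n", "left", left);
	return 0;
}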

The readprofile output below seems structured in a more useful way now
- i tweaked compiler options to have the profiler hits spread out in a
more meaningful way. I collected 10 million NMI profiler hits, and
normalized the readprofile output up to 100%.

[ I'll post per function analysis as i complete them, as a reply to
this mail. ]

Ingo

100.000000 total
................
7.253355 copy_user_generic_string
3.934833 avc_has_perm_noaudit
3.356152 ip_queue_xmit
3.038025 skb_release_data
2.118525 skb_release_head_state
1.997533 tcp_ack
1.833688 tcp_recvmsg
1.717771 eth_type_trans
1.673249 __inet_lookup_established
1.508888 system_call
1.469183 tcp_current_mss
1.431553 tcp_transmit_skb
1.385125 tcp_sendmsg
1.327643 tcp_v4_rcv
1.292328 nf_hook_thresh
1.203205 schedule
1.059501 nf_hook_slow
1.027373 constant_test_bit
0.945183 sock_rfree
0.922748 __switch_to
0.911605 netif_rx
0.876270 register_gifconf
0.788200 ip_local_deliver_finish
0.781467 dev_queue_xmit
0.766530 constant_test_bit
0.758208 _local_bh_enable_ip
0.747184 load_cr3
0.704341 memset_c
0.671260 sysret_check
0.651845 ip_finish_output2
0.620204 audit_free_names
0.617781 audit_syscall_exit
0.615149 skb_copy_datagram_iovec
0.613848 selinux_socket_sock_rcv_skb
0.606995 constant_test_bit
0.593936 __tcp_push_pending_frames
0.592198 tcp_cleanup_rbuf
0.574093 ip_rcv
0.567886 netif_receive_skb
0.563377 get_page_from_freelist
0.557657 tcp_event_data_recv
0.539274 ip_local_deliver
0.534130 sys_recvfrom
0.512321 __tcp_select_window
0.498427 tcp_rcv_established
0.494862 sys_sendto
0.487473 audit_syscall_entry
0.478495 sched_clock_cpu
0.474861 kfree
0.466310 tcp_established_options
0.461384 net_rx_action
0.447162 __mod_timer
0.442078 ip_rcv_finish
0.441631 find_pid_ns
0.441124 sk_wait_data
0.423943 __sock_recvmsg
0.422126 selinux_parse_skb
0.417975 __napi_schedule
0.414082 __do_softirq
0.403604 task_rq_lock
0.380792 nf_iterate
0.377614 select_task_rq_fair
0.374973 sock_sendmsg
0.374635 kmem_cache_alloc_node
0.368775 avc_has_perm
0.368706 local_bh_disable
0.361834 release_sock
0.346400 sock_common_recvmsg
0.342825 skb_clone
0.338704 __alloc_skb
0.326488 do_softirq
0.323410 lock_sock_nested
0.322129 __copy_skb_header
0.316835 put_page
0.310966 selinux_ip_postroute
0.306229 sel_netport_sid
0.299863 try_to_wake_up
0.296288 process_backlog
0.294818 __inet_lookup
0.294778 thread_return
0.293219 cfs_rq_of
0.292315 internal_add_timer
0.292305 tcp_rcv_space_adjust
0.281053 constant_test_bit
0.278779 local_bh_enable
0.272910 *unknown*
0.269593 schedule_timeout
0.261846 tcp_v4_md5_lookup
0.260992 __ip_local_out
0.255868 __enqueue_entity
0.253931 avc_audit
0.252004 finish_task_switch
0.249263 audit_get_context
0.248290 sockfd_lookup_light
0.247416 virt_to_head_page
0.244149 tcp_options_write
0.243603 memcpy_toiovec
0.243434 sock_recvmsg
0.242599 call_softirq
0.242391 __unlazy_fpu
0.236412 fput_light
0.235628 ret_from_sys_call
0.234933 sk_reset_timer
0.228358 math_state_restore
0.227117 socket_has_perm
0.223492 virt_to_cache
0.219063 __cache_free
0.216401 update_curr
0.216232 tcp_v4_send_check
0.213978 audit_free_aux
0.213223 tcp_v4_do_rcv
0.212975 __kfree_skb
0.211137 dev_hard_start_xmit
0.209052 tcp_rtt_estimator
0.207999 netif_needs_gso
0.207662 __update_sched_clock
0.207284 rb_erase
0.204861 enqueue_task_fair
0.203490 skb_release_all
0.203252 tcp_send_delayed_ack
0.203232 inet_ehashfn
0.199846 sel_netport_find
0.195396 system_call_after_swapgs
0.186756 lock_timer_base
0.186687 pick_next_task_fair
0.183986 mod_timer
0.182982 loopback_xmit
0.182605 native_read_tsc
0.181195 skb_set_owner_r
0.179248 switch_mm
0.175584 set_next_entity
0.173329 raw_local_deliver
0.171641 sys_kill
0.164510 dequeue_task_fair
0.161938 clear_bit
0.160528 sock_def_readable
0.157628 __tcp_ack_snd_check
0.156893 skb_can_coalesce
0.156556 tcp_snd_wnd_test
0.155662 ip_output
0.150627 sk_stream_alloc_skb
0.150219 cpu_sdc
0.149425 sysret_careful
0.148760 tcp_data_snd_check
0.147816 auditsys
0.147419 pskb_may_pull
0.147151 fget_light
0.143774 tcp_cwnd_test
0.143029 rb_insert_color
0.142265 __wake_up
0.141808 tcp_bound_to_half_wnd
0.138600 __sk_dst_check
0.138431 free_hot_cold_page
0.137954 unroll_tree_refs
0.137080 __skb_unlink
0.135124 __sock_sendmsg
0.135064 get_pageblock_flags_group
0.132701 kmem_cache_free
0.128152 bictcp_cong_avoid
0.127874 __napi_complete
0.127527 ____cache_alloc
0.127368 tcp_is_cwnd_limited
0.127278 find_vpid
0.126941 constant_test_bit
0.126504 sk_mem_charge
0.126255 __alloc_pages_internal
0.125977 dst_release
0.125521 hash_64
0.124895 put_prev_task_fair
0.123802 netlbl_enabled
0.122829 sched_clock
0.122640 skb_push
0.122035 __phys_addr
0.121161 dput
0.120515 tcp_prequeue_process
0.118916 __skb_dequeue
0.117715 selinux_socket_sendmsg
0.117536 __inc_zone_state
0.115907 sk_wake_async
0.113504 selinux_ipv4_output
0.113017 sel_netif_sid
0.112431 skb_reset_network_header
0.111170 check_preempt_wakeup
0.111061 bictcp_acked
0.110882 sel_netnode_find
0.109978 update_min_vruntime
0.109889 resched_task
0.109879 current_kernel_time
0.109432 tcp_checksum_complete_user
0.107476 ip_dont_fragment
0.107386 sysret_audit
0.106979 inet_csk_reset_xmit_timer
0.106006 skb_entail
0.105777 sysret_signal
0.105420 avc_hash
0.105251 __skb_clone
0.105211 tcp_init_tso_segs
0.103523 __dequeue_entity
0.101715 PageLRU
0.101378 tcp_parse_aligned_timestamp
0.101219 __xchg
0.100544 constant_test_bit
0.097991 __kmalloc
0.097584 test_tsk_thread_flag
0.097475 autoremove_wake_function
0.095747 selinux_task_kill
0.094416 get_page
0.093353 dequeue_task
0.092728 __local_bh_disable
0.091943 selinux_netlbl_sock_rcv_skb
0.091655 path_put
0.090970 skb_headroom
0.090950 PageTail
0.090642 dst_destroy
0.090523 netpoll_rx
0.089589 skb_header_pointer
0.085935 security_socket_recvmsg
0.084008 alloc_pages_current
0.083184 compare_ether_addr
0.082479 rb_next
0.082439 sk_wmem_schedule
0.081635 next_zones_zonelist
0.080135 tcp_cwnd_validate
0.079877 tcp_event_new_data_sent
0.079817 fcheck_files
0.079082 ip_skb_dst_mtu
0.078804 ip_finish_output
0.078278 wakeup_preempt_entity
0.077026 sel_netif_find
0.076788 __skb_queue_tail
0.076570 sock_flag
0.076520 tcp_win_from_space
0.076510 zone_watermark_ok
0.076282 sel_netnode_sid
0.076162 policy_zonelist
0.074732 __wake_up_common
0.074613 compound_head
0.074593 task_has_perm
0.073243 __find_general_cachep
0.073064 tcp_push
0.072925 skb_cloned
0.072309 pskb_may_pull
0.071852 TCP_ECN_check_ce
0.071495 cap_task_to_inode
0.070770 default_wake_function
0.069429 xfrm4_policy_check
0.069091 tcp_parse_md5sig_option
0.068287 tcp_v4_md5_do_lookup
0.068059 tcp_v4_tw_remember_stamp
0.067344 tcp_ca_event
0.067125 tcp_ca_event
0.065457 place_entity
0.065318 write_seqlock
0.065089 device_not_available
0.065069 test_ti_thread_flag
0.063878 tcp_set_skb_tso_segs
0.063550 selinux_netlbl_inode_permission
0.063391 sock_wfree
0.063311 prepare_to_wait
0.058872 pid_vnr
0.058803 __cycles_2_ns
0.057631 ip_local_out
0.057333 tcp_ack_saw_tstamp
0.056896 copy_to_user
0.056628 set_bit
0.055913 free_pages_check
0.054969 tcp_rcv_rtt_measure_ts
0.053797 init_rootdomain
0.053708 selinux_socket_recvmsg
0.053698 pid_nr_ns
0.053629 sk_eat_skb
0.052814 _local_bh_enable
0.052645 nf_hook_thresh
0.052516 sched_info_queued
0.052457 enqueue_task
0.052228 sk_filter
0.052159 __cpu_clear
0.051980 local_bh_enable_ip
0.050292 update_rq_clock
0.048981 task_tgid_vnr
0.048881 copy_from_user
0.048782 tcp_parse_options
0.048484 lock_sock
0.047779 net_timestamp
0.047044 open_softirq
0.046955 tcp_win_from_space
0.045981 __skb_dequeue
0.043846 getboottime
0.043777 account_group_exec_runtime
0.043519 can_checksum_protocol
0.043469 set_user_nice
0.042784 skb_fill_page_desc
0.042247 security_socket_sendmsg
0.041989 read_profile
0.041930 tcp_validate_incoming
0.041612 check_preempt_curr
0.041413 skb_pull
0.041026 generic_smp_call_function_interrupt
0.041016 calc_delta_fair
0.040936 clear_buddies
0.040768 tcp_data_queue
0.040698 page_count
0.039695 lock_sock
0.039099 skb_headroom
0.038851 system_call_fastpath
0.038622 zone_statistics
0.037500 tcp_sack_extend
0.037381 __kmalloc_node
0.036587 first_zones_zonelist
0.036497 mntput
0.036179 pick_next_task
0.035991 kmap
0.035911 sock_put
0.035613 deactivate_task
0.035027 __nr_to_section
0.033985 page_zone
0.033190 native_load_tls
0.032882 netif_tx_queue_stopped
0.032713 __skb_insert
0.032187 sock_flag
0.031988 check_kill_permission
0.031790 policy_nodemask
0.031621 detach_timer
0.030558 inet_csk_clear_xmit_timer
0.030469 task_rq_unlock
0.029883 tcp_nagle_test
0.029744 tracesys
0.028383 virt_to_slab
0.028115 tcp_v4_check
0.028046 __cpu_set
0.027658 page_get_cache
0.027063 tcp_store_ts_recent
0.027053 __skb_pull
0.026953 gfp_zone
0.026586 sock_rcvlowat
0.026576 csum_partial
0.026397 init_waitqueue_head
0.026109 finish_wait
0.026040 kill_pid_info
0.025404 tcp_full_space
0.024888 __skb_queue_before
0.024550 dst_confirm
0.022603 inet_ehash_bucket
0.021888 activate_task
0.021650 tcp_rto_min
0.021283 d_callback
0.020965 signal_pending
0.020925 avc_node_free
0.020915 empty_bucket
0.020746 group_send_sig_info
0.020657 skb_reset_transport_header
0.020061 sock_put
0.019992 signal_pending_state
0.019684 tcp_sync_mss
0.019346 skb_network_offset
0.019276 skb_split
0.018988 tcp_adjust_fackets_out
0.018204 tcp_fast_path_check
0.017727 __skb_unlink
0.017687 napi_disable_pending
0.017678 sg_set_page
0.017022 get_pageblock_bitmap
0.016972 tcp_cong_avoid
0.016962 pid_task
0.016754 skb_set_tail_pointer
0.016039 selinux_ipv4_postroute
0.015930 idle_cpu
0.015632 skb_reset_network_header
0.015552 __count_vm_events
0.015483 source_load
0.014867 __skb_unlink
0.014738 skb_reset_transport_header
0.014599 set_bit
0.014241 audit_zero_context
0.014231 zone_page_state
0.014152 clear_bit
0.013874 PageSlab
0.013546 __memset
0.013238 get_pageblock_migratetype
0.012623 __rb_rotate_right
0.012543 kmem_find_general_cachep
0.012414 __kprobes_text_start
0.012344 security_sock_rcv_skb
0.012344 node_zonelist
0.012335 dnotify_parent
0.012096 skb_headroom
0.011778 tcp_push_one
0.011540 mnt_want_write
0.011143 kmalloc
0.011073 retint_swapgs
0.010954 __rb_rotate_left
0.010805 check_pgd_range
0.010785 tcp_mss_split_point
0.010755 migrate_timer_list
0.010338 __send_IPI_dest_field
0.010229 reschedule_interrupt
0.010179 sock_flag
0.009882 smp_call_function_mask
0.009673 test_tsk_need_resched
0.009564 tcp_urg
0.009504 generic_file_aio_read
0.009176 PageReserved
0.009147 net_invalid_timestamp
0.009087 __node_set
0.008749 do_tcp_setsockopt
0.008730 set_tsk_thread_flag
0.008720 tcp_enter_loss
0.008422 sock_error
0.008362 target_load
0.008302 crypto_hash_update
0.008104 PageReadahead
0.008044 tcp_poll
0.007915 tcp_checksum_complete
0.007329 tcp_snd_test
0.007309 selinux_file_permission
0.007290 sel_netif_destroy
0.007220 put_pages_list
0.006992 dst_output
0.006743 prepare_to_copy
0.006694 tcp_init_cwnd
0.006555 clear_bit
0.006535 set_bit
0.006425 normal_prio
0.006366 msleep
0.006346 error_sti
0.006336 tcp_rcv_rtt_update
0.006167 tcp_send_ack
0.005989 tcp_init_nondata_skb
0.005720 kfree_skb
0.005502 call_function_interrupt
0.005413 __count_vm_event
0.005403 __skb_checksum_complete_head
0.005363 page_cache_get_speculative
0.005323 dev_kfree_skb_irq
0.005174 skb_store_bits
0.004956 cpu_avg_load_per_task
0.004916 dev_cpu_callback
0.004807 __kmem_cache_destroy
0.004777 tcp_init_metrics
0.004777 io_schedule
0.004777 find_get_page
0.004707 eth_header_parse
0.004688 cap_task_kill
0.004678 error_exit
0.004668 rb_prev
0.004658 tso_fragment
0.004648 mmdrop
0.004628 skb_reset_tail_pointer
0.004598 apic_timer_interrupt
0.004588 clear_bit
0.004519 tcp_simple_retransmit
0.004449 get_max_files
0.004370 sk_stop_timer
0.004340 tcp_reset
0.004251 netlbl_cache_add
0.004201 tcp_add_reno_sack
0.004151 __pskb_trim_head
0.004102 __profile_flip_buffers
0.004092 sk_common_release
0.004052 audit_copy_inode
0.003953 eth_change_mtu
0.003943 vfs_read
0.003923 run_timer_softirq
0.003843 mnt_drop_write
0.003814 clear_page_c
0.003804 do_sync_read
0.003744 unset_migratetype_isolate
0.003714 sk_stream_moderate_sndbuf
0.003545 tcp_try_rmem_schedule
0.003476 native_apic_mem_write
0.003466 sys_read
0.003446 skb_checksum
0.003436 timer_set_base
0.003426 security_task_kill
0.003416 __flow_cache_shrink
0.003406 __skb_checksum_complete
0.003277 alloc_skb
0.003267 physflat_send_IPI_mask
0.003218 skb_gso_ok
0.003178 constant_test_bit
0.003168 find_next_bit
0.003158 selinux_netlbl_skbuff_getsid
0.003118 constant_test_bit
0.003099 pull_task
0.003079 hrtimer_run_queues
0.003049 free_hot_page
0.003009 scheduler_tick
0.002900 set_32bit_tls
0.002890 tcp_acceptable_seq
0.002811 rw_verify_area
0.002751 radix_tree_lookup_slot
0.002731 zero_user_segment
0.002731 sock_common_setsockopt
0.002612 __load_balance_iterator
0.002473 run_posix_cpu_timers
0.002264 task_utime
0.002254 switched_to_fair
0.002185 fsnotify_access
0.002145 __rmqueue_smallest
0.002125 __schedule_bug
0.002095 __task_rq_lock
0.002086 tcp_may_update_window
0.002076 restore_args
0.002066 hrtimer_run_pending
0.002056 generic_segment_checks
0.002026 getnstimeofday
0.002006 idle_task
0.001976 touch_atime
0.001956 __wake_up_locked
0.001927 sk_mem_charge
0.001877 smp_apic_timer_interrupt
0.001827 native_smp_send_reschedule
0.001798 __tcp_fast_path_on
0.001788 file_read_actor
0.001768 _cond_resched
0.001738 avc_policy_seqno
0.001718 tcp_ack_snd_check
0.001629 ip_send_check
0.001619 account_system_time
0.001579 __xapic_wait_icr_idle
0.001579 get_stats
0.001539 tcp_set_state
0.001539 bictcp_state
0.001529 tcp_fast_path_on
0.001519 file_accessed
0.001480 get_seconds
0.001450 kernel_math_error
0.001410 ktime_set
0.001331 kmap_atomic
0.001281 printk_tick
0.001281 __next_cpu_nr
0.001271 account_group_system_time
0.001261 __mod_zone_page_state
0.001222 weighted_cpuload
0.001192 security_file_permission
0.001162 ack_APIC_irq
0.001152 __free_one_page
0.001142 rcu_pending
0.001142 drain_array
0.001122 sched_clock_tick
0.001122 csum_fold
0.001102 ret_from_intr
0.001083 retint_careful
0.001073 need_resched
0.001073 calc_delta_mine
0.001043 tcp_v4_md5_do_del
0.001043 PageActive
0.001033 mark_page_accessed
0.001033 ktime_get_ts
0.001023 tcp_insert_write_queue_after
0.001013 tcp_delack_timer
0.001013 task_tick_fair
0.000973 delay_tsc
0.000963 nv_nic_irq_optimized
0.000904 tick_periodic
0.000894 skb_reserve
0.000884 cache_reap
0.000874 timespec_trunc
0.000864 skb_header_release
0.000854 zone_page_state_add
0.000844 update_process_times
0.000834 sk_rmem_schedule
0.000824 find_busiest_group
0.000804 current_fs_time
0.000785 tick_handle_periodic
0.000785 __sk_mem_schedule
0.000785 irq_enter
0.000755 use_cpu_writer_for_mount
0.000755 tcp_ratehalving_spur_to_response
0.000745 update_wall_time
0.000745 tcp_sendpage
0.000745 __alloc_pages_nodemask
0.000725 ktime_get
0.000725 irq_exit
0.000705 inotify_inode_queue_event
0.000665 set_pageblock_flags_group
0.000646 inotify_dentry_parent_queue_event
0.000626 ack_APIC_irq
0.000606 write_profile
0.000566 set_normalized_timespec
0.000566 raise_softirq
0.000526 task_cputime_zero
0.000516 smp_reschedule_interrupt
0.000516 __skb_insert
0.000497 page_fault
0.000497 __copy_user_nocache
0.000487 run_local_timers
0.000487 read_tsc
0.000487 nf_unregister_hook
0.000477 __rcu_pending
0.000477 jiffies_to_usecs
0.000457 timespec_to_ktime
0.000437 __skb_trim
0.000427 __call_rcu
0.000417 free_pages_bulk
0.000407 smp_call_function_interrupt
0.000397 set_irq_regs
0.000397 radix_tree_deref_slot
0.000397 expand
0.000387 handle_mm_fault
0.000387 handle_IRQ_event
0.000387 fput_light
0.000377 refresh_cpu_vm_stats
0.000377 n_tty_write
0.000367 get_page
0.000358 run_rebalance_domains
0.000358 get_cpu_mask
0.000348 task_hot
0.000348 __skb_queue_after
0.000348 retint_check
0.000348 do_select
0.000338 PageUptodate
0.000338 copy_page_c
0.000328 cond_resched
0.000318 unmap_vmas
0.000318 sk_mem_reclaim
0.000318 rmqueue_bulk
0.000318 reciprocal_value
0.000318 irq_return
0.000308 rb_first
0.000308 alloc_skb
0.000308 account_process_tick
0.000298 net_enable_timestamp
0.000298 clocksource_read
0.000298 account_system_time_scaled
0.000288 sched_slice
0.000278 ip_compute_csum
0.000278 constant_test_bit
0.000278 constant_test_bit
0.000268 set_curr_task_fair
0.000268 note_interrupt
0.000268 exit_idle
0.000258 native_apic_mem_write
0.000258 exit_intr
0.000248 PageReferenced
0.000238 usb_hcd_irq
0.000238 __mnt_is_readonly
0.000238 constant_test_bit
0.000218 IRQ0xba_interrupt
0.000218 handle_fasteoi_irq
0.000209 raise_softirq_irqoff
0.000209 __find_get_block
0.000199 tcp_current_ssthresh
0.000199 n_tty_receive_buf
0.000189 wake_up_page
0.000189 vgacon_save_screen
0.000189 free_block
0.000189 constant_test_bit
0.000179 pagefault_disable
0.000169 clocksource_get_next
0.000169 __bitmap_weight
0.000159 tty_ldisc_deref
0.000159 tcp_write_timer
0.000159 kmem_cache_alloc
0.000159 free_alien_cache
0.000159 ext3_mark_iloc_dirty
0.000159 constant_test_bit
0.000159 __bitmap_equal
0.000149 transfer_objects
0.000149 __rcu_process_callbacks
0.000149 page_waitqueue
0.000149 constant_test_bit
0.000139 __rmqueue
0.000139 release_pages
0.000139 constant_test_bit
0.000129 __tcp_checksum_complete
0.000129 run_workqueue
0.000129 poll_freewait
0.000129 n_tty_read
0.000129 iommu_area_free
0.000129 generic_file_llseek
0.000129 __cpus_setall
0.000129 cond_resched_softirq
0.000129 avc_node_populate
0.000129 add_to_page_cache_lru
0.000129 account_user_time
0.000119 wait_consider_task
0.000119 sys_select
0.000119 round_jiffies_common
0.000119 nv_start_xmit_optimized
0.000119 core_sys_select
0.000109 tcp_tso_segment
0.000109 sigprocmask
0.000109 proc_reg_read
0.000109 path_to_nameidata
0.000109 PageBuddy
0.000109 ohci_irq
0.000109 nv_tx_done_optimized
0.000109 nv_msi_workaround
0.000109 IRQ0xc2_interrupt
0.000109 __ext3_get_inode_loc
0.000109 account_group_user_time
0.000099 __wake_up_sync
0.000099 __up_read
0.000099 update_vsyscall
0.000099 memmove
0.000099 kmalloc
0.000099 ext3_get_blocks_handle
0.000099 do_device_not_available
0.000099 constant_test_bit
0.000089 tcp_incr_quickack
0.000089 smp_send_reschedule
0.000089 remove_from_page_cache
0.000089 rcu_process_callbacks
0.000089 prepare_to_wait_exclusive
0.000089 pde_users_dec
0.000089 find_first_bit
0.000089 constant_test_bit
0.000089 common_interrupt
0.000089 add_wait_queue
0.000079 task_gtime
0.000079 sys_lseek
0.000079 start_this_handle
0.000079 schedule_hrtimeout_range
0.000079 __sched_fork
0.000079 journal_put_journal_head
0.000079 find_first_zero_bit
0.000079 do_syslog
0.000079 do_sync_write
0.000079 constant_test_bit
0.000079 ack_apic_level
0.000070 write_seqlock
0.000070 slab_get_obj
0.000070 remove_wait_queue
0.000070 pty_chars_in_buffer
0.000070 ____pagevec_lru_add
0.000070 lock_hrtimer_base
0.000070 kstat_incr_irqs_this_cpu
0.000070 journal_dirty_data
0.000070 journal_add_journal_head
0.000070 find_lock_page
0.000070 copy_from_read_buf
0.000070 bit_waitqueue
0.000070 alloc_page_vma
0.000060 vfs_write
0.000060 tty_write
0.000060 __strnlen_user
0.000060 sk_mem_uncharge
0.000060 rt_worker_func
0.000060 radix_tree_preload
0.000060 poll_select_copy_remaining
0.000060 pagefault_enable
0.000060 __mark_inode_dirty
0.000060 lru_add_drain_all
0.000060 lock_page
0.000060 list_replace_init
0.000060 journal_stop
0.000060 iowrite8
0.000060 hrtimer_forward
0.000060 gart_unmap_single
0.000060 find_vma
0.000060 __down_read_trylock
0.000060 do_page_fault
0.000060 do_IRQ
0.000060 create_empty_buffers
0.000060 constant_test_bit
0.000060 constant_test_bit
0.000060 alloc_iommu
0.000060 add_to_page_cache_locked
0.000050 zero_fd_set
0.000050 vsnprintf
0.000050 unlock_page
0.000050 tty_read
0.000050 tty_poll
0.000050 sock_poll
0.000050 sock_def_error_report
0.000050 set_wq_data
0.000050 rcu_check_callbacks
0.000050 radix_tree_node_rcu_free
0.000050 pipe_poll
0.000050 opost
0.000050 n_tty_chars_in_buffer
0.000050 __next_cpu
0.000050 mutex_trylock
0.000050 msecs_to_jiffies
0.000050 mempool_alloc_slab
0.000050 load_elf_binary
0.000050 __link_path_walk
0.000050 __journal_remove_journal_head
0.000050 journal_commit_transaction
0.000050 journal_cancel_revoke
0.000050 irq_complete_move
0.000050 irq_cfg
0.000050 fsnotify_modify
0.000050 __first_cpu
0.000050 file_update_time
0.000050 filemap_fault
0.000050 ext3_new_blocks
0.000050 ext3_mark_inode_dirty
0.000050 do_wp_page
0.000050 __do_fault
0.000050 buffer_dirty
0.000050 anon_vma_prepare
0.000040 yield
0.000040 wq_per_cpu
0.000040 walk_page_buffers
0.000040 __wake_up_bit
0.000040 vma_adjust
0.000040 tty_put_char
0.000040 tty_paranoia_check
0.000040 tcp_current_ssthresh
0.000040 sys_write
0.000040 sys_rt_sigprocmask
0.000040 sock_no_bind
0.000040 show_stat
0.000040 SetPageSwapBacked
0.000040 set_irq_regs
0.000040 set_buffer_write_io_error
0.000040 recalc_sigpending
0.000040 radix_tree_delete
0.000040 queue_delayed_work_on
0.000040 pty_write
0.000040 __pollwait
0.000040 physflat_send_IPI_allbutself
0.000040 page_zone
0.000040 page_remove_rmap
0.000040 page_is_file_cache
0.000040 page_evictable
0.000040 nv_get_empty_tx_slots
0.000040 n_tty_poll
0.000040 next_zone
0.000040 next_online_pgdat
0.000040 need_resched
0.000040 mutex_unlock
0.000040 mpol_needs_cond_ref
0.000040 __lookup
0.000040 journal_invalidatepage
0.000040 journal_dirty_metadata
0.000040 ioread8
0.000040 input_available_p
0.000040 inet_csk_reset_xmit_timer
0.000040 get_fd_set
0.000040 generic_write_checks
0.000040 free_poll_entry
0.000040 fput
0.000040 __ext3_journal_stop
0.000040 ext3_get_group_desc
0.000040 ext3_get_block
0.000040 do_mpage_readpage
0.000040 __d_lookup
0.000040 del_page_from_lru
0.000040 __dec_zone_state
0.000040 copy_user_generic
0.000040 __bitmap_and
0.000040 add_page_to_lru_list
0.000040 account_user_time_scaled
0.000040 account_steal_time
0.000030 worker_thread
0.000030 wake_up_bit
0.000030 vmstat_update
0.000030 vm_normal_page
0.000030 tty_write_unlock
0.000030 tty_write_lock
0.000030 tty_wakeup
0.000030 tty_ldisc_try
0.000030 tty_ioctl
0.000030 tag_get
0.000030 sys_pread64
0.000030 submit_bh
0.000030 stop_this_cpu
0.000030 sock_aio_write
0.000030 sk_mem_reclaim
0.000030 sk_backlog_rcv
0.000030 show_interrupts
0.000030 sg_next
0.000030 seq_printf
0.000030 send_remote_softirq
0.000030 remove_vma
0.000030 reg_delay
0.000030 radix_tree_lookup
0.000030 radix_tree_insert
0.000030 proc_lookup_de
0.000030 pipe_write
0.000030 __percpu_counter_add
0.000030 pci_map_single
0.000030 nv_napi_poll
0.000030 __next_node
0.000030 native_send_call_func_ipi
0.000030 mpage_readpages
0.000030 mix_pool_bytes_extract
0.000030 mii_rw
0.000030 mempool_alloc
0.000030 __make_request
0.000030 jbd_lock_bh_state
0.000030 iov_iter_copy_from_user_atomic
0.000030 insert_work
0.000030 hrtimer_try_to_cancel
0.000030 get_dma_ops
0.000030 __generic_file_aio_write_nolock
0.000030 gart_map_sg
0.000030 __fput
0.000030 fixup_irqs
0.000030 __find_get_block_slow
0.000030 filp_close
0.000030 ext3_get_branch
0.000030 ext3_dirty_inode
0.000030 ext3_block_to_path
0.000030 do_get_write_access
0.000030 delayed_work_timer_fn
0.000030 csum_block_add
0.000030 copy_process
0.000030 copy_page_range
0.000030 constant_test_bit
0.000030 constant_test_bit
0.000030 check_irqs_on
0.000030 call_rcu
0.000030 __brelse
0.000030 _atomic_dec_and_lock
0.000020 __xchg
0.000020 vm_stat_account
0.000020 vma_prio_tree_remove
0.000020 tty_mode_ioctl
0.000020 tty_audit_add_data
0.000020 try_to_free_buffers
0.000020 truncate_inode_pages_range
0.000020 tcp_slow_start
0.000020 task_curr
0.000020 sys_setpgid
0.000020 sys_rt_sigreturn
0.000020 sys_getppid
0.000020 strncpy_from_user
0.000020 sock_put
0.000020 smp_call_function
0.000020 __sk_mem_reclaim
0.000020 signal_wake_up
0.000020 signal_pending
0.000020 set_termios
0.000020 SetPageUptodate
0.000020 SetPageLRU
0.000020 set_fd_set
0.000020 set_bit
0.000020 __send_IPI_shortcut
0.000020 security_inode_need_killpriv
0.000020 scsi_request_fn
0.000020 sb_bread
0.000020 restore_i387_xstate
0.000020 __qdisc_run
0.000020 pud_alloc
0.000020 pmd_alloc
0.000020 pfn_pte
0.000020 pfifo_fast_enqueue
0.000020 pfifo_fast_dequeue
0.000020 pci_map_page
0.000020 path_get
0.000020 __pagevec_free
0.000020 pagevec_add
0.000020 PageUnevictable
0.000020 page_mapping
0.000020 nv_get_hw_stats
0.000020 number
0.000020 normalize_rt_tasks
0.000020 __netif_tx_lock
0.000020 mk_pid
0.000020 memscan
0.000020 memcpy_c
0.000020 __lru_cache_add
0.000020 __lookup_mnt
0.000020 load_balance_rt
0.000020 kthread_should_stop
0.000020 journal_start
0.000020 journal_remove_journal_head
0.000020 __journal_file_buffer
0.000020 jbd_unlock_bh_journal_head
0.000020 itimer_get_remtime
0.000020 irq_to_desc
0.000020 iowrite32
0.000020 inotify_remove_watch_locked
0.000020 inode_permission
0.000020 inode_has_perm
0.000020 init_timer
0.000020 goal_in_my_reservation
0.000020 get_vma_policy
0.000020 __get_free_pages
0.000020 generic_sync_sb_inodes
0.000020 gart_map_single
0.000020 freezing
0.000020 free_pgtables
0.000020 free_pages_and_swap_cache
0.000020 free_buffer_head
0.000020 __follow_mount
0.000020 flush_tlb_page
0.000020 find_busiest_queue
0.000020 file_has_perm
0.000020 ext3_try_to_allocate
0.000020 ext3_journal_start
0.000020 __ext3_journal_dirty_metadata
0.000020 ext3_file_write
0.000020 enqueue_hrtimer
0.000020 dup_mm
0.000020 do_wait
0.000020 do_vfs_ioctl
0.000020 do_path_lookup
0.000020 do_munmap
0.000020 do_machine_check
0.000020 do_lookup
0.000020 do_follow_link
0.000020 dma_unmap_single
0.000020 __dec_zone_page_state
0.000020 count_vm_event
0.000020 constant_test_bit
0.000020 constant_test_bit
0.000020 compound_head
0.000020 clear_buffer_jbddirty
0.000020 clear_buffer_delay
0.000020 claim_block
0.000020 cascade
0.000020 cancel_dirty_page
0.000020 cache_grow
0.000020 brelse
0.000020 __block_prepare_write
0.000020 __blocking_notifier_call_chain
0.000020 blk_rq_map_sg
0.000020 __bitmap_empty
0.000020 __bitmap_andnot
0.000020 anon_vma_unlink
0.000010 zone_page_state
0.000010 zero_user_segments
0.000010 __xchg
0.000010 __vma_link_rb
0.000010 vma_link
0.000010 vfs_llseek
0.000010 __up_write
0.000010 update_xtime_cache
0.000010 unmap_underlying_metadata
0.000010 unmap_region
0.000010 unix_poll
0.000010 tty_write_room
0.000010 tty_unthrottle
0.000010 tty_ldisc_ref_wait
0.000010 tty_ldisc_ref
0.000010 tty_fasync
0.000010 tty_check_change
0.000010 tty_chars_in_buffer
0.000010 tty_audit_fork
0.000010 truncate_complete_page
0.000010 test_tsk_thread_flag
0.000010 taskstats_exit
0.000010 sys_writev
0.000010 sys_readahead
0.000010 sys_poll
0.000010 sys_newstat
0.000010 sys_nanosleep
0.000010 sys_ioctl
0.000010 syscall_trace_leave
0.000010 sync_supers
0.000010 stub_execve
0.000010 split_page
0.000010 sock_kfree_s
0.000010 __sleep_on_page_lock
0.000010 skip_atoi
0.000010 signal_pending
0.000010 signal_pending
0.000010 sg_init_table
0.000010 set_task_cpu
0.000010 __set_page_dirty
0.000010 SetPageActive
0.000010 set_bit
0.000010 seq_puts
0.000010 selinux_task_setpgid
0.000010 selinux_secctx_to_secid
0.000010 selinux_sb_show_options
0.000010 selinux_inode_permission
0.000010 selinux_inode_need_killpriv
0.000010 selinux_inode_free_security
0.000010 selinux_inode_alloc_security
0.000010 selinux_d_instantiate
0.000010 security_vm_enough_memory
0.000010 second_overflow
0.000010 scsi_run_queue
0.000010 __scsi_put_command
0.000010 scsi_init_sgtable
0.000010 scsi_end_request
0.000010 schedule_tail
0.000010 schedule_delayed_work
0.000010 sb_any_quota_enabled
0.000010 rt_hash
0.000010 round_jiffies_relative
0.000010 remove_hrtimer
0.000010 __remove_hrtimer
0.000010 __remove_from_page_cache
0.000010 rcu_bh_qsctr_inc
0.000010 radix_tree_tag_clear
0.000010 radix_tree_gang_lookup_tag_slot
0.000010 radix_tree_gang_lookup_slot
0.000010 queue_delayed_work
0.000010 qdisc_run
0.000010 put_tty_queue_nolock
0.000010 put_io_context
0.000010 pty_write_room
0.000010 pty_open
0.000010 ptep_set_access_flags
0.000010 profile_munmap
0.000010 proc_pident_lookup
0.000010 proc_get_inode
0.000010 prio_tree_replace
0.000010 prio_tree_remove
0.000010 prio_tree_insert
0.000010 pmd_none_or_clear_bad
0.000010 pipe_release
0.000010 pipe_read
0.000010 pid_revalidate
0.000010 pgd_alloc
0.000010 pci_unmap_single
0.000010 pci_read_config_dword
0.000010 pci_conf1_write
0.000010 pci_bus_read_config_dword
0.000010 path_walk
0.000010 page_zone
0.000010 PageSwapCache
0.000010 PageSwapCache
0.000010 PageSwapCache
0.000010 __page_set_anon_rmap
0.000010 PagePrivate
0.000010 PagePrivate
0.000010 PagePrivate
0.000010 page_add_file_rmap
0.000010 on_each_cpu
0.000010 nv_do_interrupt
0.000010 net_tx_action
0.000010 netif_start_queue
0.000010 netif_carrier_ok
0.000010 need_resched
0.000010 need_iommu
0.000010 native_pte_clear
0.000010 native_io_delay
0.000010 mutex_lock
0.000010 mprotect_fixup
0.000010 mod_zone_page_state
0.000010 mntput_no_expire
0.000010 mm_init
0.000010 mmap_region
0.000010 mempool_free
0.000010 memcmp
0.000010 mcheck_check_cpu
0.000010 may_open
0.000010 __lookup_tag
0.000010 locks_remove_posix
0.000010 locks_remove_flock
0.000010 lock_buffer
0.000010 load_elf_binary
0.000010 load_balance_fair
0.000010 ll_back_merge_fn
0.000010 kzalloc
0.000010 ktime_add_safe
0.000010 kill_fasync
0.000010 __journal_temp_unlink_buffer
0.000010 journal_switch_revoke_table
0.000010 __journal_remove_checkpoint
0.000010 journal_get_write_access
0.000010 journal_get_undo_access
0.000010 journal_get_descriptor_buffer
0.000010 journal_bmap
0.000010 jbd_unlock_bh_state
0.000010 jbd_unlock_bh_state
0.000010 IRQ0xd2_interrupt
0.000010 ip_append_data
0.000010 iov_iter_advance
0.000010 iov_fault_in_pages_read
0.000010 iommu_area_alloc
0.000010 inode_sub_bytes
0.000010 inode_doinit_with_dentry
0.000010 inode_add_bytes
0.000010 __inc_zone_page_state
0.000010 inc_zone_page_state
0.000010 hweight_long
0.000010 hweight64
0.000010 hrtimer_wakeup
0.000010 hrtimer_init
0.000010 hash_64
0.000010 half_md4_transform
0.000010 __grab_cache_page
0.000010 get_user_pages
0.000010 get_signal_to_deliver
0.000010 get_random_int
0.000010 getname
0.000010 get_empty_filp
0.000010 __getblk
0.000010 generic_permission
0.000010 generic_make_request
0.000010 generic_fillattr
0.000010 generic_file_open
0.000010 generic_file_llseek_unlocked
0.000010 generic_file_buffered_write
0.000010 generic_file_aio_write
0.000010 generic_cont_expand_simple
0.000010 generic_block_bmap
0.000010 freezing
0.000010 free_swap_cache
0.000010 free_pid
0.000010 free_pgd_range
0.000010 free_pages
0.000010 flush_old_exec
0.000010 first_online_pgdat
0.000010 find_vma_prepare
0.000010 find_task_by_pid_type_ns
0.000010 find_next_zero_bit
0.000010 find_inode_fast
0.000010 file_remove_suid
0.000010 file_mask_to_av
0.000010 file_free_rcu
0.000010 __FD_CLR
0.000010 ext3_write_begin
0.000010 ext3_try_to_allocate_with_rsv
0.000010 ext3_ordered_write_end
0.000010 ext3_journalled_set_page_dirty
0.000010 ext3_invalidatepage
0.000010 ext3_iget_acl
0.000010 ext3_get_inode_flags
0.000010 ext3_free_data
0.000010 ext3_discard_reservation
0.000010 exit_thread
0.000010 exit_task_namespaces
0.000010 exit_sem
0.000010 end_that_request_last
0.000010 end_buffer_write_sync
0.000010 end_buffer_async_write
0.000010 elv_rb_del
0.000010 elv_queue_empty
0.000010 elv_merged_request
0.000010 elv_completed_request
0.000010 elf_map
0.000010 echo_char
0.000010 e1000_watchdog
0.000010 e1000_read_phy_reg
0.000010 __drain_alien_cache
0.000010 __d_path
0.000010 __down_write_nested
0.000010 __down_write
0.000010 double_rq_lock
0.000010 do_timer
0.000010 do_sys_open
0.000010 do_sigaltstack
0.000010 do_sigaction
0.000010 do_setitimer
0.000010 do_pipe_flags
0.000010 __do_page_cache_readahead
0.000010 do_notify_parent
0.000010 do_filp_open
0.000010 do_exit
0.000010 dnotify_flush
0.000010 d_kill
0.000010 destroy_inode
0.000010 dequeue_signal
0.000010 de_put
0.000010 delayacct_end
0.000010 create_write_pipe
0.000010 create_workqueue_thread
0.000010 __cpus_equal
0.000010 cpu_quiet
0.000010 __cpu_clear
0.000010 __cpu_clear
0.000010 count
0.000010 copy_thread
0.000010 copy_namespaces
0.000010 constant_test_bit
0.000010 constant_test_bit
0.000010 constant_test_bit
0.000010 constant_test_bit
0.000010 constant_test_bit
0.000010 __cond_resched
0.000010 clocksource_forward_now
0.000010 __clear_user
0.000010 clear_inode
0.000010 clear_buffer_new
0.000010 clear_bit
0.000010 clear_bit
0.000010 check_for_bios_corruption
0.000010 __cfq_slice_expired
0.000010 cfq_set_request
0.000010 cfq_dispatch_requests
0.000010 cfq_completed_request
0.000010 cap_set_effective
0.000010 can_share_swap_page
0.000010 bvec_alloc_bs
0.000010 buffer_uptodate
0.000010 buffer_mapped
0.000010 buffer_locked
0.000010 buffer_jbd
0.000010 buffer_jbd
0.000010 brelse
0.000010 __bread
0.000010 blk_invoke_request_fn
0.000010 __blk_complete_request
0.000010 blk_add_trace_generic
0.000010 blk_add_trace_bio
0.000010 bit_spin_lock
0.000010 bio_put
0.000010 bio_alloc_bioset
0.000010 bdi_read_congested
0.000010 balance_runtime
0.000010 balance_dirty_pages_ratelimited_nr
0.000010 audit_log_task_context
0.000010 ata_sff_qc_prep
0.000010 ata_scsi_queuecmd
0.000010 ata_link_max_devices
0.000010 ata_get_xlat_func
0.000010 arp_process
0.000010 arch_pick_mmap_layout
0.000010 arch_irq_stat_cpu
0.000010 arch_dup_task_struct
0.000010 alloc_pid
0.000010 alloc_fdtable
0.000010 alloc_fd
0.000010 add_mm_rss
0.000010 acct_collect

2008-11-17 19:22:17

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Ingo Molnar <[email protected]>
Date: Mon, 17 Nov 2008 12:01:19 +0100

> The scheduler's overhead barely even registers on a 16-way x86 system
> i'm running tbench on. Here's the NMI profile during 64 threads tbench
> on a 16-way x86 box with a v2.6.28-rc5 kernel [config attached]:

Try a non-NMI profile.

It's the whole of the try_to_wake_up() path that's the problem.

2008-11-17 19:32:34

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Ingo Molnar <[email protected]>
Date: Mon, 17 Nov 2008 17:11:35 +0100

> Ouch, +4% from a one-liner networking change? That's a _huge_ speedup
> compared to the things we were after in scheduler land.

The scheduler has accounted for at least 10% of the tbench
regressions at this point; what are you talking about?

2008-11-17 19:31:52

by Eric Dumazet

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Ingo Molnar a écrit :
> * Ingo Molnar <[email protected]> wrote:
>
>> The place for the sock_rfree() hit looks a bit weird, and i'll
>> investigate it now a bit more to place the real overhead point
>> properly. (i already mapped the test-bit overhead: that comes from
>> napi_disable_pending())
>
> ok, here's a new set of profiles. (again for tbench 64-thread on a
> 16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i
> posted before.)
>
> Here are the per major subsystem percentages:
>
> NET overhead ( 5786945/10096751): 57.31%
> security overhead ( 925933/10096751): 9.17%
> usercopy overhead ( 837887/10096751): 8.30%
> sched overhead ( 753662/10096751): 7.46%
> syscall overhead ( 268809/10096751): 2.66%
> IRQ overhead ( 266500/10096751): 2.64%
> slab overhead ( 180258/10096751): 1.79%
> timer overhead ( 92986/10096751): 0.92%
> pagealloc overhead ( 87381/10096751): 0.87%
> VFS overhead ( 53295/10096751): 0.53%
> PID overhead ( 44469/10096751): 0.44%
> pagecache overhead ( 33452/10096751): 0.33%
> gtod overhead ( 11064/10096751): 0.11%
> IDLE overhead ( 0/10096751): 0.00%
> ---------------------------------------------------------
> left ( 753878/10096751): 7.47%
>
> The breakdown is very similar to what i sent before, within noise.
>
> [ 'left' is random overhead from all around the place - i categorized
> the 500 most expensive functions in the profile per subsystem.
> I stopped short of doing it for all 1300+ functions: it's rather
> laborious manual work even with hefty use of regex patterns.
> It's also less meaningful in practice: the trend in the first 500
> functions is present in the remaining 800 functions as well. I
> watched the breakdown evolve as i increased the coverage - in
> practice it is the first 100 functions that matter - it just doesn't
> change after that. ]
>
> The readprofile output below seems structured in a more useful way now
> - i tweaked compiler options to have the profiler hits spread out in a
> more meaningful way. I collected 10 million NMI profiler hits, and
> normalized the readprofile output up to 100%.
>
> [ I'll post per function analysis as i complete them, as a reply to
> this mail. ]
>
> Ingo
>
> 100.000000 total
> ................
> 7.253355 copy_user_generic_string
> 3.934833 avc_has_perm_noaudit

> 3.356152 ip_queue_xmit

> 3.038025 skb_release_data
> 2.118525 skb_release_head_state
> 1.997533 tcp_ack
> 1.833688 tcp_recvmsg

> 1.717771 eth_type_trans
Strange, in my profile eth_type_trans is not in the top 20.
Maybe an alignment problem?
Oh, I understand: you hit the netdevice->last_rx update problem, already corrected in net-next-2.6.
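
Conceptually the fix is something like this (a hypothetical sketch, not
the exact net-next commit): only write the shared field when its value
would actually change, so the cacheline is not dirtied on every packet:

#include <linux/jiffies.h>
#include <linux/netdevice.h>

static inline void note_last_rx(struct net_device *dev)
{
	/* skip the guaranteed-dirty store when jiffies hasn't moved */
	if (dev->last_rx != jiffies)
		dev->last_rx = jiffies;
}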

> 1.673249 __inet_lookup_established
The TCP established/timewait table is now RCUified (for linux-2.6.29),
so this one should go down in profiles.

> 1.508888 system_call

> 1.469183 tcp_current_mss
Yes, there is a divide that might be expensive - see the discussion on netdev.

> 1.431553 tcp_transmit_skb
> 1.385125 tcp_sendmsg
> 1.327643 tcp_v4_rcv
> 1.292328 nf_hook_thresh
> 1.203205 schedule
> 1.059501 nf_hook_slow
> 1.027373 constant_test_bit
> 0.945183 sock_rfree
> 0.922748 __switch_to
> 0.911605 netif_rx
> 0.876270 register_gifconf
> 0.788200 ip_local_deliver_finish
> 0.781467 dev_queue_xmit
> 0.766530 constant_test_bit
> 0.758208 _local_bh_enable_ip
> 0.747184 load_cr3
> 0.704341 memset_c
> 0.671260 sysret_check
> 0.651845 ip_finish_output2
> 0.620204 audit_free_names

2008-11-17 19:36:47

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Ingo Molnar <[email protected]>
Date: Mon, 17 Nov 2008 18:08:44 +0100

> Mike Galbraith has been spending months trying to pin down all the
> issues.

Yes, Mike has been doing tireless good work.

Another thing I noticed is that because all of the scheduler
core operations are now function-pointer callbacks, the
call chain is deeper for core operations like wake_up().

Much of it used to be completely inlined into try_to_wake_up().

The RB-tree stuff adds yet another unavoidable level of
function calls.

wake_up() is usually at the deepest part of the call chain,
so this is a big deal.
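
(schematically - a standalone, hypothetical illustration, not the
actual kernel code - the shape of the indirection looks like this:)

#include <stdio.h>

struct rq;
struct task;

/* every core scheduler operation now goes through a callback table */
struct sched_class {
	void (*enqueue_task)(struct rq *rq, struct task *p);
};

struct task {
	const struct sched_class *sched_class;
};

static void enqueue_task_fair(struct rq *rq, struct task *p)
{
	(void)rq; (void)p;
	/* ... and the RB-tree insertion adds another call level here */
	puts("enqueued via indirect call");
}

static const struct sched_class fair_sched_class = {
	.enqueue_task = enqueue_task_fair,
};

static void wake_up_sketch(struct rq *rq, struct task *p)
{
	/* the indirect call that in-order CPUs pay dearly for */
	p->sched_class->enqueue_task(rq, p);
}

int main(void)
{
	struct task t = { &fair_sched_class };

	wake_up_sketch(NULL, &t);
	return 0;
}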

2008-11-17 19:39:47

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Ingo Molnar <[email protected]>
Date: Mon, 17 Nov 2008 19:49:51 +0100

>
> * Ingo Molnar <[email protected]> wrote:
>
> > The place for the sock_rfree() hit looks a bit weird, and i'll
> > investigate it now a bit more to place the real overhead point
> > properly. (i already mapped the test-bit overhead: that comes from
> > napi_disable_pending())
>
> ok, here's a new set of profiles. (again for tbench 64-thread on a
> 16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i
> posted before.)

Again, do a non-NMI profile and the top (at least for me)
looks like this:

samples % app name symbol name
473 6.3928 vmlinux finish_task_switch
349 4.7169 vmlinux tcp_v4_rcv
327 4.4195 vmlinux U3copy_from_user
322 4.3519 vmlinux tl0_linux32
178 2.4057 vmlinux tcp_ack
170 2.2976 vmlinux tcp_sendmsg
167 2.2571 vmlinux U3copy_to_user

That tcp_v4_rcv() hit is 98% on the wake_up() call it does.

2008-11-17 19:43:59

by Eric Dumazet

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

David Miller a écrit :
> From: Ingo Molnar <[email protected]>
> Date: Mon, 17 Nov 2008 19:49:51 +0100
>
>> * Ingo Molnar <[email protected]> wrote:
>>
>>> The place for the sock_rfree() hit looks a bit weird, and i'll
>>> investigate it now a bit more to place the real overhead point
>>> properly. (i already mapped the test-bit overhead: that comes from
>>> napi_disable_pending())
>> ok, here's a new set of profiles. (again for tbench 64-thread on a
>> 16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i
>> posted before.)
>
> Again, do a non-NMI profile and the top (at least for me)
> looks like this:
>
> samples % app name symbol name
> 473 6.3928 vmlinux finish_task_switch
> 349 4.7169 vmlinux tcp_v4_rcv
> 327 4.4195 vmlinux U3copy_from_user
> 322 4.3519 vmlinux tl0_linux32
> 178 2.4057 vmlinux tcp_ack
> 170 2.2976 vmlinux tcp_sendmsg
> 167 2.2571 vmlinux U3copy_to_user
>
> That tcp_v4_rcv() hit is 98% on the wake_up() call it does.
>
>

Another profile from my tree (net-next-2.6 + some patches), on my machine:


CPU: Core 2, speed 3000.22 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % symbol name
223265 9.2711 __copy_user_zeroing_intel
87525 3.6345 __copy_user_intel
73203 3.0398 tcp_sendmsg
53229 2.2103 netif_rx
53041 2.2025 tcp_recvmsg
47241 1.9617 sysenter_past_esp
42888 1.7809 __copy_from_user_ll
40858 1.6966 tcp_transmit_skb
39390 1.6357 __switch_to
37363 1.5515 dst_release
36823 1.5291 __sk_dst_check_get
36050 1.4970 tcp_v4_rcv
35829 1.4878 __do_softirq
32333 1.3426 tcp_rcv_established
30451 1.2645 tcp_clean_rtx_queue
29758 1.2357 ip_queue_xmit
28497 1.1833 __copy_to_user_ll
28119 1.1676 release_sock
25218 1.0472 lock_sock_nested
23701 0.9842 __inet_lookup_established
23463 0.9743 tcp_ack
22989 0.9546 netif_receive_skb
21880 0.9086 sched_clock_cpu
20730 0.8608 tcp_write_xmit
20372 0.8460 ip_rcv
20336 0.8445 local_bh_enable
19153 0.7953 __update_sched_clock
18603 0.7725 skb_release_data
17020 0.7068 local_bh_enable_ip
16932 0.7031 process_backlog
16299 0.6768 ip_finish_output
16279 0.6760 dev_queue_xmit
15858 0.6585 sock_recvmsg
15641 0.6495 native_read_tsc
15454 0.6417 sock_wfree
15366 0.6381 update_curr
14585 0.6056 sys_socketcall
14564 0.6048 __alloc_skb
14519 0.6029 __tcp_select_window
14417 0.5987 tcp_current_mss
14391 0.5976 nf_iterate
14221 0.5905 page_address
14122 0.5864 local_bh_disable


2008-11-17 19:48:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28



On Mon, 17 Nov 2008, David Miller wrote:
>
> The scheduler has accounted for at least 10% of the tbench
> regressions at this point; what are you talking about?

I'm wondering if you're not looking at totally different issues.

For example, if I recall correctly, David had a big hit on the hrtimers.
And I wonder if perhaps Ingo's numbers are without hrtimers or something?

The other possibility is that it's just a sparc suckiness issue, that
simply doesn't show up on x86.

Linus

2008-11-17 19:49:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28



On Mon, 17 Nov 2008, David Miller wrote:

> From: Ingo Molnar <[email protected]>
> Date: Mon, 17 Nov 2008 12:01:19 +0100
>
> > The scheduler's overhead barely even registers on a 16-way x86 system
> > i'm running tbench on. Here's the NMI profile during 64 threads tbench
> > on a 16-way x86 box with a v2.6.28-rc5 kernel [config attached]:
>
> Try a non-NMI profile.
>
> It's the whole of the try_to_wake_up() path that's the problem.

David, that makes no sense. An NMI profile is going to be a _lot_ more
accurate than a non-NMI one. Asking somebody to do a clearly inferior
profile to get "better numbers" is insane.

We've asked _you_ to do NMI profiling, it shouldn't be the other way
around.

Linus

2008-11-17 19:51:20

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Linus Torvalds <[email protected]>
Date: Mon, 17 Nov 2008 11:47:24 -0800 (PST)

> For example, if I recall correctly, David had a big hit on the hrtimers.

That got fixed; the HRTIMER bits are now disabled.

> The other possibility is that it's just a sparc suckiness issue, that
> simply doesn't show up on x86.

Could be, and I intend to measure that to find out.

2008-11-17 19:53:14

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Linus Torvalds <[email protected]>
Date: Mon, 17 Nov 2008 11:48:33 -0800 (PST)

> We've asked _you_ to do NMI profiling, it shouldn't be the other way
> around.

I wasn't able to on these systems, so instead I did cycle-level
evaluation of the parts that have to run with interrupts disabled.

And as a result I found that wake_up() is now 4 times slower than it
was in 2.6.22; I analyzed this for every single kernel release up to
now.

It could be a sparc-specific issue, because the call chain is deeper
and we eat a lot more register window spills onto the stack.

2008-11-17 19:53:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Linus Torvalds <[email protected]> wrote:

> On Mon, 17 Nov 2008, David Miller wrote:
> >
> > The scheduler has accounted for at least 10% of the tbench
> > regressions at this point; what are you talking about?
>
> I'm wondering if you're not looking at totally different issues.
>
> For example, if I recall correctly, David had a big hit on the
> hrtimers. And I wonder if perhaps Ingo's numbers are without
> hrtimers or something?

hrtimers should not be an issue anymore since this commit:

| commit 0c4b83da58ec2e96ce9c44c211d6eac5f9dae478
| Author: Ingo Molnar <[email protected]>
| Date: Mon Oct 20 14:27:43 2008 +0200
|
| sched: disable the hrtick for now
|
| David Miller reported that hrtick update overhead has tripled the
| wakeup overhead on Sparc64.
|
| That is too much - disable the HRTICK feature for now by default,
| until a faster implementation is found.
|
| Reported-by: David Miller <[email protected]>
| Acked-by: Peter Zijlstra <[email protected]>
| Signed-off-by: Ingo Molnar <[email protected]>

Which was included in v2.6.28-rc1 already.

Ingo

2008-11-17 19:56:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28



On Mon, 17 Nov 2008, David Miller wrote:
>
> Again, do a non-NMI profile and the top (at least for me)
> looks like this:

Can _you_ please do an NMI profile and see what your real problem is?

I can't imagine that Niagara (or whatever) is so weak that it can't do
NMIs.

The fact is, David, that Ingo just posted a profile that was _better_ than
anything you have ever posted, and it doesn't show what you complain
about. So he's not seeing it. Asking him to do a _stupid_ profile is just
that: stupid.

So try to figure out why his (better) profile doesn't match your
(inferior) one, instead of asking him to do stupid things. It's some
difference in architectures, likely: maybe the sparc timekeeping is crap,
maybe it's a cache issue and sparc caches are crap, maybe it's something
where Niagara (is it niagara) has some oddness that shows up because it
has that odd four-threads+four-cores or whatever.

Linus

2008-11-17 19:58:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28



On Mon, 17 Nov 2008, David Miller wrote:
>
> And as a result I found that wake_up() is now 4 times slower than it
> was in 2.6.22; I analyzed this for every single kernel release up to
> now.

..and that's the one where you then pointed to hrtimers, and now you claim
that was fixed?

At least I haven't seen any new analysis since then.

> It could be a sparc-specific issue, because the call chain is deeper
> and we eat a lot more register window spills onto the stack.

Oh, easily. In-order machines tend to have serious problems with indirect
function calls in particular. x86, in contrast, tends to not even notice,
especially if the indirect function is fairly static per call-site, and
predicts well.

There is a reason my machine is 15-20 times faster than yours.

Linus

2008-11-17 19:58:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


> [ I'll post per function analysis as i complete them, as a reply to
> this mail. ]

[ i'll do a separate mail for every function analyzed, the discussion
spreads better that way. ]

> 100.000000 total
> ................
> 7.253355 copy_user_generic_string

This is the well-known pattern of user-copy overhead, which centers
around this single REP MOVS instruction:

nr-of-hits
.........
ffffffff80341eea: 42 83 e2 07 and $0x7,%edx
ffffffff80341eed: 677398 f3 48 a5 rep movsq %ds:(%rsi),%es:(%rdi)
ffffffff80341ef0: 3642 89 d1 mov %edx,%ecx
ffffffff80341ef2: 16260 f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
ffffffff80341ef4: 6554 31 c0 xor %eax,%eax
ffffffff80341ef6: 1958 c3 retq
ffffffff80341ef7: 0 90 nop
ffffffff80341ef8: 0 90 nop

That's to be expected - tbench shuffles 3.5 GB/sec of effective data
to/from sockets. That's 7.5 GB/sec due to the double-copy. So for
every 64 bytes of data transferred we spend 1.4 CPU cycles in this
specific function - that is OK-ish.

Ingo

2008-11-17 20:16:54

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Linus Torvalds <[email protected]>
Date: Mon, 17 Nov 2008 11:55:35 -0800 (PST)

> So try to figure out why his (better) profile doesn't match your
> (inferior) one, instead of asking him to do stupid things. It's some
> difference in architectures, likely: maybe the sparc timekeeping is crap,
> maybe it's a cache issue and sparc caches are crap, maybe it's something
> where Niagara (is it niagara) has some oddness that shows up because it
> has that odd four-threads+four-cores or whatever.

It's on my workstation which is a much simpler 2 processor
UltraSPARC-IIIi (1.5Ghz) system.

And yes I will investigate, it's all I've been doing in my
spare time these past few weeks.

2008-11-17 20:18:54

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Linus Torvalds <[email protected]>
Date: Mon, 17 Nov 2008 11:57:55 -0800 (PST)

> On Mon, 17 Nov 2008, David Miller wrote:
> > And as a result I found that wake_up() is now 4 times slower than it
> > was in 2.6.22, I even analyzed this for every single kernel release
> > till now.
>
> ..and that's the one where you then pointed to hrtimers, and now you claim
> that was fixed?

That was a huge increase going from 2.6.26 to 2.6.27, and has
been fixed.

The rest of the gradual release-to-release cost increase, however,
remains.

> At least I haven't seen any new analysis since then.

I will find time to make it after I get back from Portland.

2008-11-17 20:21:14

by Ingo Molnar

[permalink] [raw]
Subject: (avc_has_perm_noaudit()) Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> 100.000000 total
> ................
> 3.934833 avc_has_perm_noaudit

this one seems spread out:

hits (total: 393483 hits)
.........
ffffffff80312af3: 1426 <avc_has_perm_noaudit>:
ffffffff80312af3: 1426 41 57 push %r15
ffffffff80312af5: 6124 41 56 push %r14
ffffffff80312af7: 0 41 55 push %r13
ffffffff80312af9: 1443 41 89 f5 mov %esi,%r13d
ffffffff80312afc: 1577 41 54 push %r12
ffffffff80312afe: 0 41 89 fc mov %edi,%r12d
ffffffff80312b01: 1310 55 push %rbp
ffffffff80312b02: 1531 53 push %rbx
ffffffff80312b03: 3 48 83 ec 68 sub $0x68,%rsp
ffffffff80312b07: 2202 85 c9 test %ecx,%ecx
ffffffff80312b09: 0 89 4c 24 0c mov %ecx,0xc(%rsp)
ffffffff80312b0d: 550 44 89 44 24 08 mov %r8d,0x8(%rsp)
ffffffff80312b12: 1572 4c 89 0c 24 mov %r9,(%rsp)
ffffffff80312b16: 0 66 89 54 24 12 mov %dx,0x12(%rsp)
ffffffff80312b1b: 588 75 04 jne ffffffff80312b21 <avc_has_perm_noaudit+0x2e>
ffffffff80312b1d: 0 0f 0b ud2a
ffffffff80312b1f: 0 eb fe jmp ffffffff80312b1f <avc_has_perm_noaudit+0x2c>
ffffffff80312b21: 1646 0f b7 44 24 12 movzwl 0x12(%rsp),%eax
ffffffff80312b26: 829 48 c7 c2 d0 26 93 80 mov $0xffffffff809326d0,%rdx
ffffffff80312b2d: 589 89 44 24 14 mov %eax,0x14(%rsp)
ffffffff80312b31: 698 65 8b 04 25 24 00 00 mov %gs:0x24,%eax
ffffffff80312b38: 0 00
ffffffff80312b39: 791 89 c0 mov %eax,%eax
ffffffff80312b3b: 549 48 c1 e0 03 shl $0x3,%rax
ffffffff80312b3f: 791 48 03 05 fa 30 5a 00 add 0x5a30fa(%rip),%rax # ffffffff808b5c40 <_cpu_pda>
ffffffff80312b46: 864 48 8b 00 mov (%rax),%rax
ffffffff80312b49: 533 48 03 50 08 add 0x8(%rax),%rdx
ffffffff80312b4d: 732 ff 02 incl (%rdx)
ffffffff80312b4f: 860 8b 54 24 14 mov 0x14(%rsp),%edx
ffffffff80312b53: 1259 e8 54 fc ff ff callq ffffffff803127ac <avc_hash>
ffffffff80312b58: 2087 48 98 cltq
ffffffff80312b5a: 1015 48 89 44 24 18 mov %rax,0x18(%rsp)
ffffffff80312b5f: 0 48 c1 e0 04 shl $0x4,%rax
ffffffff80312b63: 2944 4c 8d b8 60 6b a9 80 lea -0x7f5694a0(%rax),%r15
ffffffff80312b6a: 71 48 8b 80 60 6b a9 80 mov -0x7f5694a0(%rax),%rax
ffffffff80312b71: 3943 eb 1a jmp ffffffff80312b8d <avc_has_perm_noaudit+0x9a>
ffffffff80312b73: 5184 44 3b 23 cmp (%rbx),%r12d
ffffffff80312b76: 62007 75 11 jne ffffffff80312b89 <avc_has_perm_noaudit+0x96>
ffffffff80312b78: 11 66 8b 44 24 12 mov 0x12(%rsp),%ax
ffffffff80312b7d: 0 66 3b 43 08 cmp 0x8(%rbx),%ax
ffffffff80312b81: 11115 75 06 jne ffffffff80312b89 <avc_has_perm_noaudit+0x96>
ffffffff80312b83: 4 44 3b 6b 04 cmp 0x4(%rbx),%r13d
ffffffff80312b87: 14224 74 1a je ffffffff80312ba3 <avc_has_perm_noaudit+0xb0>
ffffffff80312b89: 1 48 8b 43 28 mov 0x28(%rbx),%rax
ffffffff80312b8d: 6921 48 8d 58 d8 lea -0x28(%rax),%rbx
ffffffff80312b91: 9654 48 8b 43 28 mov 0x28(%rbx),%rax
ffffffff80312b95: 414 0f 18 08 prefetcht0 (%rax)
ffffffff80312b98: 227 48 8d 43 28 lea 0x28(%rbx),%rax
ffffffff80312b9c: 9617 4c 39 f8 cmp %r15,%rax
ffffffff80312b9f: 1402 75 d2 jne ffffffff80312b73 <avc_has_perm_noaudit+0x80>
ffffffff80312ba1: 0 eb 41 jmp ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312ba3: 0 83 7b 20 01 cmpl $0x1,0x20(%rbx)
ffffffff80312ba7: 671 0f 84 70 02 00 00 je ffffffff80312e1d <avc_has_perm_noaudit+0x32a>
ffffffff80312bad: 0 c7 43 20 01 00 00 00 movl $0x1,0x20(%rbx)
ffffffff80312bb4: 0 e9 64 02 00 00 jmpq ffffffff80312e1d <avc_has_perm_noaudit+0x32a>
ffffffff80312bb9: 2118 65 8b 14 25 24 00 00 mov %gs:0x24,%edx
ffffffff80312bc0: 0 00
ffffffff80312bc1: 8245 89 d2 mov %edx,%edx
ffffffff80312bc3: 0 48 c7 c0 d0 26 93 80 mov $0xffffffff809326d0,%rax
ffffffff80312bca: 511 48 c1 e2 03 shl $0x3,%rdx
ffffffff80312bce: 11308 48 03 15 6b 30 5a 00 add 0x5a306b(%rip),%rdx # ffffffff808b5c40 <_cpu_pda>
ffffffff80312bd5: 0 48 8b 12 mov (%rdx),%rdx
ffffffff80312bd8: 35 48 03 42 08 add 0x8(%rdx),%rax
ffffffff80312bdc: 2224 ff 40 04 incl 0x4(%rax)
ffffffff80312bdf: 1 e9 06 01 00 00 jmpq ffffffff80312cea <avc_has_perm_noaudit+0x1f7>
ffffffff80312be4: 0 65 8b 14 25 24 00 00 mov %gs:0x24,%edx
ffffffff80312beb: 0 00
ffffffff80312bec: 0 89 d2 mov %edx,%edx
ffffffff80312bee: 0 48 c7 c0 d0 26 93 80 mov $0xffffffff809326d0,%rax
ffffffff80312bf5: 0 48 8d 6c 24 30 lea 0x30(%rsp),%rbp
ffffffff80312bfa: 0 48 c1 e2 03 shl $0x3,%rdx
ffffffff80312bfe: 0 48 03 15 3b 30 5a 00 add 0x5a303b(%rip),%rdx # ffffffff808b5c40 <_cpu_pda>
ffffffff80312c05: 0 44 89 ee mov %r13d,%esi
ffffffff80312c08: 0 4c 8d 45 0c lea 0xc(%rbp),%r8
ffffffff80312c0c: 0 44 89 e7 mov %r12d,%edi
ffffffff80312c0f: 0 48 8b 12 mov (%rdx),%rdx
ffffffff80312c12: 0 48 03 42 08 add 0x8(%rdx),%rax
ffffffff80312c16: 0 ff 40 08 incl 0x8(%rax)
ffffffff80312c19: 0 8b 4c 24 0c mov 0xc(%rsp),%ecx
ffffffff80312c1d: 0 8b 54 24 14 mov 0x14(%rsp),%edx
ffffffff80312c21: 0 e8 ee 0a 01 00 callq ffffffff80323714 <security_compute_av>
ffffffff80312c26: 0 85 c0 test %eax,%eax
ffffffff80312c28: 0 41 89 c6 mov %eax,%r14d
ffffffff80312c2b: 0 0f 85 02 02 00 00 jne ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312c31: 0 8b 7c 24 4c mov 0x4c(%rsp),%edi
ffffffff80312c35: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff80312c3a: 0 e8 a5 fb ff ff callq ffffffff803127e4 <avc_latest_notif_update>
ffffffff80312c3f: 0 85 c0 test %eax,%eax
ffffffff80312c41: 0 0f 85 9c 00 00 00 jne ffffffff80312ce3 <avc_has_perm_noaudit+0x1f0>
ffffffff80312c47: 0 e8 23 fd ff ff callq ffffffff8031296f <avc_alloc_node>
ffffffff80312c4c: 0 48 85 c0 test %rax,%rax
ffffffff80312c4f: 0 48 89 c3 mov %rax,%rbx
ffffffff80312c52: 0 0f 84 8b 00 00 00 je ffffffff80312ce3 <avc_has_perm_noaudit+0x1f0>
ffffffff80312c58: 0 8b 4c 24 14 mov 0x14(%rsp),%ecx
ffffffff80312c5c: 0 49 89 e8 mov %rbp,%r8
ffffffff80312c5f: 0 44 89 e6 mov %r12d,%esi
ffffffff80312c62: 0 48 89 c7 mov %rax,%rdi
ffffffff80312c65: 0 44 89 ea mov %r13d,%edx
ffffffff80312c68: 0 e8 5d fb ff ff callq ffffffff803127ca <avc_node_populate>
ffffffff80312c6d: 0 48 8b 44 24 18 mov 0x18(%rsp),%rax
ffffffff80312c72: 0 48 8d 2c 85 60 8b a9 lea -0x7f5674a0(,%rax,4),%rbp
ffffffff80312c79: 0 80
ffffffff80312c7a: 0 48 89 ef mov %rbp,%rdi
ffffffff80312c7d: 0 e8 44 3c 20 00 callq ffffffff805168c6 <_spin_lock_irqsave>
ffffffff80312c82: 0 49 8b 37 mov (%r15),%rsi
ffffffff80312c85: 0 49 89 c6 mov %rax,%r14
ffffffff80312c88: 0 eb 24 jmp ffffffff80312cae <avc_has_perm_noaudit+0x1bb>
ffffffff80312c8a: 0 44 39 26 cmp %r12d,(%rsi)
ffffffff80312c8d: 0 75 1b jne ffffffff80312caa <avc_has_perm_noaudit+0x1b7>
ffffffff80312c8f: 0 44 39 6e 04 cmp %r13d,0x4(%rsi)
ffffffff80312c93: 0 75 15 jne ffffffff80312caa <avc_has_perm_noaudit+0x1b7>
ffffffff80312c95: 0 66 8b 44 24 12 mov 0x12(%rsp),%ax
ffffffff80312c9a: 0 66 39 46 08 cmp %ax,0x8(%rsi)
ffffffff80312c9e: 0 75 0a jne ffffffff80312caa <avc_has_perm_noaudit+0x1b7>
ffffffff80312ca0: 0 48 89 df mov %rbx,%rdi
ffffffff80312ca3: 0 e8 9e fb ff ff callq ffffffff80312846 <avc_node_replace>
ffffffff80312ca8: 0 eb 2c jmp ffffffff80312cd6 <avc_has_perm_noaudit+0x1e3>
ffffffff80312caa: 0 48 8b 76 28 mov 0x28(%rsi),%rsi
ffffffff80312cae: 0 48 83 ee 28 sub $0x28,%rsi
ffffffff80312cb2: 0 48 8b 56 28 mov 0x28(%rsi),%rdx
ffffffff80312cb6: 0 48 8d 46 28 lea 0x28(%rsi),%rax
ffffffff80312cba: 0 4c 39 f8 cmp %r15,%rax
ffffffff80312cbd: 0 0f 18 0a prefetcht0 (%rdx)
ffffffff80312cc0: 0 75 c8 jne ffffffff80312c8a <avc_has_perm_noaudit+0x197>
ffffffff80312cc2: 0 48 8d 43 28 lea 0x28(%rbx),%rax
ffffffff80312cc6: 0 48 89 53 28 mov %rdx,0x28(%rbx)
ffffffff80312cca: 0 4c 89 78 08 mov %r15,0x8(%rax)
ffffffff80312cce: 0 48 89 46 28 mov %rax,0x28(%rsi)
ffffffff80312cd2: 0 48 89 42 08 mov %rax,0x8(%rdx)
ffffffff80312cd6: 0 4c 89 f6 mov %r14,%rsi
ffffffff80312cd9: 0 48 89 ef mov %rbp,%rdi
ffffffff80312cdc: 0 e8 20 3d 20 00 callq ffffffff80516a01 <_spin_unlock_irqrestore>
ffffffff80312ce1: 0 eb 07 jmp ffffffff80312cea <avc_has_perm_noaudit+0x1f7>
ffffffff80312ce3: 0 48 8d 44 24 30 lea 0x30(%rsp),%rax
ffffffff80312ce8: 0 eb 06 jmp ffffffff80312cf0 <avc_has_perm_noaudit+0x1fd>
ffffffff80312cea: 2116 48 89 d8 mov %rbx,%rax
ffffffff80312ced: 7632 45 31 f6 xor %r14d,%r14d
ffffffff80312cf0: 1 48 83 3c 24 00 cmpq $0x0,(%rsp)
ffffffff80312cf5: 404 74 10 je ffffffff80312d07 <avc_has_perm_noaudit+0x214>
ffffffff80312cf7: 1804 48 8d 70 0c lea 0xc(%rax),%rsi
ffffffff80312cfb: 0 b9 05 00 00 00 mov $0x5,%ecx
ffffffff80312d00: 378 48 8b 3c 24 mov (%rsp),%rdi
ffffffff80312d04: 8174 fc cld
ffffffff80312d05: 26860 f3 a5 rep movsl %ds:(%rsi),%es:(%rdi)
ffffffff80312d07: 11573 8b 40 0c mov 0xc(%rax),%eax
ffffffff80312d0a: 1997 f7 d0 not %eax
ffffffff80312d0c: 0 85 44 24 0c test %eax,0xc(%rsp)
ffffffff80312d10: 0 0f 84 1d 01 00 00 je ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312d16: 0 f6 44 24 08 01 testb $0x1,0x8(%rsp)
ffffffff80312d1b: 0 0f 85 f4 00 00 00 jne ffffffff80312e15 <avc_has_perm_noaudit+0x322>
ffffffff80312d21: 0 83 3d 5c 66 78 00 00 cmpl $0x0,0x78665c(%rip) # ffffffff80a99384 <selinux_enforcing>
ffffffff80312d28: 0 74 10 je ffffffff80312d3a <avc_has_perm_noaudit+0x247>
ffffffff80312d2a: 0 44 89 e7 mov %r12d,%edi
ffffffff80312d2d: 0 e8 87 f9 00 00 callq ffffffff803226b9 <security_permissive_sid>
ffffffff80312d32: 0 85 c0 test %eax,%eax
ffffffff80312d34: 0 0f 84 db 00 00 00 je ffffffff80312e15 <avc_has_perm_noaudit+0x322>
ffffffff80312d3a: 0 e8 30 fc ff ff callq ffffffff8031296f <avc_alloc_node>
ffffffff80312d3f: 0 48 85 c0 test %rax,%rax
ffffffff80312d42: 0 48 89 c5 mov %rax,%rbp
ffffffff80312d45: 0 0f 84 e8 00 00 00 je ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312d4b: 0 48 8b 44 24 18 mov 0x18(%rsp),%rax
ffffffff80312d50: 0 48 8d 04 85 60 8b a9 lea -0x7f5674a0(,%rax,4),%rax
ffffffff80312d57: 0 80
ffffffff80312d58: 0 48 89 c7 mov %rax,%rdi
ffffffff80312d5b: 0 48 89 44 24 28 mov %rax,0x28(%rsp)
ffffffff80312d60: 0 e8 61 3b 20 00 callq ffffffff805168c6 <_spin_lock_irqsave>
ffffffff80312d65: 0 49 8b 1f mov (%r15),%rbx
ffffffff80312d68: 0 48 89 44 24 20 mov %rax,0x20(%rsp)
ffffffff80312d6d: 0 eb 1a jmp ffffffff80312d89 <avc_has_perm_noaudit+0x296>
ffffffff80312d6f: 0 44 3b 23 cmp (%rbx),%r12d
ffffffff80312d72: 0 75 11 jne ffffffff80312d85 <avc_has_perm_noaudit+0x292>
ffffffff80312d74: 0 44 3b 6b 04 cmp 0x4(%rbx),%r13d
ffffffff80312d78: 0 75 0b jne ffffffff80312d85 <avc_has_perm_noaudit+0x292>
ffffffff80312d7a: 0 66 8b 44 24 12 mov 0x12(%rsp),%ax
ffffffff80312d7f: 0 66 3b 43 08 cmp 0x8(%rbx),%ax
ffffffff80312d83: 0 74 1a je ffffffff80312d9f <avc_has_perm_noaudit+0x2ac>
ffffffff80312d85: 0 48 8b 5b 28 mov 0x28(%rbx),%rbx
ffffffff80312d89: 0 48 83 eb 28 sub $0x28,%rbx
ffffffff80312d8d: 0 48 8b 43 28 mov 0x28(%rbx),%rax
ffffffff80312d91: 0 0f 18 08 prefetcht0 (%rax)
ffffffff80312d94: 0 48 8d 43 28 lea 0x28(%rbx),%rax
ffffffff80312d98: 0 4c 39 f8 cmp %r15,%rax
ffffffff80312d9b: 0 75 d2 jne ffffffff80312d6f <avc_has_perm_noaudit+0x27c>
ffffffff80312d9d: 0 eb 29 jmp ffffffff80312dc8 <avc_has_perm_noaudit+0x2d5>
ffffffff80312d9f: 0 8b 4c 24 14 mov 0x14(%rsp),%ecx
ffffffff80312da3: 0 44 89 e6 mov %r12d,%esi
ffffffff80312da6: 0 48 89 ef mov %rbp,%rdi
ffffffff80312da9: 0 49 89 d8 mov %rbx,%r8
ffffffff80312dac: 0 44 89 ea mov %r13d,%edx
ffffffff80312daf: 0 e8 16 fa ff ff callq ffffffff803127ca <avc_node_populate>
ffffffff80312db4: 0 8b 44 24 0c mov 0xc(%rsp),%eax
ffffffff80312db8: 0 09 45 0c or %eax,0xc(%rbp)
ffffffff80312dbb: 0 48 89 de mov %rbx,%rsi
ffffffff80312dbe: 0 48 89 ef mov %rbp,%rdi
ffffffff80312dc1: 0 e8 80 fa ff ff callq ffffffff80312846 <avc_node_replace>
ffffffff80312dc6: 0 eb 3c jmp ffffffff80312e04 <avc_has_perm_noaudit+0x311>
ffffffff80312dc8: 0 48 8b 3d a9 65 78 00 mov 0x7865a9(%rip),%rdi # ffffffff80a99378 <avc_node_cachep>
ffffffff80312dcf: 0 48 89 ee mov %rbp,%rsi
ffffffff80312dd2: 0 e8 7b c6 f7 ff callq ffffffff8028f452 <kmem_cache_free>
ffffffff80312dd7: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax
ffffffff80312dde: 0 00
ffffffff80312ddf: 0 89 c0 mov %eax,%eax
ffffffff80312de1: 0 48 c7 c2 d0 26 93 80 mov $0xffffffff809326d0,%rdx
ffffffff80312de8: 0 48 c1 e0 03 shl $0x3,%rax
ffffffff80312dec: 0 48 03 05 4d 2e 5a 00 add 0x5a2e4d(%rip),%rax # ffffffff808b5c40 <_cpu_pda>
ffffffff80312df3: 0 48 8b 00 mov (%rax),%rax
ffffffff80312df6: 0 48 03 50 08 add 0x8(%rax),%rdx
ffffffff80312dfa: 0 ff 42 14 incl 0x14(%rdx)
ffffffff80312dfd: 0 f0 ff 0d 60 65 78 00 lock decl 0x786560(%rip) # ffffffff80a99364 <avc_cache+0x2804>
ffffffff80312e04: 0 48 8b 74 24 20 mov 0x20(%rsp),%rsi
ffffffff80312e09: 0 48 8b 7c 24 28 mov 0x28(%rsp),%rdi
ffffffff80312e0e: 0 e8 ee 3b 20 00 callq ffffffff80516a01 <_spin_unlock_irqrestore>
ffffffff80312e13: 0 eb 1e jmp ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312e15: 0 41 be f3 ff ff ff mov $0xfffffff3,%r14d
ffffffff80312e1b: 0 eb 16 jmp ffffffff80312e33 <avc_has_perm_noaudit+0x340>
ffffffff80312e1d: 35502 8b 44 24 0c mov 0xc(%rsp),%eax
ffffffff80312e21: 4360 23 43 10 and 0x10(%rbx),%eax
ffffffff80312e24: 0 3b 44 24 0c cmp 0xc(%rsp),%eax
ffffffff80312e28: 0 0f 85 b6 fd ff ff jne ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312e2e: 104641 e9 86 fd ff ff jmpq ffffffff80312bb9 <avc_has_perm_noaudit+0xc6>
ffffffff80312e33: 2106 48 83 c4 68 add $0x68,%rsp
ffffffff80312e37: 1 44 89 f0 mov %r14d,%eax
ffffffff80312e3a: 2068 5b pop %rbx
ffffffff80312e3b: 0 5d pop %rbp
ffffffff80312e3c: 8 41 5c pop %r12
ffffffff80312e3e: 2001 41 5d pop %r13
ffffffff80312e40: 0 41 5e pop %r14
ffffffff80312e42: 162 41 5f pop %r15
ffffffff80312e44: 2107 c3 retq

its main callsite is:

ffffffff8031368c: 2809 <avc_has_perm>:
[...]
ffffffff803136b6: 651 e8 38 f4 ff ff callq ffffffff80312af3 <avc_has_perm_noaudit>

avc_has_perm() usage is spread out amongst 3 callsites in 2 selinux
functions:

selinux_ip_postroute():
ffffffff80314d02: 491 e8 85 e9 ff ff callq ffffffff8031368c <avc_has_perm>

selinux_socket_sock_rcv_skb():
ffffffff80314eea: 461 e8 9d e7 ff ff callq ffffffff8031368c <avc_has_perm>
ffffffff80314faf: 476 e8 d8 e6 ff ff callq ffffffff8031368c <avc_has_perm>

both related to networking.

regarding avc_has_perm_noaudit() itself, it has a couple of hot spots:

ffffffff80312b73: 5184 44 3b 23 cmp (%rbx),%r12d
ffffffff80312b76: 62007 75 11 jne ffffffff80312b89 <avc_has_perm_noaudit+0x96>

quick guess: cache-cold-miss site.

ffffffff80312d04: 8174 fc cld
ffffffff80312d05: 26860 f3 a5 rep movsl %ds:(%rsi),%es:(%rdi)

quick guess: unnecessary copying of something largish via memcpy (the
rep movsl above). Probably:

security/selinux/avc.c:avc_has_perm_noaudit()'s:
[...]
if (avd)
memcpy(avd, &p_ae->avd, sizeof(*avd));

but one of the fattest ones:

ffffffff80312e28: 0 0f 85 b6 fd ff ff jne ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312e2e: 104641 e9 86 fd ff ff jmpq ffffffff80312bb9 <avc_has_perm_noaudit+0xc6>
ffffffff80312e33: 2106 48 83 c4 68 add $0x68,%rsp

that seems to be either a branch mispredict (seems a tad expensive for
that though), or a cachemiss delayed to the first non-predicted
branch. Ah, that's most likely the case, we fall through straight from
here:

ffffffff80312dfd: 0 f0 ff 0d 60 65 78 00 lock decl 0x786560(%rip)

that's an atomic op on some global address, in the hotpath. Not good.

the wider context is:

ffffffff80312e1d: 35502 8b 44 24 0c mov 0xc(%rsp),%eax
ffffffff80312e21: 4360 23 43 10 and 0x10(%rbx),%eax
ffffffff80312e24: 0 3b 44 24 0c cmp 0xc(%rsp),%eax
ffffffff80312e28: 0 0f 85 b6 fd ff ff jne ffffffff80312be4 <avc_has_perm_noaudit+0xf1>
ffffffff80312e2e: 104641 e9 86 fd ff ff jmpq ffffffff80312bb9 <avc_has_perm_noaudit+0xc6>
ffffffff80312e33: 2106 48 83 c4 68 add $0x68,%rsp

ah, yes. My guess is that the "and (%rbx)" at ffffffff80312e21
generated this miss, and all of this is avc_update_node()'s
for-each-list loop, and:

spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flag);

that hash doesn't seem to be working well here. It's done via:

static inline int avc_hash(u32 ssid, u32 tsid, u16 tclass)
{
return (ssid ^ (tsid<<2) ^ (tclass<<4)) & (AVC_CACHE_SLOTS - 1);
}

AVC_CACHE_SLOTS is 512 - but my usecase likely has a much narrower
hash key space than that. Increasing the hash size won't work; these
kinds of things really only start scaling once some natural per-CPU
construct is found for them.
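
As a quick illustration of how narrow that key space can be, here is a
userspace sketch (the hash is copied from above; the SID/class ranges
are made-up assumptions) that counts how many of the 512 slots a small
workload actually touches:

#include <stdio.h>
#include <stdint.h>

#define AVC_CACHE_SLOTS 512

static int avc_hash(uint32_t ssid, uint32_t tsid, uint16_t tclass)
{
        return (ssid ^ (tsid << 2) ^ (tclass << 4)) & (AVC_CACHE_SLOTS - 1);
}

int main(void)
{
        int used[AVC_CACHE_SLOTS] = { 0 };
        int i, slots = 0;

        /* assumption: the workload only ever sees a handful of SIDs/classes */
        for (uint32_t ssid = 1; ssid <= 8; ssid++)
                for (uint32_t tsid = 1; tsid <= 8; tsid++)
                        for (uint16_t tclass = 1; tclass <= 4; tclass++)
                                used[avc_hash(ssid, tsid, tclass)]++;

        for (i = 0; i < AVC_CACHE_SLOTS; i++)
                slots += !!used[i];

        printf("%d of %d slots used\n", slots, AVC_CACHE_SLOTS);
        return 0;
}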

And things like this:

/* cache hit */
if (atomic_read(&ret->ae.used) != 1)
atomic_set(&ret->ae.used, 1);

in avc_search_node() don't really help either, as they immediately
dirty the cacheline in the cache-hit case. A hashed fastpath lookup
really should only be used to validate security rules in a read-mostly
way, and cachelines should never be dirtied, as long as that can be
avoided.
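
To make the read-mostly point concrete, here is a sketch (not the
actual avc code; "per CPU" is emulated with a per-thread array) of a
hit path that issues no stores at all:

#include <stddef.h>
#include <stdint.h>

#define SLOTS 512

struct avc_entry {
        uint32_t ssid, tsid;
        uint16_t tclass;
        uint32_t allowed;
};

/* one private table per thread/CPU: no cross-CPU cacheline traffic */
static __thread struct avc_entry cache[SLOTS];

static const struct avc_entry *avc_lookup(uint32_t ssid, uint32_t tsid,
                                          uint16_t tclass)
{
        const struct avc_entry *e =
                &cache[(ssid ^ (tsid << 2) ^ (tclass << 4)) & (SLOTS - 1)];

        /* hit path: pure loads - no used-flag update, nothing dirtied */
        if (e->ssid == ssid && e->tsid == tsid && e->tclass == tclass)
                return e;
        return NULL;    /* miss: take the slow path and refill this slot */
}

The hard part of any such scheme is of course invalidation - a policy
change would have to reach every per-CPU copy.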

Anyway, this function needs a good scalability look, as it represents
3.9% of the total tbench cost. I'd not be surprised if it were
possible to eliminate more than half of that cost via not-too-ugly
changes.

Ingo

2008-11-17 20:30:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28



On Mon, 17 Nov 2008, David Miller wrote:
>
> It's on my workstation which is a much simpler 2 processor
> UltraSPARC-IIIi (1.5Ghz) system.

Ok. It could easily be something like a cache footprint issue. And while I
don't know my sparc CPUs very well, I think the UltraSPARC-IIIi is
superscalar but does no out-of-order execution or speculation, no? So I
could easily see that the indirect branches in the scheduler hurt much
more, and might explain why the x86 profile looks so different.

One thing that non-NMI profiles also tend to show is "clumping", which in
turn tends to rather excessively pinpoint code sequences that release the
irq flag - just because those points show up in profiles, rather than
being a spread-out-mush. So it's possible that Ingo's profile did show the
scheduler more, but it was in the form of much more spread out "noise"
rather than the single spike you saw.

Linus

2008-11-17 20:32:58

by Ingo Molnar

[permalink] [raw]
Subject: ip_queue_xmit(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> 100.000000 total
> ................
> 3.356152 ip_queue_xmit

hits (335615 total)
.........
ffffffff804b7045: 1001 <ip_queue_xmit>:
ffffffff804b7045: 1001 41 57 push %r15
ffffffff804b7047: 36698 41 56 push %r14
ffffffff804b7049: 0 49 89 fe mov %rdi,%r14
ffffffff804b704c: 0 41 55 push %r13
ffffffff804b704e: 447 41 54 push %r12
ffffffff804b7050: 0 55 push %rbp
ffffffff804b7051: 4 53 push %rbx
ffffffff804b7052: 465 48 83 ec 68 sub $0x68,%rsp
ffffffff804b7056: 1 89 74 24 08 mov %esi,0x8(%rsp)
ffffffff804b705a: 486 48 8b 47 28 mov 0x28(%rdi),%rax
ffffffff804b705e: 0 48 8b 6f 10 mov 0x10(%rdi),%rbp
ffffffff804b7062: 7 48 85 c0 test %rax,%rax
ffffffff804b7065: 480 48 89 44 24 58 mov %rax,0x58(%rsp)
ffffffff804b706a: 0 4c 8b bd 48 02 00 00 mov 0x248(%rbp),%r15
ffffffff804b7071: 7 0f 85 0d 01 00 00 jne ffffffff804b7184 <ip_queue_xmit+0x13f>
ffffffff804b7077: 452 31 f6 xor %esi,%esi
ffffffff804b7079: 0 48 89 ef mov %rbp,%rdi
ffffffff804b707c: 5 e8 c1 eb fc ff callq ffffffff80485c42 <__sk_dst_check>
ffffffff804b7081: 434 48 85 c0 test %rax,%rax
ffffffff804b7084: 54 48 89 44 24 58 mov %rax,0x58(%rsp)
ffffffff804b7089: 0 0f 85 e0 00 00 00 jne ffffffff804b716f <ip_queue_xmit+0x12a>
ffffffff804b708f: 0 4d 85 ff test %r15,%r15
ffffffff804b7092: 0 44 8b ad 30 02 00 00 mov 0x230(%rbp),%r13d
ffffffff804b7099: 0 74 0a je ffffffff804b70a5 <ip_queue_xmit+0x60>
ffffffff804b709b: 0 41 80 7f 05 00 cmpb $0x0,0x5(%r15)
ffffffff804b70a0: 0 74 03 je ffffffff804b70a5 <ip_queue_xmit+0x60>
ffffffff804b70a2: 0 45 8b 2f mov (%r15),%r13d
ffffffff804b70a5: 0 8b 85 3c 02 00 00 mov 0x23c(%rbp),%eax
ffffffff804b70ab: 0 48 8d b5 10 01 00 00 lea 0x110(%rbp),%rsi
ffffffff804b70b2: 0 44 8b 65 04 mov 0x4(%rbp),%r12d
ffffffff804b70b6: 0 bf 0d 00 00 00 mov $0xd,%edi
ffffffff804b70bb: 0 89 44 24 0c mov %eax,0xc(%rsp)
ffffffff804b70bf: 0 8a 9d 54 02 00 00 mov 0x254(%rbp),%bl
ffffffff804b70c5: 0 e8 9a df ff ff callq ffffffff804b5064 <constant_test_bit>
ffffffff804b70ca: 0 31 d2 xor %edx,%edx
ffffffff804b70cc: 0 48 8d 7c 24 10 lea 0x10(%rsp),%rdi
ffffffff804b70d1: 0 41 89 c3 mov %eax,%r11d
ffffffff804b70d4: 0 fc cld
ffffffff804b70d5: 0 89 d0 mov %edx,%eax
ffffffff804b70d7: 0 b9 10 00 00 00 mov $0x10,%ecx
ffffffff804b70dc: 0 44 8a 45 39 mov 0x39(%rbp),%r8b
ffffffff804b70e0: 0 40 8a b5 57 02 00 00 mov 0x257(%rbp),%sil
ffffffff804b70e7: 0 44 8b 8d 50 02 00 00 mov 0x250(%rbp),%r9d
ffffffff804b70ee: 0 83 e3 1e and $0x1e,%ebx
ffffffff804b70f1: 0 44 8b 95 38 02 00 00 mov 0x238(%rbp),%r10d
ffffffff804b70f8: 0 44 09 db or %r11d,%ebx
ffffffff804b70fb: 0 f3 ab rep stos %eax,%es:(%rdi)
ffffffff804b70fd: 0 40 c0 ee 05 shr $0x5,%sil
ffffffff804b7101: 0 88 5c 24 24 mov %bl,0x24(%rsp)
ffffffff804b7105: 0 48 8d 5c 24 10 lea 0x10(%rsp),%rbx
ffffffff804b710a: 0 83 e6 01 and $0x1,%esi
ffffffff804b710d: 0 48 89 ef mov %rbp,%rdi
ffffffff804b7110: 0 44 88 44 24 40 mov %r8b,0x40(%rsp)
ffffffff804b7115: 0 8b 44 24 0c mov 0xc(%rsp),%eax
ffffffff804b7119: 0 40 88 74 24 41 mov %sil,0x41(%rsp)
ffffffff804b711e: 0 48 89 de mov %rbx,%rsi
ffffffff804b7121: 0 66 44 89 4c 24 44 mov %r9w,0x44(%rsp)
ffffffff804b7127: 0 66 44 89 54 24 46 mov %r10w,0x46(%rsp)
ffffffff804b712d: 0 44 89 64 24 10 mov %r12d,0x10(%rsp)
ffffffff804b7132: 0 44 89 6c 24 1c mov %r13d,0x1c(%rsp)
ffffffff804b7137: 0 89 44 24 20 mov %eax,0x20(%rsp)
ffffffff804b713b: 0 e8 2d 9f e5 ff callq ffffffff8031106d <security_sk_classify_flow>
ffffffff804b7140: 0 48 8d 74 24 58 lea 0x58(%rsp),%rsi
ffffffff804b7145: 0 45 31 c0 xor %r8d,%r8d
ffffffff804b7148: 0 48 89 e9 mov %rbp,%rcx
ffffffff804b714b: 0 48 89 da mov %rbx,%rdx
ffffffff804b714e: 0 48 c7 c7 d0 15 ab 80 mov $0xffffffff80ab15d0,%rdi
ffffffff804b7155: 0 e8 1a 91 ff ff callq ffffffff804b0274 <ip_route_output_flow>
ffffffff804b715a: 0 85 c0 test %eax,%eax
ffffffff804b715c: 0 0f 85 9f 01 00 00 jne ffffffff804b7301 <ip_queue_xmit+0x2bc>
ffffffff804b7162: 0 48 8b 74 24 58 mov 0x58(%rsp),%rsi
ffffffff804b7167: 0 48 89 ef mov %rbp,%rdi
ffffffff804b716a: 0 e8 a8 eb fc ff callq ffffffff80485d17 <sk_setup_caps>
ffffffff804b716f: 441 48 8b 44 24 58 mov 0x58(%rsp),%rax
ffffffff804b7174: 1388 48 85 c0 test %rax,%rax
ffffffff804b7177: 0 74 07 je ffffffff804b7180 <ip_queue_xmit+0x13b>
ffffffff804b7179: 0 f0 ff 80 b0 00 00 00 lock incl 0xb0(%rax)
ffffffff804b7180: 556 49 89 46 28 mov %rax,0x28(%r14)
ffffffff804b7184: 8351 4d 85 ff test %r15,%r15
ffffffff804b7187: 0 be 14 00 00 00 mov $0x14,%esi
ffffffff804b718c: 461 74 26 je ffffffff804b71b4 <ip_queue_xmit+0x16f>
ffffffff804b718e: 0 41 f6 47 08 01 testb $0x1,0x8(%r15)
ffffffff804b7193: 0 74 17 je ffffffff804b71ac <ip_queue_xmit+0x167>
ffffffff804b7195: 0 48 8b 54 24 58 mov 0x58(%rsp),%rdx
ffffffff804b719a: 0 8b 82 28 01 00 00 mov 0x128(%rdx),%eax
ffffffff804b71a0: 0 39 82 1c 01 00 00 cmp %eax,0x11c(%rdx)
ffffffff804b71a6: 0 0f 85 55 01 00 00 jne ffffffff804b7301 <ip_queue_xmit+0x2bc>
ffffffff804b71ac: 0 41 0f b6 47 04 movzbl 0x4(%r15),%eax
ffffffff804b71b1: 0 8d 70 14 lea 0x14(%rax),%esi
ffffffff804b71b4: 39 4c 89 f7 mov %r14,%rdi
ffffffff804b71b7: 493 e8 f8 18 fd ff callq ffffffff80488ab4 <skb_push>
ffffffff804b71bc: 0 4c 89 f7 mov %r14,%rdi
ffffffff804b71bf: 1701 e8 99 df ff ff callq ffffffff804b515d <skb_reset_network_header>
ffffffff804b71c4: 481 0f b6 85 54 02 00 00 movzbl 0x254(%rbp),%eax
ffffffff804b71cb: 4202 41 8b 9e bc 00 00 00 mov 0xbc(%r14),%ebx
ffffffff804b71d2: 3 48 89 ef mov %rbp,%rdi
ffffffff804b71d5: 0 49 03 9e d0 00 00 00 add 0xd0(%r14),%rbx
ffffffff804b71dc: 466 80 cc 45 or $0x45,%ah
ffffffff804b71df: 7 66 c1 c0 08 rol $0x8,%ax
ffffffff804b71e3: 0 66 89 03 mov %ax,(%rbx)
ffffffff804b71e6: 492 48 8b 74 24 58 mov 0x58(%rsp),%rsi
ffffffff804b71eb: 3 e8 a0 df ff ff callq ffffffff804b5190 <ip_dont_fragment>
ffffffff804b71f0: 1405 85 c0 test %eax,%eax
ffffffff804b71f2: 4391 74 0f je ffffffff804b7203 <ip_queue_xmit+0x1be>
ffffffff804b71f4: 0 83 7c 24 08 00 cmpl $0x0,0x8(%rsp)
ffffffff804b71f9: 417 75 08 jne ffffffff804b7203 <ip_queue_xmit+0x1be>
ffffffff804b71fb: 503 66 c7 43 06 40 00 movw $0x40,0x6(%rbx)
ffffffff804b7201: 6743 eb 06 jmp ffffffff804b7209 <ip_queue_xmit+0x1c4>
ffffffff804b7203: 0 66 c7 43 06 00 00 movw $0x0,0x6(%rbx)
ffffffff804b7209: 118 0f bf 85 40 02 00 00 movswl 0x240(%rbp),%eax
ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx
ffffffff804b7215: 340 85 c0 test %eax,%eax
ffffffff804b7217: 0 79 06 jns ffffffff804b721f <ip_queue_xmit+0x1da>
ffffffff804b7219: 107464 8b 82 9c 00 00 00 mov 0x9c(%rdx),%eax
ffffffff804b721f: 4963 88 43 08 mov %al,0x8(%rbx)
ffffffff804b7222: 26297 8a 45 39 mov 0x39(%rbp),%al
ffffffff804b7225: 76658 4d 85 ff test %r15,%r15
ffffffff804b7228: 1712 88 43 09 mov %al,0x9(%rbx)
ffffffff804b722b: 148 48 8b 44 24 58 mov 0x58(%rsp),%rax
ffffffff804b7230: 2971 8b 80 20 01 00 00 mov 0x120(%rax),%eax
ffffffff804b7236: 14849 89 43 0c mov %eax,0xc(%rbx)
ffffffff804b7239: 84 48 8b 44 24 58 mov 0x58(%rsp),%rax
ffffffff804b723e: 360 8b 80 1c 01 00 00 mov 0x11c(%rax),%eax
ffffffff804b7244: 174 89 43 10 mov %eax,0x10(%rbx)
ffffffff804b7247: 96 74 32 je ffffffff804b727b <ip_queue_xmit+0x236>
ffffffff804b7249: 0 41 8a 57 04 mov 0x4(%r15),%dl
ffffffff804b724d: 0 84 d2 test %dl,%dl
ffffffff804b724f: 0 74 2a je ffffffff804b727b <ip_queue_xmit+0x236>
ffffffff804b7251: 0 c0 ea 02 shr $0x2,%dl
ffffffff804b7254: 0 03 13 add (%rbx),%edx
ffffffff804b7256: 0 8a 03 mov (%rbx),%al
ffffffff804b7258: 0 45 31 c0 xor %r8d,%r8d
ffffffff804b725b: 0 4c 89 fe mov %r15,%rsi
ffffffff804b725e: 0 4c 89 f7 mov %r14,%rdi
ffffffff804b7261: 0 83 e0 f0 and $0xfffffffffffffff0,%eax
ffffffff804b7264: 0 83 e2 0f and $0xf,%edx
ffffffff804b7267: 0 09 d0 or %edx,%eax
ffffffff804b7269: 0 88 03 mov %al,(%rbx)
ffffffff804b726b: 0 48 8b 4c 24 58 mov 0x58(%rsp),%rcx
ffffffff804b7270: 0 8b 95 30 02 00 00 mov 0x230(%rbp),%edx
ffffffff804b7276: 0 e8 e4 d8 ff ff callq ffffffff804b4b5f <ip_options_build>
ffffffff804b727b: 541 41 8b 86 c8 00 00 00 mov 0xc8(%r14),%eax
ffffffff804b7282: 570 31 d2 xor %edx,%edx
ffffffff804b7284: 0 49 03 86 d0 00 00 00 add 0xd0(%r14),%rax
ffffffff804b728b: 34 8b 40 08 mov 0x8(%rax),%eax
ffffffff804b728e: 496 66 85 c0 test %ax,%ax
ffffffff804b7291: 11 74 06 je ffffffff804b7299 <ip_queue_xmit+0x254>
ffffffff804b7293: 9 0f b7 c0 movzwl %ax,%eax
ffffffff804b7296: 495 8d 50 ff lea -0x1(%rax),%edx
ffffffff804b7299: 2 f6 43 06 40 testb $0x40,0x6(%rbx)
ffffffff804b729d: 9 48 8b 74 24 58 mov 0x58(%rsp),%rsi
ffffffff804b72a2: 497 74 34 je ffffffff804b72d8 <ip_queue_xmit+0x293>
ffffffff804b72a4: 8 83 bd 30 02 00 00 00 cmpl $0x0,0x230(%rbp)
ffffffff804b72ab: 10 74 23 je ffffffff804b72d0 <ip_queue_xmit+0x28b>
ffffffff804b72ad: 1044 66 8b 85 52 02 00 00 mov 0x252(%rbp),%ax
ffffffff804b72b4: 7 66 c1 c0 08 rol $0x8,%ax
ffffffff804b72b8: 8 66 89 43 04 mov %ax,0x4(%rbx)
ffffffff804b72bc: 432 66 8b 85 52 02 00 00 mov 0x252(%rbp),%ax
ffffffff804b72c3: 9 ff c0 inc %eax
ffffffff804b72c5: 14 01 d0 add %edx,%eax
ffffffff804b72c7: 1141 66 89 85 52 02 00 00 mov %ax,0x252(%rbp)
ffffffff804b72ce: 7 eb 10 jmp ffffffff804b72e0 <ip_queue_xmit+0x29b>
ffffffff804b72d0: 0 66 c7 43 04 00 00 movw $0x0,0x4(%rbx)
ffffffff804b72d6: 0 eb 08 jmp ffffffff804b72e0 <ip_queue_xmit+0x29b>
ffffffff804b72d8: 0 48 89 df mov %rbx,%rdi
ffffffff804b72db: 0 e8 b7 9d ff ff callq ffffffff804b1097 <__ip_select_ident>
ffffffff804b72e0: 6 8b 85 54 01 00 00 mov 0x154(%rbp),%eax
ffffffff804b72e6: 458 4c 89 f7 mov %r14,%rdi
ffffffff804b72e9: 2 41 89 46 78 mov %eax,0x78(%r14)
ffffffff804b72ed: 4 8b 85 f0 01 00 00 mov 0x1f0(%rbp),%eax
ffffffff804b72f3: 841 41 89 86 b0 00 00 00 mov %eax,0xb0(%r14)
ffffffff804b72fa: 11 e8 30 f2 ff ff callq ffffffff804b652f <ip_local_out>
ffffffff804b72ff: 0 eb 44 jmp ffffffff804b7345 <ip_queue_xmit+0x300>
ffffffff804b7301: 0 65 48 8b 04 25 10 00 mov %gs:0x10,%rax
ffffffff804b7308: 0 00 00
ffffffff804b730a: 0 8b 80 48 e0 ff ff mov -0x1fb8(%rax),%eax
ffffffff804b7310: 0 4c 89 f7 mov %r14,%rdi
ffffffff804b7313: 0 30 c0 xor %al,%al
ffffffff804b7315: 0 66 83 f8 01 cmp $0x1,%ax
ffffffff804b7319: 0 48 19 c0 sbb %rax,%rax
ffffffff804b731c: 0 83 e0 08 and $0x8,%eax
ffffffff804b731f: 0 48 8b 90 a8 16 ab 80 mov -0x7f54e958(%rax),%rdx
ffffffff804b7326: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax
ffffffff804b732d: 0 00
ffffffff804b732e: 0 89 c0 mov %eax,%eax
ffffffff804b7330: 0 48 f7 d2 not %rdx
ffffffff804b7333: 0 48 8b 04 c2 mov (%rdx,%rax,8),%rax
ffffffff804b7337: 0 48 ff 40 68 incq 0x68(%rax)
ffffffff804b733b: 0 e8 b1 18 fd ff callq ffffffff80488bf1 <kfree_skb>
ffffffff804b7340: 0 b8 8f ff ff ff mov $0xffffff8f,%eax
ffffffff804b7345: 9196 48 83 c4 68 add $0x68,%rsp
ffffffff804b7349: 892 5b pop %rbx
ffffffff804b734a: 0 5d pop %rbp
ffffffff804b734b: 488 41 5c pop %r12
ffffffff804b734d: 0 41 5d pop %r13
ffffffff804b734f: 0 41 5e pop %r14
ffffffff804b7351: 513 41 5f pop %r15
ffffffff804b7353: 0 c3 retq

about 10% of this function's cost is artificial:

ffffffff804b7045: 1001 <ip_queue_xmit>:
ffffffff804b7045: 1001 41 57 push %r15
ffffffff804b7047: 36698 41 56 push %r14

there are profiler hits that leaked in via out-of-order execution from
the callsites. The callsites are hard to map, unfortunately, as this
function is called via function pointers.

The most likely callsite is tcp_transmit_skb().

30% of the overhead of this function comes from:

ffffffff804b7203: 0 66 c7 43 06 00 00 movw $0x0,0x6(%rbx)
ffffffff804b7209: 118 0f bf 85 40 02 00 00 movswl 0x240(%rbp),%eax
ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx
ffffffff804b7215: 340 85 c0 test %eax,%eax
ffffffff804b7217: 0 79 06 jns ffffffff804b721f <ip_queue_xmit+0x1da>
ffffffff804b7219: 107464 8b 82 9c 00 00 00 mov 0x9c(%rdx),%eax
ffffffff804b721f: 4963 88 43 08 mov %al,0x8(%rbx)

the 16-bit movw looks a bit weird. It comes from line 372:

0xffffffff804b7203 is in ip_queue_xmit (net/ipv4/ip_output.c:372).
367 iph = ip_hdr(skb);
368 *((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
369 if (ip_dont_fragment(sk, &rt->u.dst) && !ipfragok)
370 iph->frag_off = htons(IP_DF);
371 else
372 iph->frag_off = 0;
373 iph->ttl = ip_select_ttl(inet, &rt->u.dst);
374 iph->protocol = sk->sk_protocol;
375 iph->saddr = rt->rt_src;
376 iph->daddr = rt->rt_dst;

i.e. the IP-header fragment field being set to zero.

16-bit ops are an on-off love/hate affair on x86 CPUs. The trend is
towards eliminating them as much as possible.
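
For instance (an illustrative sketch, not a proposed patch; offsets as
in struct iphdr), the 16-bit frag_off store could be folded into a
single 32-bit store that also covers ttl and protocol:

#include <stdint.h>
#include <string.h>

/* bytes 6..9 of the IPv4 header: frag_off, ttl, protocol */
struct iph_mid {
        uint16_t frag_off;
        uint8_t  ttl;
        uint8_t  protocol;
};

static void ip_set_mid(uint8_t *iph, uint16_t frag_off,
                       uint8_t ttl, uint8_t protocol)
{
        struct iph_mid m = { frag_off, ttl, protocol };

        /* 4 bytes total, so the compiler can emit one 32-bit store
           instead of a movw plus two byte stores */
        memcpy(iph + 6, &m, sizeof(m));
}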

_But_, the real overhead probably comes from:

ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx

which is the next line, the ttl field:

373 iph->ttl = ip_select_ttl(inet, &rt->u.dst);

this shows that we are doing a hard cachemiss on the net-localhost
route dst structure cacheline. We do a plain load instruction from it
here and get a hefty cachemiss. (because 16 CPUs are banging on that
single route)

And let's make sure we see this in perspective as well: that single
cachemiss is _1.0 percent_ of the total tbench cost (!) - 107464 of
ip_queue_xmit's 335615 hits, in a function that is 3.36% of the total,
i.e. 0.32 * 3.36% ~= 1.07%. We could make the scheduler 10% slower
straight away and it would have less of a real-life effect than this
single iph->ttl field setting.

Ingo

2008-11-17 20:48:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> 100.000000 total
> ................
> 3.038025 skb_release_data

hits (303802 total)
.........
ffffffff80488c7e: 780 <skb_release_data>:
ffffffff80488c7e: 780 55 push %rbp
ffffffff80488c7f: 267141 53 push %rbx
ffffffff80488c80: 0 48 89 fb mov %rdi,%rbx
ffffffff80488c83: 3552 48 83 ec 08 sub $0x8,%rsp
ffffffff80488c87: 604 8a 47 7c mov 0x7c(%rdi),%al
ffffffff80488c8a: 2644 a8 02 test $0x2,%al
ffffffff80488c8c: 49 74 2a je ffffffff80488cb8 <skb_release_data+0x3a>
ffffffff80488c8e: 0 83 e0 10 and $0x10,%eax
ffffffff80488c91: 2079 8b 97 c8 00 00 00 mov 0xc8(%rdi),%edx
ffffffff80488c97: 53 3c 01 cmp $0x1,%al
ffffffff80488c99: 0 19 c0 sbb %eax,%eax
ffffffff80488c9b: 870 48 03 97 d0 00 00 00 add 0xd0(%rdi),%rdx
ffffffff80488ca2: 65 66 31 c0 xor %ax,%ax
ffffffff80488ca5: 0 05 01 00 01 00 add $0x10001,%eax
ffffffff80488caa: 888 f7 d8 neg %eax
ffffffff80488cac: 49 89 c1 mov %eax,%ecx
ffffffff80488cae: 0 f0 0f c1 0a lock xadd %ecx,(%rdx)
ffffffff80488cb2: 1909 01 c8 add %ecx,%eax
ffffffff80488cb4: 1040 85 c0 test %eax,%eax
ffffffff80488cb6: 0 75 6d jne ffffffff80488d25 <skb_release_data+0xa7>
ffffffff80488cb8: 0 8b 93 c8 00 00 00 mov 0xc8(%rbx),%edx
ffffffff80488cbe: 4199 48 8b 83 d0 00 00 00 mov 0xd0(%rbx),%rax
ffffffff80488cc5: 4995 31 ed xor %ebp,%ebp
ffffffff80488cc7: 0 66 83 7c 10 04 00 cmpw $0x0,0x4(%rax,%rdx,1)
ffffffff80488ccd: 983 75 15 jne ffffffff80488ce4 <skb_release_data+0x66>
ffffffff80488ccf: 15 eb 28 jmp ffffffff80488cf9 <skb_release_data+0x7b>
ffffffff80488cd1: 665 48 63 c5 movslq %ebp,%rax
ffffffff80488cd4: 546 ff c5 inc %ebp
ffffffff80488cd6: 328 48 c1 e0 04 shl $0x4,%rax
ffffffff80488cda: 356 48 8b 7c 02 20 mov 0x20(%rdx,%rax,1),%rdi
ffffffff80488cdf: 95 e8 be 87 de ff callq ffffffff802714a2 <put_page>
ffffffff80488ce4: 66 8b 93 c8 00 00 00 mov 0xc8(%rbx),%edx
ffffffff80488cea: 1321 48 03 93 d0 00 00 00 add 0xd0(%rbx),%rdx
ffffffff80488cf1: 439 0f b7 42 04 movzwl 0x4(%rdx),%eax
ffffffff80488cf5: 0 39 c5 cmp %eax,%ebp
ffffffff80488cf7: 1887 7c d8 jl ffffffff80488cd1 <skb_release_data+0x53>
ffffffff80488cf9: 2187 8b 93 c8 00 00 00 mov 0xc8(%rbx),%edx
ffffffff80488cff: 1784 48 8b 83 d0 00 00 00 mov 0xd0(%rbx),%rax
ffffffff80488d06: 422 48 83 7c 10 18 00 cmpq $0x0,0x18(%rax,%rdx,1)
ffffffff80488d0c: 110 74 08 je ffffffff80488d16 <skb_release_data+0x98>
ffffffff80488d0e: 0 48 89 df mov %rbx,%rdi
ffffffff80488d11: 0 e8 52 ff ff ff callq ffffffff80488c68 <skb_drop_fraglist>
ffffffff80488d16: 14 48 8b bb d0 00 00 00 mov 0xd0(%rbx),%rdi
ffffffff80488d1d: 715 5e pop %rsi
ffffffff80488d1e: 109 5b pop %rbx
ffffffff80488d1f: 20 5d pop %rbp
ffffffff80488d20: 980 e9 b7 66 e0 ff jmpq ffffffff8028f3dc <kfree>
ffffffff80488d25: 0 59 pop %rcx
ffffffff80488d26: 1948 5b pop %rbx
ffffffff80488d27: 0 5d pop %rbp
ffffffff80488d28: 0 c3 retq

this is a short function, and ~90% of the overhead is falsely
attributed - leaked in from the callsites:

ffffffff80488c7f: 267141 53 push %rbx

unfortunately I have a hard time mapping its callsites.
pskb_expand_head() is the only static callsite, but it's not active in
the profile.

The _usual_ callsite is normally skb_release_all(), which does have
overhead:

ffffffff80489449: 925 <skb_release_all>:
ffffffff80489449: 925 53 push %rbx
ffffffff8048944a: 5249 48 89 fb mov %rdi,%rbx
ffffffff8048944d: 4 e8 3c ff ff ff callq ffffffff8048938e <skb_release_head_state>
ffffffff80489452: 1149 48 89 df mov %rbx,%rdi
ffffffff80489455: 13163 5b pop %rbx
ffffffff80489456: 0 e9 23 f8 ff ff jmpq ffffffff80488c7e <skb_release_data>

it is also tail-call-optimized, which explains why I found so few
callsites. The main callsite of skb_release_all() is:

ffffffff80488b86: 26 e8 be 08 00 00 callq ffffffff80489449 <skb_release_all>

which is __kfree_skb(). That is a frequently referenced function, and
in my profile there's a single callsite active:

ffffffff804c1027: 432 e8 56 7b fc ff callq ffffffff80488b82 <__kfree_skb>

which is tcp_ack() - subject of a later email. The wider context is:

ffffffff804c0ffc: 433 41 2b 85 e0 00 00 00 sub 0xe0(%r13),%eax
ffffffff804c1003: 4843 89 85 f0 00 00 00 mov %eax,0xf0(%rbp)
ffffffff804c1009: 1730 48 8b 45 30 mov 0x30(%rbp),%rax
ffffffff804c100d: 311 41 8b 95 e0 00 00 00 mov 0xe0(%r13),%edx
ffffffff804c1014: 0 48 83 b8 b0 00 00 00 cmpq $0x0,0xb0(%rax)
ffffffff804c101b: 0 00
ffffffff804c101c: 418 74 06 je ffffffff804c1024 <tcp_ack+0x50d>
ffffffff804c101e: 37 01 95 f4 00 00 00 add %edx,0xf4(%rbp)
ffffffff804c1024: 2 4c 89 ef mov %r13,%rdi
ffffffff804c1027: 432 e8 56 7b fc ff callq ffffffff80488b82 <__kfree_skb>

this is a good, top-of-the-line x86 CPU with a really good BTB
implementation that seems to be able to fall through calls and tail
optimizations as if they weren't there.

some guesses are:

(gdb) list *0xffffffff804c1003
0xffffffff804c1003 is in tcp_ack (include/net/sock.h:789).
784
785 static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
786 {
787 skb_truesize_check(skb);
788 sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
789 sk->sk_wmem_queued -= skb->truesize;
790 sk_mem_uncharge(sk, skb->truesize);
791 __kfree_skb(skb);
792 }
793

both sk and skb should be cache-hot here so this seems unlikely.

(gdb) list *0xffffffff804c1009
0xffffffff804c1009 is in tcp_ack (include/net/sock.h:736).
731 }
732
733 static inline int sk_has_account(struct sock *sk)
734 {
735 /* return true if protocol supports memory accounting */
736 return !!sk->sk_prot->memory_allocated;
737 }
738
739 static inline int sk_wmem_schedule(struct sock *sk, int size)
740 {

this cannot be it - unless sk_prot somehow ends up being dirtied or
false-shared?

Still, my guess would be on ffffffff804c1009 and a
sk_prot->memory_allocated cachemiss: look at how this instruction uses
%ebp, and the one that shows the many hits in skb_release_data()
pushes %ebp to the stack - that's where the CPU's OOO trick ends: it
has to compute the result and serialize on the cachemiss.

Ingo

2008-11-17 20:56:18

by Ingo Molnar

[permalink] [raw]
Subject: skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> 100.000000 total
> ................
> 2.118525 skb_release_head_state

hits (total: 211852)
.........
ffffffff8048938e: 967 <skb_release_head_state>:
ffffffff8048938e: 967 53 push %rbx
ffffffff8048938f: 3975 48 89 fb mov %rdi,%rbx
ffffffff80489392: 17 48 8b 7f 28 mov 0x28(%rdi),%rdi
ffffffff80489396: 0 e8 9c 93 00 00 callq ffffffff80492737 <dst_release>
ffffffff8048939b: 6 48 8b 7b 30 mov 0x30(%rbx),%rdi
ffffffff8048939f: 2887 48 85 ff test %rdi,%rdi
ffffffff804893a2: 859 74 0f je ffffffff804893b3 <skb_release_head_state+0x25>
ffffffff804893a4: 0 f0 ff 0f lock decl (%rdi)
ffffffff804893a7: 0 0f 94 c0 sete %al
ffffffff804893aa: 0 84 c0 test %al,%al
ffffffff804893ac: 0 74 05 je ffffffff804893b3 <skb_release_head_state+0x25>
ffffffff804893ae: 0 e8 7a 14 06 00 callq ffffffff804ea82d <__secpath_destroy>
ffffffff804893b3: 16 48 83 bb 80 00 00 00 cmpq $0x0,0x80(%rbx)
ffffffff804893ba: 0 00
ffffffff804893bb: 4294 74 31 je ffffffff804893ee <skb_release_head_state+0x60>
ffffffff804893bd: 0 65 48 8b 04 25 10 00 mov %gs:0x10,%rax
ffffffff804893c4: 0 00 00
ffffffff804893c6: 6540 48 63 80 48 e0 ff ff movslq -0x1fb8(%rax),%rax
ffffffff804893cd: 14 a9 00 00 ff 0f test $0xfff0000,%eax
ffffffff804893d2: 471 74 11 je ffffffff804893e5 <skb_release_head_state+0x57>
ffffffff804893d4: 0 be 89 01 00 00 mov $0x189,%esi
ffffffff804893d9: 0 48 c7 c7 cc b1 6a 80 mov $0xffffffff806ab1cc,%rdi
ffffffff804893e0: 0 e8 d0 cd da ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804893e5: 0 48 89 df mov %rbx,%rdi
ffffffff804893e8: 1733 ff 93 80 00 00 00 callq *0x80(%rbx)
ffffffff804893ee: 888 48 8b bb 88 00 00 00 mov 0x88(%rbx),%rdi
ffffffff804893f5: 3959 48 85 ff test %rdi,%rdi
ffffffff804893f8: 0 74 0f je ffffffff80489409 <skb_release_head_state+0x7b>
ffffffff804893fa: 0 f0 ff 0f lock decl (%rdi)
ffffffff804893fd: 0 0f 94 c0 sete %al
ffffffff80489400: 0 84 c0 test %al,%al
ffffffff80489402: 0 74 05 je ffffffff80489409 <skb_release_head_state+0x7b>
ffffffff80489404: 0 e8 48 f2 01 00 callq ffffffff804a8651 <nf_conntrack_destroy>
ffffffff80489409: 0 48 8b bb 90 00 00 00 mov 0x90(%rbx),%rdi
ffffffff80489410: 3132 48 85 ff test %rdi,%rdi
ffffffff80489413: 1 74 05 je ffffffff8048941a <skb_release_head_state+0x8c>
ffffffff80489415: 0 e8 d7 f7 ff ff callq ffffffff80488bf1 <kfree_skb>
ffffffff8048941a: 958 48 8b bb 98 00 00 00 mov 0x98(%rbx),%rdi
ffffffff80489421: 1999 48 85 ff test %rdi,%rdi
ffffffff80489424: 0 74 0f je ffffffff80489435 <skb_release_head_state+0xa7>
ffffffff80489426: 0 f0 ff 0f lock decl (%rdi)
ffffffff80489429: 0 0f 94 c0 sete %al
ffffffff8048942c: 0 84 c0 test %al,%al
ffffffff8048942e: 0 74 05 je ffffffff80489435 <skb_release_head_state+0xa7>
ffffffff80489430: 0 e8 a7 5f e0 ff callq ffffffff8028f3dc <kfree>
ffffffff80489435: 0 66 c7 83 a6 00 00 00 movw $0x0,0xa6(%rbx)
ffffffff8048943c: 0 00 00
ffffffff8048943e: 6503 66 c7 83 a8 00 00 00 movw $0x0,0xa8(%rbx)
ffffffff80489445: 0 00 00
ffffffff80489447: 174101 5b pop %rbx
ffffffff80489448: 0 c3 retq

this function _really_ hurts from a 16-bit op:

ffffffff8048943e: 6503 66 c7 83 a8 00 00 00 movw $0x0,0xa8(%rbx)
ffffffff80489445: 0 00 00
ffffffff80489447: 174101 5b pop %rbx

(gdb) list *0xffffffff8048943e
0xffffffff8048943e is in skb_release_head_state
(net/core/skbuff.c:407).
402 #endif
403 /* XXX: IS this still necessary? - JHS */
404 #ifdef CONFIG_NET_SCHED
405 skb->tc_index = 0;
406 #ifdef CONFIG_NET_CLS_ACT
407 skb->tc_verd = 0;
408 #endif
409 #endif
410 }
411

dirtying skb->tc_verd. I do have:

CONFIG_NET_CLS_ACT=y

BUT, on a second look, I don't think it's really this 16-bit op that
hurts us. The wider context is:

ffffffff80489426: 0 f0 ff 0f lock decl (%rdi)
ffffffff80489429: 0 0f 94 c0 sete %al
ffffffff8048942c: 0 84 c0 test %al,%al
ffffffff8048942e: 0 74 05 je ffffffff80489435 <skb_release_head_state+0xa7>
ffffffff80489430: 0 e8 a7 5f e0 ff callq ffffffff8028f3dc <kfree>
ffffffff80489435: 0 66 c7 83 a6 00 00 00 movw $0x0,0xa6(%rbx)
ffffffff8048943c: 0 00 00
ffffffff8048943e: 6503 66 c7 83 a8 00 00 00 movw $0x0,0xa8(%rbx)
ffffffff80489445: 0 00 00
ffffffff80489447: 174101 5b pop %rbx
ffffffff80489448: 0 c3 retq

look how we jump over the callq most of the time - so what we are
seeing here, I believe, is the cost of the atomic op at
ffffffff80489426. That comes from:

(gdb) list *0xffffffff8048942e
0xffffffff8048942e is in skb_release_head_state (include/linux/skbuff.h:1783).
1778 }
1779 #endif
1780 #ifdef CONFIG_BRIDGE_NETFILTER
1781 static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge)
1782 {
1783 if (nf_bridge && atomic_dec_and_test(&nf_bridge->use))
1784 kfree(nf_bridge);
1785 }
1786 static inline void nf_bridge_get(struct nf_bridge_info *nf_bridge)
1787 {

and ouch, does that global dec on &nf_bridge->use hurt!

I do have:

CONFIG_BRIDGE_NETFILTER=y

(this is a Fedora distro kernel derived .config)

Ingo

2008-11-17 20:56:55

by Eric Dumazet

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> 100.000000 total
>> ................
>> 3.038025 skb_release_data
>
> hits (303802 total)
> .........
> ffffffff80488c7e: 780 <skb_release_data>:
> ffffffff80488c7e: 780 55 push %rbp
> ffffffff80488c7f: 267141 53 push %rbx
> ffffffff80488c80: 0 48 89 fb mov %rdi,%rbx
> ffffffff80488c83: 3552 48 83 ec 08 sub $0x8,%rsp
> ffffffff80488c87: 604 8a 47 7c mov 0x7c(%rdi),%al
> ffffffff80488c8a: 2644 a8 02 test $0x2,%al
> ffffffff80488c8c: 49 74 2a je ffffffff80488cb8 <skb_release_data+0x3a>
> ffffffff80488c8e: 0 83 e0 10 and $0x10,%eax
> ffffffff80488c91: 2079 8b 97 c8 00 00 00 mov 0xc8(%rdi),%edx
> ffffffff80488c97: 53 3c 01 cmp $0x1,%al
> ffffffff80488c99: 0 19 c0 sbb %eax,%eax
> ffffffff80488c9b: 870 48 03 97 d0 00 00 00 add 0xd0(%rdi),%rdx
> ffffffff80488ca2: 65 66 31 c0 xor %ax,%ax
> ffffffff80488ca5: 0 05 01 00 01 00 add $0x10001,%eax
> ffffffff80488caa: 888 f7 d8 neg %eax
> ffffffff80488cac: 49 89 c1 mov %eax,%ecx
> ffffffff80488cae: 0 f0 0f c1 0a lock xadd %ecx,(%rdx)
> ffffffff80488cb2: 1909 01 c8 add %ecx,%eax
> ffffffff80488cb4: 1040 85 c0 test %eax,%eax
> ffffffff80488cb6: 0 75 6d jne ffffffff80488d25 <skb_release_data+0xa7>
> ffffffff80488cb8: 0 8b 93 c8 00 00 00 mov 0xc8(%rbx),%edx
> ffffffff80488cbe: 4199 48 8b 83 d0 00 00 00 mov 0xd0(%rbx),%rax
> ffffffff80488cc5: 4995 31 ed xor %ebp,%ebp
> ffffffff80488cc7: 0 66 83 7c 10 04 00 cmpw $0x0,0x4(%rax,%rdx,1)
> ffffffff80488ccd: 983 75 15 jne ffffffff80488ce4 <skb_release_data+0x66>
> ffffffff80488ccf: 15 eb 28 jmp ffffffff80488cf9 <skb_release_data+0x7b>
> ffffffff80488cd1: 665 48 63 c5 movslq %ebp,%rax
> ffffffff80488cd4: 546 ff c5 inc %ebp
> ffffffff80488cd6: 328 48 c1 e0 04 shl $0x4,%rax
> ffffffff80488cda: 356 48 8b 7c 02 20 mov 0x20(%rdx,%rax,1),%rdi
> ffffffff80488cdf: 95 e8 be 87 de ff callq ffffffff802714a2 <put_page>
> ffffffff80488ce4: 66 8b 93 c8 00 00 00 mov 0xc8(%rbx),%edx
> ffffffff80488cea: 1321 48 03 93 d0 00 00 00 add 0xd0(%rbx),%rdx
> ffffffff80488cf1: 439 0f b7 42 04 movzwl 0x4(%rdx),%eax
> ffffffff80488cf5: 0 39 c5 cmp %eax,%ebp
> ffffffff80488cf7: 1887 7c d8 jl ffffffff80488cd1 <skb_release_data+0x53>
> ffffffff80488cf9: 2187 8b 93 c8 00 00 00 mov 0xc8(%rbx),%edx
> ffffffff80488cff: 1784 48 8b 83 d0 00 00 00 mov 0xd0(%rbx),%rax
> ffffffff80488d06: 422 48 83 7c 10 18 00 cmpq $0x0,0x18(%rax,%rdx,1)
> ffffffff80488d0c: 110 74 08 je ffffffff80488d16 <skb_release_data+0x98>
> ffffffff80488d0e: 0 48 89 df mov %rbx,%rdi
> ffffffff80488d11: 0 e8 52 ff ff ff callq ffffffff80488c68 <skb_drop_fraglist>
> ffffffff80488d16: 14 48 8b bb d0 00 00 00 mov 0xd0(%rbx),%rdi
> ffffffff80488d1d: 715 5e pop %rsi
> ffffffff80488d1e: 109 5b pop %rbx
> ffffffff80488d1f: 20 5d pop %rbp
> ffffffff80488d20: 980 e9 b7 66 e0 ff jmpq ffffffff8028f3dc <kfree>
> ffffffff80488d25: 0 59 pop %rcx
> ffffffff80488d26: 1948 5b pop %rbx
> ffffffff80488d27: 0 5d pop %rbp
> ffffffff80488d28: 0 c3 retq
>
> this is a short function, and 90% of the overhead is false leaked-in
> overhead from callsites:
>
> ffffffff80488c7f: 267141 53 push %rbx
>
> unfortunately i have a hard time mapping its callsites.
> pskb_expand_head() is the only static callsite, but it's not active in
> the profile.
>
> The _usual_ callsite is normally skb_release_all(), which does have
> overhead:
>
> ffffffff80489449: 925 <skb_release_all>:
> ffffffff80489449: 925 53 push %rbx
> ffffffff8048944a: 5249 48 89 fb mov %rdi,%rbx
> ffffffff8048944d: 4 e8 3c ff ff ff callq ffffffff8048938e <skb_release_head_state>
> ffffffff80489452: 1149 48 89 df mov %rbx,%rdi
> ffffffff80489455: 13163 5b pop %rbx
> ffffffff80489456: 0 e9 23 f8 ff ff jmpq ffffffff80488c7e <skb_release_data>
>
> it is also tail-optimized, which explains why i found little
> callsites. The main callsite of skb_release_all() is:
>
> ffffffff80488b86: 26 e8 be 08 00 00 callq ffffffff80489449 <skb_release_all>
>
> which is __kfree_skb(). That is a frequently referenced function, and
> in my profile there's a single callsite active:
>
> ffffffff804c1027: 432 e8 56 7b fc ff callq ffffffff80488b82 <__kfree_skb>
>
> which is tcp_ack() - subject of a later email. The wider context is:
>
> ffffffff804c0ffc: 433 41 2b 85 e0 00 00 00 sub 0xe0(%r13),%eax
> ffffffff804c1003: 4843 89 85 f0 00 00 00 mov %eax,0xf0(%rbp)
> ffffffff804c1009: 1730 48 8b 45 30 mov 0x30(%rbp),%rax
> ffffffff804c100d: 311 41 8b 95 e0 00 00 00 mov 0xe0(%r13),%edx
> ffffffff804c1014: 0 48 83 b8 b0 00 00 00 cmpq $0x0,0xb0(%rax)
> ffffffff804c101b: 0 00
> ffffffff804c101c: 418 74 06 je ffffffff804c1024 <tcp_ack+0x50d>
> ffffffff804c101e: 37 01 95 f4 00 00 00 add %edx,0xf4(%rbp)
> ffffffff804c1024: 2 4c 89 ef mov %r13,%rdi
> ffffffff804c1027: 432 e8 56 7b fc ff callq ffffffff80488b82 <__kfree_skb>
>
> this is a good, top-of-the-line x86 CPU with a really good BTB
> implementation that seems to be able to fall through calls and tail
> optimizations as if they werent there.
>
> some guesses are:
>
> (gdb) list *0xffffffff804c1003
> 0xffffffff804c1003 is in tcp_ack (include/net/sock.h:789).
> 784
> 785 static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
> 786 {
> 787 skb_truesize_check(skb);
> 788 sock_set_flag(sk, SOCK_QUEUE_SHRUNK);
> 789 sk->sk_wmem_queued -= skb->truesize;
> 790 sk_mem_uncharge(sk, skb->truesize);
> 791 __kfree_skb(skb);
> 792 }
> 793
>
> both sk and skb should be cache-hot here so this seems unlikely.
>
> (gdb) list *0xffffffff804c1009
> 0xffffffff804c1009 is in tcp_ack (include/net/sock.h:736).
> 731 }
> 732
> 733 static inline int sk_has_account(struct sock *sk)
> 734 {
> 735 /* return true if protocol supports memory accounting */
> 736 return !!sk->sk_prot->memory_allocated;
> 737 }
> 738
> 739 static inline int sk_wmem_schedule(struct sock *sk, int size)
> 740 {
>
> this cannot be it - unless sk_prot somehow ends up being dirtied or
> false-shared?
>
> Still, my guess would be on ffffffff804c1009 and a
> sk_prot->memory_allocated cachemiss: look at how this instruction uses
> %ebp, and the one that shows the many hits in skb_release_data()
> pushes %ebp to the stack - that's where the CPU's OOO trick ends: it
> has to compute the result and serialize on the cachemiss.
>

I did some investigation on this part (memory_allocated) and discovered
that UDP had a problem, not TCP (and tbench):

commit 270acefafeb74ce2fe93d35b75733870bf1e11e7

net: sk_free_datagram() should use sk_mem_reclaim_partial()

I noticed a contention on udp_memory_allocated on regular UDP applications.

While tcp_memory_allocated is seldom used, it appears each incoming UDP frame
is currently touching udp_memory_allocated when queued, and when received by
application.

One possible solution is to use sk_mem_reclaim_partial() instead of
sk_mem_reclaim(), so that we keep a small reserve (less than one page)
of memory for each UDP socket.

We did something very similar on TCP side in commit
9993e7d313e80bdc005d09c7def91903e0068f07
([TCP]: Do not purge sk_forward_alloc entirely in tcp_delack_timer())

A more complex solution would need to convert prot->memory_allocated to
use a percpu_counter with batches of 64 or 128 pages.

Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
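
The percpu_counter conversion mentioned above would look roughly like
this (a sketch against the 2.6-era percpu_counter API; the counter
name, batch size and helper names are assumptions):

#include <linux/percpu_counter.h>

static struct percpu_counter udp_memory_allocated_pcpu;

static int __init udp_mem_counter_init(void)
{
        return percpu_counter_init(&udp_memory_allocated_pcpu, 0);
}

/* charges accumulate in a per-CPU slot and only touch the shared
 * counter once the local delta exceeds the batch, so the global
 * cacheline is dirtied far less often than with a plain atomic_t */
static void udp_mem_charge(s64 pages)
{
        __percpu_counter_add(&udp_memory_allocated_pcpu, pages, 64);
}

static s64 udp_mem_read(void)
{
        return percpu_counter_read(&udp_memory_allocated_pcpu);
}

The batch trades read accuracy of the global value for far fewer
writes to the shared cacheline.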

2008-11-17 20:58:24

by Eric Dumazet

[permalink] [raw]
Subject: Re: ip_queue_xmit(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> 100.000000 total
>> ................
>> 3.356152 ip_queue_xmit
>
> hits (335615 total)
> .........
> ffffffff804b7045: 1001 <ip_queue_xmit>:
> ffffffff804b7045: 1001 41 57 push %r15
> ffffffff804b7047: 36698 41 56 push %r14
> ffffffff804b7049: 0 49 89 fe mov %rdi,%r14
> ffffffff804b704c: 0 41 55 push %r13
> ffffffff804b704e: 447 41 54 push %r12
> ffffffff804b7050: 0 55 push %rbp
> ffffffff804b7051: 4 53 push %rbx
> ffffffff804b7052: 465 48 83 ec 68 sub $0x68,%rsp
> ffffffff804b7056: 1 89 74 24 08 mov %esi,0x8(%rsp)
> ffffffff804b705a: 486 48 8b 47 28 mov 0x28(%rdi),%rax
> ffffffff804b705e: 0 48 8b 6f 10 mov 0x10(%rdi),%rbp
> ffffffff804b7062: 7 48 85 c0 test %rax,%rax
> ffffffff804b7065: 480 48 89 44 24 58 mov %rax,0x58(%rsp)
> ffffffff804b706a: 0 4c 8b bd 48 02 00 00 mov 0x248(%rbp),%r15
> ffffffff804b7071: 7 0f 85 0d 01 00 00 jne ffffffff804b7184 <ip_queue_xmit+0x13f>
> ffffffff804b7077: 452 31 f6 xor %esi,%esi
> ffffffff804b7079: 0 48 89 ef mov %rbp,%rdi
> ffffffff804b707c: 5 e8 c1 eb fc ff callq ffffffff80485c42 <__sk_dst_check>
> ffffffff804b7081: 434 48 85 c0 test %rax,%rax
> ffffffff804b7084: 54 48 89 44 24 58 mov %rax,0x58(%rsp)
> ffffffff804b7089: 0 0f 85 e0 00 00 00 jne ffffffff804b716f <ip_queue_xmit+0x12a>
> ffffffff804b708f: 0 4d 85 ff test %r15,%r15
> ffffffff804b7092: 0 44 8b ad 30 02 00 00 mov 0x230(%rbp),%r13d
> ffffffff804b7099: 0 74 0a je ffffffff804b70a5 <ip_queue_xmit+0x60>
> ffffffff804b709b: 0 41 80 7f 05 00 cmpb $0x0,0x5(%r15)
> ffffffff804b70a0: 0 74 03 je ffffffff804b70a5 <ip_queue_xmit+0x60>
> ffffffff804b70a2: 0 45 8b 2f mov (%r15),%r13d
> ffffffff804b70a5: 0 8b 85 3c 02 00 00 mov 0x23c(%rbp),%eax
> ffffffff804b70ab: 0 48 8d b5 10 01 00 00 lea 0x110(%rbp),%rsi
> ffffffff804b70b2: 0 44 8b 65 04 mov 0x4(%rbp),%r12d
> ffffffff804b70b6: 0 bf 0d 00 00 00 mov $0xd,%edi
> ffffffff804b70bb: 0 89 44 24 0c mov %eax,0xc(%rsp)
> ffffffff804b70bf: 0 8a 9d 54 02 00 00 mov 0x254(%rbp),%bl
> ffffffff804b70c5: 0 e8 9a df ff ff callq ffffffff804b5064 <constant_test_bit>
> ffffffff804b70ca: 0 31 d2 xor %edx,%edx
> ffffffff804b70cc: 0 48 8d 7c 24 10 lea 0x10(%rsp),%rdi
> ffffffff804b70d1: 0 41 89 c3 mov %eax,%r11d
> ffffffff804b70d4: 0 fc cld
> ffffffff804b70d5: 0 89 d0 mov %edx,%eax
> ffffffff804b70d7: 0 b9 10 00 00 00 mov $0x10,%ecx
> ffffffff804b70dc: 0 44 8a 45 39 mov 0x39(%rbp),%r8b
> ffffffff804b70e0: 0 40 8a b5 57 02 00 00 mov 0x257(%rbp),%sil
> ffffffff804b70e7: 0 44 8b 8d 50 02 00 00 mov 0x250(%rbp),%r9d
> ffffffff804b70ee: 0 83 e3 1e and $0x1e,%ebx
> ffffffff804b70f1: 0 44 8b 95 38 02 00 00 mov 0x238(%rbp),%r10d
> ffffffff804b70f8: 0 44 09 db or %r11d,%ebx
> ffffffff804b70fb: 0 f3 ab rep stos %eax,%es:(%rdi)
> ffffffff804b70fd: 0 40 c0 ee 05 shr $0x5,%sil
> ffffffff804b7101: 0 88 5c 24 24 mov %bl,0x24(%rsp)
> ffffffff804b7105: 0 48 8d 5c 24 10 lea 0x10(%rsp),%rbx
> ffffffff804b710a: 0 83 e6 01 and $0x1,%esi
> ffffffff804b710d: 0 48 89 ef mov %rbp,%rdi
> ffffffff804b7110: 0 44 88 44 24 40 mov %r8b,0x40(%rsp)
> ffffffff804b7115: 0 8b 44 24 0c mov 0xc(%rsp),%eax
> ffffffff804b7119: 0 40 88 74 24 41 mov %sil,0x41(%rsp)
> ffffffff804b711e: 0 48 89 de mov %rbx,%rsi
> ffffffff804b7121: 0 66 44 89 4c 24 44 mov %r9w,0x44(%rsp)
> ffffffff804b7127: 0 66 44 89 54 24 46 mov %r10w,0x46(%rsp)
> ffffffff804b712d: 0 44 89 64 24 10 mov %r12d,0x10(%rsp)
> ffffffff804b7132: 0 44 89 6c 24 1c mov %r13d,0x1c(%rsp)
> ffffffff804b7137: 0 89 44 24 20 mov %eax,0x20(%rsp)
> ffffffff804b713b: 0 e8 2d 9f e5 ff callq ffffffff8031106d <security_sk_classify_flow>
> ffffffff804b7140: 0 48 8d 74 24 58 lea 0x58(%rsp),%rsi
> ffffffff804b7145: 0 45 31 c0 xor %r8d,%r8d
> ffffffff804b7148: 0 48 89 e9 mov %rbp,%rcx
> ffffffff804b714b: 0 48 89 da mov %rbx,%rdx
> ffffffff804b714e: 0 48 c7 c7 d0 15 ab 80 mov $0xffffffff80ab15d0,%rdi
> ffffffff804b7155: 0 e8 1a 91 ff ff callq ffffffff804b0274 <ip_route_output_flow>
> ffffffff804b715a: 0 85 c0 test %eax,%eax
> ffffffff804b715c: 0 0f 85 9f 01 00 00 jne ffffffff804b7301 <ip_queue_xmit+0x2bc>
> ffffffff804b7162: 0 48 8b 74 24 58 mov 0x58(%rsp),%rsi
> ffffffff804b7167: 0 48 89 ef mov %rbp,%rdi
> ffffffff804b716a: 0 e8 a8 eb fc ff callq ffffffff80485d17 <sk_setup_caps>
> ffffffff804b716f: 441 48 8b 44 24 58 mov 0x58(%rsp),%rax
> ffffffff804b7174: 1388 48 85 c0 test %rax,%rax
> ffffffff804b7177: 0 74 07 je ffffffff804b7180 <ip_queue_xmit+0x13b>
> ffffffff804b7179: 0 f0 ff 80 b0 00 00 00 lock incl 0xb0(%rax)
> ffffffff804b7180: 556 49 89 46 28 mov %rax,0x28(%r14)
> ffffffff804b7184: 8351 4d 85 ff test %r15,%r15
> ffffffff804b7187: 0 be 14 00 00 00 mov $0x14,%esi
> ffffffff804b718c: 461 74 26 je ffffffff804b71b4 <ip_queue_xmit+0x16f>
> ffffffff804b718e: 0 41 f6 47 08 01 testb $0x1,0x8(%r15)
> ffffffff804b7193: 0 74 17 je ffffffff804b71ac <ip_queue_xmit+0x167>
> ffffffff804b7195: 0 48 8b 54 24 58 mov 0x58(%rsp),%rdx
> ffffffff804b719a: 0 8b 82 28 01 00 00 mov 0x128(%rdx),%eax
> ffffffff804b71a0: 0 39 82 1c 01 00 00 cmp %eax,0x11c(%rdx)
> ffffffff804b71a6: 0 0f 85 55 01 00 00 jne ffffffff804b7301 <ip_queue_xmit+0x2bc>
> ffffffff804b71ac: 0 41 0f b6 47 04 movzbl 0x4(%r15),%eax
> ffffffff804b71b1: 0 8d 70 14 lea 0x14(%rax),%esi
> ffffffff804b71b4: 39 4c 89 f7 mov %r14,%rdi
> ffffffff804b71b7: 493 e8 f8 18 fd ff callq ffffffff80488ab4 <skb_push>
> ffffffff804b71bc: 0 4c 89 f7 mov %r14,%rdi
> ffffffff804b71bf: 1701 e8 99 df ff ff callq ffffffff804b515d <skb_reset_network_header>
> ffffffff804b71c4: 481 0f b6 85 54 02 00 00 movzbl 0x254(%rbp),%eax
> ffffffff804b71cb: 4202 41 8b 9e bc 00 00 00 mov 0xbc(%r14),%ebx
> ffffffff804b71d2: 3 48 89 ef mov %rbp,%rdi
> ffffffff804b71d5: 0 49 03 9e d0 00 00 00 add 0xd0(%r14),%rbx
> ffffffff804b71dc: 466 80 cc 45 or $0x45,%ah
> ffffffff804b71df: 7 66 c1 c0 08 rol $0x8,%ax
> ffffffff804b71e3: 0 66 89 03 mov %ax,(%rbx)
> ffffffff804b71e6: 492 48 8b 74 24 58 mov 0x58(%rsp),%rsi
> ffffffff804b71eb: 3 e8 a0 df ff ff callq ffffffff804b5190 <ip_dont_fragment>
> ffffffff804b71f0: 1405 85 c0 test %eax,%eax
> ffffffff804b71f2: 4391 74 0f je ffffffff804b7203 <ip_queue_xmit+0x1be>
> ffffffff804b71f4: 0 83 7c 24 08 00 cmpl $0x0,0x8(%rsp)
> ffffffff804b71f9: 417 75 08 jne ffffffff804b7203 <ip_queue_xmit+0x1be>
> ffffffff804b71fb: 503 66 c7 43 06 40 00 movw $0x40,0x6(%rbx)
> ffffffff804b7201: 6743 eb 06 jmp ffffffff804b7209 <ip_queue_xmit+0x1c4>
> ffffffff804b7203: 0 66 c7 43 06 00 00 movw $0x0,0x6(%rbx)
> ffffffff804b7209: 118 0f bf 85 40 02 00 00 movswl 0x240(%rbp),%eax
> ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx
> ffffffff804b7215: 340 85 c0 test %eax,%eax
> ffffffff804b7217: 0 79 06 jns ffffffff804b721f <ip_queue_xmit+0x1da>
> ffffffff804b7219: 107464 8b 82 9c 00 00 00 mov 0x9c(%rdx),%eax
> ffffffff804b721f: 4963 88 43 08 mov %al,0x8(%rbx)
> ffffffff804b7222: 26297 8a 45 39 mov 0x39(%rbp),%al
> ffffffff804b7225: 76658 4d 85 ff test %r15,%r15
> ffffffff804b7228: 1712 88 43 09 mov %al,0x9(%rbx)
> ffffffff804b722b: 148 48 8b 44 24 58 mov 0x58(%rsp),%rax
> ffffffff804b7230: 2971 8b 80 20 01 00 00 mov 0x120(%rax),%eax
> ffffffff804b7236: 14849 89 43 0c mov %eax,0xc(%rbx)
> ffffffff804b7239: 84 48 8b 44 24 58 mov 0x58(%rsp),%rax
> ffffffff804b723e: 360 8b 80 1c 01 00 00 mov 0x11c(%rax),%eax
> ffffffff804b7244: 174 89 43 10 mov %eax,0x10(%rbx)
> ffffffff804b7247: 96 74 32 je ffffffff804b727b <ip_queue_xmit+0x236>
> ffffffff804b7249: 0 41 8a 57 04 mov 0x4(%r15),%dl
> ffffffff804b724d: 0 84 d2 test %dl,%dl
> ffffffff804b724f: 0 74 2a je ffffffff804b727b <ip_queue_xmit+0x236>
> ffffffff804b7251: 0 c0 ea 02 shr $0x2,%dl
> ffffffff804b7254: 0 03 13 add (%rbx),%edx
> ffffffff804b7256: 0 8a 03 mov (%rbx),%al
> ffffffff804b7258: 0 45 31 c0 xor %r8d,%r8d
> ffffffff804b725b: 0 4c 89 fe mov %r15,%rsi
> ffffffff804b725e: 0 4c 89 f7 mov %r14,%rdi
> ffffffff804b7261: 0 83 e0 f0 and $0xfffffffffffffff0,%eax
> ffffffff804b7264: 0 83 e2 0f and $0xf,%edx
> ffffffff804b7267: 0 09 d0 or %edx,%eax
> ffffffff804b7269: 0 88 03 mov %al,(%rbx)
> ffffffff804b726b: 0 48 8b 4c 24 58 mov 0x58(%rsp),%rcx
> ffffffff804b7270: 0 8b 95 30 02 00 00 mov 0x230(%rbp),%edx
> ffffffff804b7276: 0 e8 e4 d8 ff ff callq ffffffff804b4b5f <ip_options_build>
> ffffffff804b727b: 541 41 8b 86 c8 00 00 00 mov 0xc8(%r14),%eax
> ffffffff804b7282: 570 31 d2 xor %edx,%edx
> ffffffff804b7284: 0 49 03 86 d0 00 00 00 add 0xd0(%r14),%rax
> ffffffff804b728b: 34 8b 40 08 mov 0x8(%rax),%eax
> ffffffff804b728e: 496 66 85 c0 test %ax,%ax
> ffffffff804b7291: 11 74 06 je ffffffff804b7299 <ip_queue_xmit+0x254>
> ffffffff804b7293: 9 0f b7 c0 movzwl %ax,%eax
> ffffffff804b7296: 495 8d 50 ff lea -0x1(%rax),%edx
> ffffffff804b7299: 2 f6 43 06 40 testb $0x40,0x6(%rbx)
> ffffffff804b729d: 9 48 8b 74 24 58 mov 0x58(%rsp),%rsi
> ffffffff804b72a2: 497 74 34 je ffffffff804b72d8 <ip_queue_xmit+0x293>
> ffffffff804b72a4: 8 83 bd 30 02 00 00 00 cmpl $0x0,0x230(%rbp)
> ffffffff804b72ab: 10 74 23 je ffffffff804b72d0 <ip_queue_xmit+0x28b>
> ffffffff804b72ad: 1044 66 8b 85 52 02 00 00 mov 0x252(%rbp),%ax
> ffffffff804b72b4: 7 66 c1 c0 08 rol $0x8,%ax
> ffffffff804b72b8: 8 66 89 43 04 mov %ax,0x4(%rbx)
> ffffffff804b72bc: 432 66 8b 85 52 02 00 00 mov 0x252(%rbp),%ax
> ffffffff804b72c3: 9 ff c0 inc %eax
> ffffffff804b72c5: 14 01 d0 add %edx,%eax
> ffffffff804b72c7: 1141 66 89 85 52 02 00 00 mov %ax,0x252(%rbp)
> ffffffff804b72ce: 7 eb 10 jmp ffffffff804b72e0 <ip_queue_xmit+0x29b>
> ffffffff804b72d0: 0 66 c7 43 04 00 00 movw $0x0,0x4(%rbx)
> ffffffff804b72d6: 0 eb 08 jmp ffffffff804b72e0 <ip_queue_xmit+0x29b>
> ffffffff804b72d8: 0 48 89 df mov %rbx,%rdi
> ffffffff804b72db: 0 e8 b7 9d ff ff callq ffffffff804b1097 <__ip_select_ident>
> ffffffff804b72e0: 6 8b 85 54 01 00 00 mov 0x154(%rbp),%eax
> ffffffff804b72e6: 458 4c 89 f7 mov %r14,%rdi
> ffffffff804b72e9: 2 41 89 46 78 mov %eax,0x78(%r14)
> ffffffff804b72ed: 4 8b 85 f0 01 00 00 mov 0x1f0(%rbp),%eax
> ffffffff804b72f3: 841 41 89 86 b0 00 00 00 mov %eax,0xb0(%r14)
> ffffffff804b72fa: 11 e8 30 f2 ff ff callq ffffffff804b652f <ip_local_out>
> ffffffff804b72ff: 0 eb 44 jmp ffffffff804b7345 <ip_queue_xmit+0x300>
> ffffffff804b7301: 0 65 48 8b 04 25 10 00 mov %gs:0x10,%rax
> ffffffff804b7308: 0 00 00
> ffffffff804b730a: 0 8b 80 48 e0 ff ff mov -0x1fb8(%rax),%eax
> ffffffff804b7310: 0 4c 89 f7 mov %r14,%rdi
> ffffffff804b7313: 0 30 c0 xor %al,%al
> ffffffff804b7315: 0 66 83 f8 01 cmp $0x1,%ax
> ffffffff804b7319: 0 48 19 c0 sbb %rax,%rax
> ffffffff804b731c: 0 83 e0 08 and $0x8,%eax
> ffffffff804b731f: 0 48 8b 90 a8 16 ab 80 mov -0x7f54e958(%rax),%rdx
> ffffffff804b7326: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax
> ffffffff804b732d: 0 00
> ffffffff804b732e: 0 89 c0 mov %eax,%eax
> ffffffff804b7330: 0 48 f7 d2 not %rdx
> ffffffff804b7333: 0 48 8b 04 c2 mov (%rdx,%rax,8),%rax
> ffffffff804b7337: 0 48 ff 40 68 incq 0x68(%rax)
> ffffffff804b733b: 0 e8 b1 18 fd ff callq ffffffff80488bf1 <kfree_skb>
> ffffffff804b7340: 0 b8 8f ff ff ff mov $0xffffff8f,%eax
> ffffffff804b7345: 9196 48 83 c4 68 add $0x68,%rsp
> ffffffff804b7349: 892 5b pop %rbx
> ffffffff804b734a: 0 5d pop %rbp
> ffffffff804b734b: 488 41 5c pop %r12
> ffffffff804b734d: 0 41 5d pop %r13
> ffffffff804b734f: 0 41 5e pop %r14
> ffffffff804b7351: 513 41 5f pop %r15
> ffffffff804b7353: 0 c3 retq
>
> about 10% of this function's cost is artificial:
>
> ffffffff804b7045: 1001 <ip_queue_xmit>:
> ffffffff804b7045: 1001 41 57 push %r15
> ffffffff804b7047: 36698 41 56 push %r14
>
> there are profiler hits that leaked in via out-of-order execution from
> the callsites. The callsites are hard to map unfortunately, as this
> function is called via function pointers.
>
> the most likely callsite is tcp_transmit_skb().
>
> 30% of the overhead of this function comes from:
>
> ffffffff804b7203: 0 66 c7 43 06 00 00 movw $0x0,0x6(%rbx)
> ffffffff804b7209: 118 0f bf 85 40 02 00 00 movswl 0x240(%rbp),%eax
> ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx
> ffffffff804b7215: 340 85 c0 test %eax,%eax
> ffffffff804b7217: 0 79 06 jns ffffffff804b721f <ip_queue_xmit+0x1da>
> ffffffff804b7219: 107464 8b 82 9c 00 00 00 mov 0x9c(%rdx),%eax
> ffffffff804b721f: 4963 88 43 08 mov %al,0x8(%rbx)
>
> the 16-bit movw looks a bit weird. It comes from line 372:
>
> 0xffffffff804b7203 is in ip_queue_xmit (net/ipv4/ip_output.c:372).
> 367 iph = ip_hdr(skb);
> 368 *((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
> 369 if (ip_dont_fragment(sk, &rt->u.dst) && !ipfragok)
> 370 iph->frag_off = htons(IP_DF);
> 371 else
> 372 iph->frag_off = 0;
> 373 iph->ttl = ip_select_ttl(inet, &rt->u.dst);
> 374 iph->protocol = sk->sk_protocol;
> 375 iph->saddr = rt->rt_src;
> 376 iph->daddr = rt->rt_dst;
>
> i.e. the IP header's fragment field being set to zero.
>
> 16-bit ops are an on-off love/hate affair on x86 CPUs. The trend is
> towards eliminating them as much as possible.
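>
> [ Sketch only, untested, and not directly applicable here since
>   iph->id is only filled in later: one generic way to get rid of
>   such a movw is to fuse two adjacent 16-bit header fields into a
>   single 32-bit store, so the compiler never emits a 16-bit op at
>   all. Little-endian specific:
>
>   /* needs <linux/ip.h> and <linux/string.h>; id sits at offset 4
>    * and frag_off at offset 6, so one 32-bit store covers both */
>   static inline void ip_store_id_fragoff(struct iphdr *iph, __be16 id,
>                                          __be16 frag_off)
>   {
>           u32 word = (u16)id | ((u32)(u16)frag_off << 16);
>
>           memcpy(&iph->id, &word, sizeof(word));
>   }
>
>   gcc turns the memcpy() into a single 32-bit mov. ]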
>
> _But_, the real overhead probably comes from:
>
> ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx
>
> which is the next line, the ttl field:
>
> 373 iph->ttl = ip_select_ttl(inet, &rt->u.dst);
>
> this shows that we are doing a hard cachemiss on the net-localhost
> route dst structure cacheline. We do a plain load instruction from it
> here and get a hefty cachemiss. (because 16 CPUs are banging on that
> single route)
>
> And let's make sure we see this in perspective as well: that single
> cachemiss is _1.0 percent_ of the total tbench cost. (!) We could make
> the scheduler 10% slower straight away and it would have less of a
> real-life effect than this single iph->ttl field setting.
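>
> [ Sketch only, not a tested patch: one mitigation that follows from
>   this observation is to start fetching that metrics cacheline as
>   soon as the route is resolved, so the miss overlaps with building
>   the rest of the header. Field names assumed from the sources
>   quoted above (rt->u.dst, dst_entry.metrics[]):
>
>   #include <linux/prefetch.h>
>   #include <net/route.h>
>
>   /* warm the read-mostly dst metrics line early, so the later
>    * ip_select_ttl() load does not take the full cachemiss */
>   static inline void ip_prefetch_dst_metrics(struct rtable *rt)
>   {
>           prefetch(rt->u.dst.metrics);
>   }
>
>   whether this buys anything depends on how much independent work
>   sits between route resolution and the ttl store. ]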
>

If you applied my patch against dst_entry, then you should not get any cache
line miss when accessing the first and second cache lines of dst_entry, which are
mostly read (and contain all the metrics, like ttl at offset 0x58). Or something is
really wrong...

Now if your CPU cache is blown away by the huge send()/receive() calls done
by tbench, we are stuck of course.

I don't know what you want to prove here. We already have one dst_entry per route in
the rt cache, and it can already consume a *lot* of RAM if you have 1 million entries
in the rt cache.

tbench is mostly a network benchmark (and one using the loopback device), so it's no
surprise that it can stress the network part of the kernel :)


2008-11-17 20:58:39

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Linus Torvalds <[email protected]>
Date: Mon, 17 Nov 2008 12:30:00 -0800 (PST)

> On Mon, 17 Nov 2008, David Miller wrote:
> >
> > It's on my workstation which is a much simpler 2 processor
> > UltraSPARC-IIIi (1.5Ghz) system.
>
> Ok. It could easily be something like a cache footprint issue. And while I
> don't know my sparc CPUs very well, I think the UltraSPARC-IIIi is
> superscalar but does no out-of-order execution or speculation, no?

It does only very simple speculation, but your description is accurate.

> So I could easily see that the indirect branches in the scheduler
> hurt much more, and might explain why the x86 profile looks so
> different.

Right.

> One thing that non-NMI profiles also tend to show is "clumping", which in
> turn tends to rather excessively pinpoint code sequences that release the
> irq flag - just because those points show up in profiles, rather than
> being a spread-out mush. So it's possible that Ingo's profile did show the
> scheduler more, but it was in the form of much more spread out "noise"
> rather than the single spike you saw.

Sure.

2008-11-17 21:01:46

by David Miller

[permalink] [raw]
Subject: Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Ingo Molnar <[email protected]>
Date: Mon, 17 Nov 2008 21:55:30 +0100

> and ouch does that global dec on &nf_bridge->use hurt!

nf_bridge should always be NULL on your system.

2008-11-17 21:09:40

by Eric Dumazet

[permalink] [raw]
Subject: Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Ingo Molnar wrote:
> (gdb) list *0xffffffff8048942e
> 0xffffffff8048942e is in skb_release_head_state (include/linux/skbuff.h:1783).
> 1778 }
> 1779 #endif
> 1780 #ifdef CONFIG_BRIDGE_NETFILTER
> 1781 static inline void nf_bridge_put(struct nf_bridge_info *nf_bridge)
> 1782 {
> 1783 if (nf_bridge && atomic_dec_and_test(&nf_bridge->use))
> 1784 kfree(nf_bridge);
> 1785 }
> 1786 static inline void nf_bridge_get(struct nf_bridge_info *nf_bridge)
> 1787 {
>
> and ouch does that global dec on &nf_bridge->use hurt!
>
> i do have:
>
> CONFIG_BRIDGE_NETFILTER=y
>
> (this is a Fedora distro kernel derived .config)

Hmm, you should also hit this cache line at the atomic_inc() site then...

Strange, I never caught this one.
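For reference, a sketch of why such a shared counter hurts (generic
illustration, not the nf_bridge code): every lock-prefixed RMW takes the
cacheline exclusive, so when the inc and the dec run on different CPUs,
each op pays a coherence round-trip instead of an L1 hit:

	/* paired refcount ops on one atomic_t: if get() and put() run on
	 * different CPUs, the line ping-pongs between their caches */
	static inline void demo_get(atomic_t *use)
	{
		atomic_inc(use);		/* lock incl: takes the line exclusive */
	}

	static inline int demo_put(atomic_t *use)
	{
		return atomic_dec_and_test(use);	/* lock decl: steals it back */
	}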

2008-11-17 21:10:49

by Ingo Molnar

[permalink] [raw]
Subject: tcp_ack(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> 100.000000 total
> ................
> 1.997533 tcp_ack

hits (total: 199753)
.........
ffffffff804c0b17: 452 <tcp_ack>:
ffffffff804c0b17: 452 41 57 push %r15
ffffffff804c0b19: 9569 41 56 push %r14
ffffffff804c0b1b: 0 41 55 push %r13
ffffffff804c0b1d: 0 49 89 f5 mov %rsi,%r13
ffffffff804c0b20: 493 41 54 push %r12
ffffffff804c0b22: 104 41 89 d4 mov %edx,%r12d
ffffffff804c0b25: 0 55 push %rbp
ffffffff804c0b26: 425 48 89 fd mov %rdi,%rbp
ffffffff804c0b29: 21 53 push %rbx
ffffffff804c0b2a: 0 48 81 ec 88 00 00 00 sub $0x88,%rsp
ffffffff804c0b31: 445 8b 87 00 04 00 00 mov 0x400(%rdi),%eax
ffffffff804c0b37: 0 89 44 24 18 mov %eax,0x18(%rsp)
ffffffff804c0b3b: 443 48 8d 46 38 lea 0x38(%rsi),%rax
ffffffff804c0b3f: 18 8b 50 28 mov 0x28(%rax),%edx
ffffffff804c0b42: 2565 44 8b 70 18 mov 0x18(%rax),%r14d
ffffffff804c0b46: 358 89 54 24 1c mov %edx,0x1c(%rsp)
ffffffff804c0b4a: 2 39 97 fc 03 00 00 cmp %edx,0x3fc(%rdi)
ffffffff804c0b50: 368 0f 88 af 13 00 00 js ffffffff804c1f05 <tcp_ack+0x13ee>
ffffffff804c0b56: 106 89 d1 mov %edx,%ecx
ffffffff804c0b58: 2 2b 4c 24 18 sub 0x18(%rsp),%ecx
ffffffff804c0b5c: 328 0f 88 83 13 00 00 js ffffffff804c1ee5 <tcp_ack+0x13ce>
ffffffff804c0b62: 1440 8b 44 24 18 mov 0x18(%rsp),%eax
ffffffff804c0b66: 2 29 d0 sub %edx,%eax
ffffffff804c0b68: 77 44 89 e2 mov %r12d,%edx
ffffffff804c0b6b: 398 89 c6 mov %eax,%esi
ffffffff804c0b6d: 3 80 ce 04 or $0x4,%dh
ffffffff804c0b70: 65 c1 ee 1f shr $0x1f,%esi
ffffffff804c0b73: 362 44 0f 45 e2 cmovne %edx,%r12d
ffffffff804c0b77: 1 83 3d ea 78 3f 00 00 cmpl $0x0,0x3f78ea(%rip) # ffffffff808b8468 <sysctl_tcp_abc>
ffffffff804c0b7e: 64 74 27 je ffffffff804c0ba7 <tcp_ack+0x90>
ffffffff804c0b80: 0 8a 87 78 03 00 00 mov 0x378(%rdi),%al
ffffffff804c0b86: 0 3c 01 cmp $0x1,%al
ffffffff804c0b88: 0 77 08 ja ffffffff804c0b92 <tcp_ack+0x7b>
ffffffff804c0b8a: 0 01 8f dc 04 00 00 add %ecx,0x4dc(%rdi)
ffffffff804c0b90: 0 eb 15 jmp ffffffff804c0ba7 <tcp_ack+0x90>
ffffffff804c0b92: 0 3c 04 cmp $0x4,%al
ffffffff804c0b94: 0 75 11 jne ffffffff804c0ba7 <tcp_ack+0x90>
ffffffff804c0b96: 0 8b 87 4c 04 00 00 mov 0x44c(%rdi),%eax
ffffffff804c0b9c: 0 39 c1 cmp %eax,%ecx
ffffffff804c0b9e: 0 0f 46 c1 cmovbe %ecx,%eax
ffffffff804c0ba1: 0 01 87 dc 04 00 00 add %eax,0x4dc(%rdi)
ffffffff804c0ba7: 377 8b 9d d4 04 00 00 mov 0x4d4(%rbp),%ebx
ffffffff804c0bad: 3672 41 f7 c4 00 01 00 00 test $0x100,%r12d
ffffffff804c0bb4: 282 89 5c 24 20 mov %ebx,0x20(%rsp)
ffffffff804c0bb8: 0 8b 85 74 04 00 00 mov 0x474(%rbp),%eax
ffffffff804c0bbe: 140 89 44 24 30 mov %eax,0x30(%rsp)
ffffffff804c0bc2: 7592 8b 95 d0 04 00 00 mov 0x4d0(%rbp),%edx
ffffffff804c0bc8: 1580 89 54 24 24 mov %edx,0x24(%rsp)
ffffffff804c0bcc: 3 8b 9d cc 04 00 00 mov 0x4cc(%rbp),%ebx
ffffffff804c0bd2: 58 89 5c 24 28 mov %ebx,0x28(%rsp)
ffffffff804c0bd6: 419 8b 85 78 04 00 00 mov 0x478(%rbp),%eax
ffffffff804c0bdc: 0 89 44 24 2c mov %eax,0x2c(%rsp)
ffffffff804c0be0: 65 75 4f jne ffffffff804c0c31 <tcp_ack+0x11a>
ffffffff804c0be2: 423 85 f6 test %esi,%esi
ffffffff804c0be4: 55 74 4b je ffffffff804c0c31 <tcp_ack+0x11a>
ffffffff804c0be6: 36 44 89 b5 40 04 00 00 mov %r14d,0x440(%rbp)
ffffffff804c0bed: 368 8b 54 24 1c mov 0x1c(%rsp),%edx
ffffffff804c0bf1: 4 41 83 cc 02 or $0x2,%r12d
ffffffff804c0bf5: 32 be 05 00 00 00 mov $0x5,%esi
ffffffff804c0bfa: 392 48 89 ef mov %rbp,%rdi
ffffffff804c0bfd: 4 89 95 00 04 00 00 mov %edx,0x400(%rbp)
ffffffff804c0c03: 3341 44 89 64 24 5c mov %r12d,0x5c(%rsp)
ffffffff804c0c08: 855 e8 98 dc ff ff callq ffffffff804be8a5 <tcp_ca_event>
ffffffff804c0c0d: 2018 48 8b 05 a4 0a 5f 00 mov 0x5f0aa4(%rip),%rax # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c0c14: 858 65 8b 14 25 24 00 00 mov %gs:0x24,%edx
ffffffff804c0c1b: 0 00
ffffffff804c0c1c: 0 89 d2 mov %edx,%edx
ffffffff804c0c1e: 0 48 f7 d0 not %rax
ffffffff804c0c21: 425 48 8b 04 d0 mov (%rax,%rdx,8),%rax
ffffffff804c0c25: 0 48 ff 80 e8 00 00 00 incq 0xe8(%rax)
ffffffff804c0c2c: 0 e9 1b 01 00 00 jmpq ffffffff804c0d4c <tcp_ack+0x235>
ffffffff804c0c31: 41 45 3b 75 54 cmp 0x54(%r13),%r14d
ffffffff804c0c35: 360 74 06 je ffffffff804c0c3d <tcp_ack+0x126>
ffffffff804c0c37: 1 41 83 cc 01 or $0x1,%r12d
ffffffff804c0c3b: 80 eb 1f jmp ffffffff804c0c5c <tcp_ack+0x145>
ffffffff804c0c3d: 1 48 8b 05 74 0a 5f 00 mov 0x5f0a74(%rip),%rax # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c0c44: 303 65 8b 14 25 24 00 00 mov %gs:0x24,%edx
ffffffff804c0c4b: 0 00
ffffffff804c0c4c: 56 89 d2 mov %edx,%edx
ffffffff804c0c4e: 0 48 f7 d0 not %rax
ffffffff804c0c51: 4 48 8b 04 d0 mov (%rax,%rdx,8),%rax
ffffffff804c0c55: 13 48 ff 80 e0 00 00 00 incq 0xe0(%rax)
ffffffff804c0c5c: 12 41 8b 95 b8 00 00 00 mov 0xb8(%r13),%edx
ffffffff804c0c63: 300 49 03 95 d0 00 00 00 add 0xd0(%r13),%rdx
ffffffff804c0c6a: 17 66 8b 42 0e mov 0xe(%rdx),%ax
ffffffff804c0c6e: 0 66 c1 c0 08 rol $0x8,%ax
ffffffff804c0c72: 22 f6 42 0d 02 testb $0x2,0xd(%rdx)
ffffffff804c0c76: 13 0f b7 d8 movzwl %ax,%ebx
ffffffff804c0c79: 0 75 0b jne ffffffff804c0c86 <tcp_ack+0x16f>
ffffffff804c0c7b: 26 8a 8d 9d 04 00 00 mov 0x49d(%rbp),%cl
ffffffff804c0c81: 343 83 e1 0f and $0xf,%ecx
ffffffff804c0c84: 0 d3 e3 shl %cl,%ebx
ffffffff804c0c86: 82 8b 74 24 1c mov 0x1c(%rsp),%esi
ffffffff804c0c8a: 18 44 89 f2 mov %r14d,%edx
ffffffff804c0c8d: 0 89 d9 mov %ebx,%ecx
ffffffff804c0c8f: 12 48 89 ef mov %rbp,%rdi
ffffffff804c0c92: 12 e8 47 e6 ff ff callq ffffffff804bf2de <tcp_may_update_window>
ffffffff804c0c97: 16 31 d2 xor %edx,%edx
ffffffff804c0c99: 66 85 c0 test %eax,%eax
ffffffff804c0c9b: 0 74 48 je ffffffff804c0ce5 <tcp_ack+0x1ce>
ffffffff804c0c9d: 12 39 9d 44 04 00 00 cmp %ebx,0x444(%rbp)
ffffffff804c0ca3: 29 44 89 b5 40 04 00 00 mov %r14d,0x440(%rbp)
ffffffff804c0caa: 0 74 34 je ffffffff804c0ce0 <tcp_ack+0x1c9>
ffffffff804c0cac: 7 89 9d 44 04 00 00 mov %ebx,0x444(%rbp)
ffffffff804c0cb2: 59 c7 85 ec 03 00 00 00 movl $0x0,0x3ec(%rbp)
ffffffff804c0cb9: 0 00 00 00
ffffffff804c0cbc: 0 48 89 ef mov %rbp,%rdi
ffffffff804c0cbf: 7 e8 13 e8 ff ff callq ffffffff804bf4d7 <tcp_fast_path_check>
ffffffff804c0cc4: 23 3b 9d 48 04 00 00 cmp 0x448(%rbp),%ebx
ffffffff804c0cca: 48 76 14 jbe ffffffff804c0ce0 <tcp_ack+0x1c9>
ffffffff804c0ccc: 0 8b b5 5c 03 00 00 mov 0x35c(%rbp),%esi
ffffffff804c0cd2: 0 89 9d 48 04 00 00 mov %ebx,0x448(%rbp)
ffffffff804c0cd8: 0 48 89 ef mov %rbp,%rdi
ffffffff804c0cdb: 0 e8 40 41 00 00 callq ffffffff804c4e20 <tcp_sync_mss>
ffffffff804c0ce0: 6 ba 02 00 00 00 mov $0x2,%edx
ffffffff804c0ce5: 141 8b 5c 24 1c mov 0x1c(%rsp),%ebx
ffffffff804c0ce9: 1 44 09 e2 or %r12d,%edx
ffffffff804c0cec: 3 89 9d 00 04 00 00 mov %ebx,0x400(%rbp)
ffffffff804c0cf2: 34 89 54 24 5c mov %edx,0x5c(%rsp)
ffffffff804c0cf6: 0 41 80 7d 5d 00 cmpb $0x0,0x5d(%r13)
ffffffff804c0cfb: 6 74 13 je ffffffff804c0d10 <tcp_ack+0x1f9>
ffffffff804c0cfd: 0 8b 54 24 18 mov 0x18(%rsp),%edx
ffffffff804c0d01: 0 4c 89 ee mov %r13,%rsi
ffffffff804c0d04: 0 48 89 ef mov %rbp,%rdi
ffffffff804c0d07: 0 e8 b4 f5 ff ff callq ffffffff804c02c0 <tcp_sacktag_write_queue>
ffffffff804c0d0c: 0 09 44 24 5c or %eax,0x5c(%rsp)
ffffffff804c0d10: 29 41 8b 85 b8 00 00 00 mov 0xb8(%r13),%eax
ffffffff804c0d17: 128 49 03 85 d0 00 00 00 add 0xd0(%r13),%rax
ffffffff804c0d1e: 0 8a 40 0d mov 0xd(%rax),%al
ffffffff804c0d21: 33 83 e0 42 and $0x42,%eax
ffffffff804c0d24: 0 3c 40 cmp $0x40,%al
ffffffff804c0d26: 0 75 17 jne ffffffff804c0d3f <tcp_ack+0x228>
ffffffff804c0d28: 0 8b 44 24 5c mov 0x5c(%rsp),%eax
ffffffff804c0d2c: 0 83 c8 40 or $0x40,%eax
ffffffff804c0d2f: 0 f6 85 7e 04 00 00 01 testb $0x1,0x47e(%rbp)
ffffffff804c0d36: 0 0f 44 44 24 5c cmove 0x5c(%rsp),%eax
ffffffff804c0d3b: 0 89 44 24 5c mov %eax,0x5c(%rsp)
ffffffff804c0d3f: 36 be 06 00 00 00 mov $0x6,%esi
ffffffff804c0d44: 167 48 89 ef mov %rbp,%rdi
ffffffff804c0d47: 1 e8 59 db ff ff callq ffffffff804be8a5 <tcp_ca_event>
ffffffff804c0d4c: 581 c7 85 48 01 00 00 00 movl $0x0,0x148(%rbp)
ffffffff804c0d53: 0 00 00 00
ffffffff804c0d56: 6076 c6 85 7d 03 00 00 00 movb $0x0,0x37d(%rbp)
ffffffff804c0d5d: 0 48 8b 05 1c 8b 3f 00 mov 0x3f8b1c(%rip),%rax # ffffffff808b9880 <jiffies>
ffffffff804c0d64: 443 89 85 08 04 00 00 mov %eax,0x408(%rbp)
ffffffff804c0d6a: 0 8b 85 74 04 00 00 mov 0x474(%rbp),%eax
ffffffff804c0d70: 0 85 c0 test %eax,%eax
ffffffff804c0d72: 845 89 44 24 14 mov %eax,0x14(%rsp)
ffffffff804c0d76: 0 0f 84 fb 10 00 00 je ffffffff804c1e77 <tcp_ack+0x1360>
ffffffff804c0d7c: 0 48 8b 05 fd 8a 3f 00 mov 0x3f8afd(%rip),%rax # ffffffff808b9880 <jiffies>
ffffffff804c0d83: 586 8b 54 24 14 mov 0x14(%rsp),%edx
ffffffff804c0d87: 1 41 83 cc ff or $0xffffffffffffffff,%r12d
ffffffff804c0d8b: 2 89 44 24 48 mov %eax,0x48(%rsp)
ffffffff804c0d8f: 879 89 54 24 34 mov %edx,0x34(%rsp)
ffffffff804c0d93: 1 8b 9d d0 04 00 00 mov 0x4d0(%rbp),%ebx
ffffffff804c0d99: 0 89 5c 24 40 mov %ebx,0x40(%rsp)
ffffffff804c0d9d: 889 e8 e2 e8 ff ff callq ffffffff804bf684 <net_invalid_timestamp>
ffffffff804c0da2: 0 48 89 44 24 08 mov %rax,0x8(%rsp)
ffffffff804c0da7: 16 48 8d 85 c0 00 00 00 lea 0xc0(%rbp),%rax
ffffffff804c0dae: 445 c7 44 24 44 01 00 00 movl $0x1,0x44(%rsp)
ffffffff804c0db5: 0 00
ffffffff804c0db6: 0 c7 44 24 50 00 00 00 movl $0x0,0x50(%rsp)
ffffffff804c0dbd: 0 00
ffffffff804c0dbe: 10 c7 44 24 38 00 00 00 movl $0x0,0x38(%rsp)
ffffffff804c0dc5: 0 00
ffffffff804c0dc6: 1308 44 89 64 24 4c mov %r12d,0x4c(%rsp)
ffffffff804c0dcb: 225 48 89 04 24 mov %rax,(%rsp)
ffffffff804c0dcf: 2 e9 8b 02 00 00 jmpq ffffffff804c105f <tcp_ack+0x548>
ffffffff804c0dd4: 488 4d 8d 7d 38 lea 0x38(%r13),%r15
ffffffff804c0dd8: 2298 41 8a 57 25 mov 0x25(%r15),%dl
ffffffff804c0ddc: 0 88 54 24 3f mov %dl,0x3f(%rsp)
ffffffff804c0de0: 6 41 8b 77 1c mov 0x1c(%r15),%esi
ffffffff804c0de4: 455 8b 95 00 04 00 00 mov 0x400(%rbp),%edx
ffffffff804c0dea: 3 49 8b 8d d0 00 00 00 mov 0xd0(%r13),%rcx
ffffffff804c0df1: 0 41 8b 85 c8 00 00 00 mov 0xc8(%r13),%eax
ffffffff804c0df8: 440 39 f2 cmp %esi,%edx
ffffffff804c0dfa: 0 79 6f jns ffffffff804c0e6b <tcp_ack+0x354>
ffffffff804c0dfc: 0 89 c0 mov %eax,%eax
ffffffff804c0dfe: 39 8b 5c 08 08 mov 0x8(%rax,%rcx,1),%ebx
ffffffff804c0e02: 0 66 83 fb 01 cmp $0x1,%bx
ffffffff804c0e06: 2 0f 84 77 02 00 00 je ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0e0c: 0 41 8b 47 18 mov 0x18(%r15),%eax
ffffffff804c0e10: 0 39 d0 cmp %edx,%eax
ffffffff804c0e12: 0 0f 89 6b 02 00 00 jns ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0e18: 0 29 c2 sub %eax,%edx
ffffffff804c0e1a: 0 4c 89 ee mov %r13,%rsi
ffffffff804c0e1d: 0 48 89 ef mov %rbp,%rdi
ffffffff804c0e20: 0 e8 8f 4f 00 00 callq ffffffff804c5db4 <tcp_trim_head>
ffffffff804c0e25: 0 85 c0 test %eax,%eax
ffffffff804c0e27: 0 0f 85 56 02 00 00 jne ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0e2d: 0 41 8b 85 c8 00 00 00 mov 0xc8(%r13),%eax
ffffffff804c0e34: 0 0f b7 d3 movzwl %bx,%edx
ffffffff804c0e37: 0 49 03 85 d0 00 00 00 add 0xd0(%r13),%rax
ffffffff804c0e3e: 0 41 89 d6 mov %edx,%r14d
ffffffff804c0e41: 0 8b 48 08 mov 0x8(%rax),%ecx
ffffffff804c0e44: 0 0f b7 c1 movzwl %cx,%eax
ffffffff804c0e47: 0 41 29 c6 sub %eax,%r14d
ffffffff804c0e4a: 0 0f 84 33 02 00 00 je ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0e50: 0 66 85 c9 test %cx,%cx
ffffffff804c0e53: 0 75 04 jne ffffffff804c0e59 <tcp_ack+0x342>
ffffffff804c0e55: 0 0f 0b ud2a
ffffffff804c0e57: 0 eb fe jmp ffffffff804c0e57 <tcp_ack+0x340>
ffffffff804c0e59: 0 41 8b 5f 1c mov 0x1c(%r15),%ebx
ffffffff804c0e5d: 0 41 39 5f 18 cmp %ebx,0x18(%r15)
ffffffff804c0e61: 0 0f 88 d6 10 00 00 js ffffffff804c1f3d <tcp_ack+0x1426>
ffffffff804c0e67: 0 0f 0b ud2a
ffffffff804c0e69: 0 eb fe jmp ffffffff804c0e69 <tcp_ack+0x352>
ffffffff804c0e6b: 0 83 7c 24 44 00 cmpl $0x0,0x44(%rsp)
ffffffff804c0e70: 6326 89 c0 mov %eax,%eax
ffffffff804c0e72: 348 44 0f b7 74 08 08 movzwl 0x8(%rax,%rcx,1),%r14d
ffffffff804c0e78: 0 0f 84 8f 00 00 00 je ffffffff804c0f0d <tcp_ack+0x3f6>
ffffffff804c0e7e: 132 83 bd a4 03 00 00 00 cmpl $0x0,0x3a4(%rbp)
ffffffff804c0e85: 5840 0f 84 82 00 00 00 je ffffffff804c0f0d <tcp_ack+0x3f6>
ffffffff804c0e8b: 0 3b b5 b4 05 00 00 cmp 0x5b4(%rbp),%esi
ffffffff804c0e91: 0 78 7a js ffffffff804c0f0d <tcp_ack+0x3f6>
ffffffff804c0e93: 0 48 89 ef mov %rbp,%rdi
ffffffff804c0e96: 0 e8 21 da ff ff callq ffffffff804be8bc <tcp_current_ssthresh>
ffffffff804c0e9b: 0 8b b5 4c 04 00 00 mov 0x44c(%rbp),%esi
ffffffff804c0ea1: 0 44 8b a5 ac 04 00 00 mov 0x4ac(%rbp),%r12d
ffffffff804c0ea8: 0 48 89 ef mov %rbp,%rdi
ffffffff804c0eab: 0 89 85 6c 05 00 00 mov %eax,0x56c(%rbp)
ffffffff804c0eb1: 0 e8 c7 3e 00 00 callq ffffffff804c4d7d <tcp_mss_to_mtu>
ffffffff804c0eb6: 0 8b 9d a4 03 00 00 mov 0x3a4(%rbp),%ebx
ffffffff804c0ebc: 0 31 d2 xor %edx,%edx
ffffffff804c0ebe: 0 c7 85 b0 04 00 00 00 movl $0x0,0x4b0(%rbp)
ffffffff804c0ec5: 0 00 00 00
ffffffff804c0ec8: 0 41 0f af c4 imul %r12d,%eax
ffffffff804c0ecc: 0 48 89 ef mov %rbp,%rdi
ffffffff804c0ecf: 0 f7 f3 div %ebx
ffffffff804c0ed1: 0 89 85 ac 04 00 00 mov %eax,0x4ac(%rbp)
ffffffff804c0ed7: 0 48 8b 05 a2 89 3f 00 mov 0x3f89a2(%rip),%rax # ffffffff808b9880 <jiffies>
ffffffff804c0ede: 0 89 85 bc 04 00 00 mov %eax,0x4bc(%rbp)
ffffffff804c0ee4: 0 e8 d3 d9 ff ff callq ffffffff804be8bc <tcp_current_ssthresh>
ffffffff804c0ee9: 0 8b b5 5c 03 00 00 mov 0x35c(%rbp),%esi
ffffffff804c0eef: 0 89 85 54 04 00 00 mov %eax,0x454(%rbp)
ffffffff804c0ef5: 0 48 89 ef mov %rbp,%rdi
ffffffff804c0ef8: 0 89 9d a0 03 00 00 mov %ebx,0x3a0(%rbp)
ffffffff804c0efe: 0 c7 85 a4 03 00 00 00 movl $0x0,0x3a4(%rbp)
ffffffff804c0f05: 0 00 00 00
ffffffff804c0f08: 0 e8 13 3f 00 00 callq ffffffff804c4e20 <tcp_sync_mss>
ffffffff804c0f0d: 945 0f b6 44 24 3f movzbl 0x3f(%rsp),%eax
ffffffff804c0f12: 6361 a8 82 test $0x82,%al
ffffffff804c0f14: 0 74 30 je ffffffff804c0f46 <tcp_ack+0x42f>
ffffffff804c0f16: 0 a8 02 test $0x2,%al
ffffffff804c0f18: 0 74 07 je ffffffff804c0f21 <tcp_ack+0x40a>
ffffffff804c0f1a: 0 44 29 b5 78 04 00 00 sub %r14d,0x478(%rbp)
ffffffff804c0f21: 0 83 4c 24 50 08 orl $0x8,0x50(%rsp)
ffffffff804c0f26: 0 f6 44 24 50 04 testb $0x4,0x50(%rsp)
ffffffff804c0f2b: 0 75 06 jne ffffffff804c0f33 <tcp_ack+0x41c>
ffffffff804c0f2d: 0 41 83 fe 01 cmp $0x1,%r14d
ffffffff804c0f31: 0 76 08 jbe ffffffff804c0f3b <tcp_ack+0x424>
ffffffff804c0f33: 0 81 4c 24 50 00 10 00 orl $0x1000,0x50(%rsp)
ffffffff804c0f3a: 0 00
ffffffff804c0f3b: 0 41 83 cc ff or $0xffffffffffffffff,%r12d
ffffffff804c0f3f: 0 44 89 64 24 4c mov %r12d,0x4c(%rsp)
ffffffff804c0f44: 0 eb 38 jmp ffffffff804c0f7e <tcp_ack+0x467>
ffffffff804c0f46: 0 44 8b 64 24 48 mov 0x48(%rsp),%r12d
ffffffff804c0f4b: 5837 45 2b 67 20 sub 0x20(%r15),%r12d
ffffffff804c0f4f: 1 83 7c 24 4c 00 cmpl $0x0,0x4c(%rsp)
ffffffff804c0f54: 167 8b 5c 24 4c mov 0x4c(%rsp),%ebx
ffffffff804c0f58: 514 49 8b 55 18 mov 0x18(%r13),%rdx
ffffffff804c0f5c: 0 41 0f 48 dc cmovs %r12d,%ebx
ffffffff804c0f60: 164 a8 01 test $0x1,%al
ffffffff804c0f62: 413 48 89 54 24 08 mov %rdx,0x8(%rsp)
ffffffff804c0f67: 0 89 5c 24 4c mov %ebx,0x4c(%rsp)
ffffffff804c0f6b: 148 75 11 jne ffffffff804c0f7e <tcp_ack+0x467>
ffffffff804c0f6d: 1608 8b 54 24 38 mov 0x38(%rsp),%edx
ffffffff804c0f71: 0 39 54 24 34 cmp %edx,0x34(%rsp)
ffffffff804c0f75: 272 0f 46 54 24 34 cmovbe 0x34(%rsp),%edx
ffffffff804c0f7a: 266 89 54 24 34 mov %edx,0x34(%rsp)
ffffffff804c0f7e: 0 a8 01 test $0x1,%al
ffffffff804c0f80: 164 74 07 je ffffffff804c0f89 <tcp_ack+0x472>
ffffffff804c0f82: 0 44 29 b5 d0 04 00 00 sub %r14d,0x4d0(%rbp)
ffffffff804c0f89: 3955 a8 04 test $0x4,%al
ffffffff804c0f8b: 8510 74 07 je ffffffff804c0f94 <tcp_ack+0x47d>
ffffffff804c0f8d: 0 44 29 b5 cc 04 00 00 sub %r14d,0x4cc(%rbp)
ffffffff804c0f94: 11 44 29 b5 74 04 00 00 sub %r14d,0x474(%rbp)
ffffffff804c0f9b: 1426 44 01 74 24 38 add %r14d,0x38(%rsp)
ffffffff804c0fa0: 6 41 f6 47 24 02 testb $0x2,0x24(%r15)
ffffffff804c0fa5: 548 75 07 jne ffffffff804c0fae <tcp_ack+0x497>
ffffffff804c0fa7: 2 83 4c 24 50 04 orl $0x4,0x50(%rsp)
ffffffff804c0fac: 0 eb 0f jmp ffffffff804c0fbd <tcp_ack+0x4a6>
ffffffff804c0fae: 0 83 4c 24 50 10 orl $0x10,0x50(%rsp)
ffffffff804c0fb3: 0 c7 85 74 05 00 00 00 movl $0x0,0x574(%rbp)
ffffffff804c0fba: 0 00 00 00
ffffffff804c0fbd: 517 83 7c 24 44 00 cmpl $0x0,0x44(%rsp)
ffffffff804c0fc2: 6012 0f 84 bb 00 00 00 je ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c0fc8: 1111 48 8b 34 24 mov (%rsp),%rsi
ffffffff804c0fcc: 0 4c 89 ef mov %r13,%rdi
ffffffff804c0fcf: 184 e8 0d d8 ff ff callq ffffffff804be7e1 <__skb_unlink>
ffffffff804c0fd4: 5 41 8b 45 68 mov 0x68(%r13),%eax
ffffffff804c0fd8: 517 05 e8 00 00 00 add $0xe8,%eax
ffffffff804c0fdd: 0 41 39 85 e0 00 00 00 cmp %eax,0xe0(%r13)
ffffffff804c0fe4: 31 7d 08 jge ffffffff804c0fee <tcp_ack+0x4d7>
ffffffff804c0fe6: 0 4c 89 ef mov %r13,%rdi
ffffffff804c0fe9: 0 e8 d4 66 fc ff callq ffffffff804876c2 <skb_truesize_bug>
ffffffff804c0fee: 1142 0f ba ad 10 01 00 00 btsl $0xe,0x110(%rbp)
ffffffff804c0ff5: 0 0e
ffffffff804c0ff6: 2576 8b 85 f0 00 00 00 mov 0xf0(%rbp),%eax
ffffffff804c0ffc: 433 41 2b 85 e0 00 00 00 sub 0xe0(%r13),%eax
ffffffff804c1003: 4843 89 85 f0 00 00 00 mov %eax,0xf0(%rbp)
ffffffff804c1009: 1730 48 8b 45 30 mov 0x30(%rbp),%rax
ffffffff804c100d: 311 41 8b 95 e0 00 00 00 mov 0xe0(%r13),%edx
ffffffff804c1014: 0 48 83 b8 b0 00 00 00 cmpq $0x0,0xb0(%rax)
ffffffff804c101b: 0 00
ffffffff804c101c: 418 74 06 je ffffffff804c1024 <tcp_ack+0x50d>
ffffffff804c101e: 37 01 95 f4 00 00 00 add %edx,0xf4(%rbp)
ffffffff804c1024: 2 4c 89 ef mov %r13,%rdi
ffffffff804c1027: 432 e8 56 7b fc ff callq ffffffff80488b82 <__kfree_skb>
ffffffff804c102c: 44 4c 3b ad f0 04 00 00 cmp 0x4f0(%rbp),%r13
ffffffff804c1033: 511 48 c7 85 e8 04 00 00 movq $0x0,0x4e8(%rbp)
ffffffff804c103a: 0 00 00 00 00
ffffffff804c103e: 1 75 0b jne ffffffff804c104b <tcp_ack+0x534>
ffffffff804c1040: 0 48 c7 85 f0 04 00 00 movq $0x0,0x4f0(%rbp)
ffffffff804c1047: 0 00 00 00 00
ffffffff804c104b: 0 4c 3b ad e0 04 00 00 cmp 0x4e0(%rbp),%r13
ffffffff804c1052: 518 75 0b jne ffffffff804c105f <tcp_ack+0x548>
ffffffff804c1054: 0 48 c7 85 e0 04 00 00 movq $0x0,0x4e0(%rbp)
ffffffff804c105b: 0 00 00 00 00
ffffffff804c105f: 439 4c 8b ad c0 00 00 00 mov 0xc0(%rbp),%r13
ffffffff804c1066: 5655 4c 3b 2c 24 cmp (%rsp),%r13
ffffffff804c106a: 0 75 05 jne ffffffff804c1071 <tcp_ack+0x55a>
ffffffff804c106c: 0 45 31 ed xor %r13d,%r13d
ffffffff804c106f: 810 eb 12 jmp ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c1071: 0 4d 85 ed test %r13,%r13
ffffffff804c1074: 2574 74 0d je ffffffff804c1083 <tcp_ack+0x56c>
ffffffff804c1076: 0 4c 3b ad d8 01 00 00 cmp 0x1d8(%rbp),%r13
ffffffff804c107d: 0 0f 85 51 fd ff ff jne ffffffff804c0dd4 <tcp_ack+0x2bd>
ffffffff804c1083: 454 8b 8d 00 04 00 00 mov 0x400(%rbp),%ecx
ffffffff804c1089: 497 8b 85 80 04 00 00 mov 0x480(%rbp),%eax
ffffffff804c108f: 0 2b 44 24 18 sub 0x18(%rsp),%eax
ffffffff804c1093: 0 89 ca mov %ecx,%edx
ffffffff804c1095: 534 2b 54 24 18 sub 0x18(%rsp),%edx
ffffffff804c1099: 0 39 c2 cmp %eax,%edx
ffffffff804c109b: 0 72 06 jb ffffffff804c10a3 <tcp_ack+0x58c>
ffffffff804c109d: 458 89 8d 80 04 00 00 mov %ecx,0x480(%rbp)
ffffffff804c10a3: 0 4d 85 ed test %r13,%r13
ffffffff804c10a6: 0 74 15 je ffffffff804c10bd <tcp_ack+0x5a6>
ffffffff804c10a8: 0 8b 44 24 50 mov 0x50(%rsp),%eax
ffffffff804c10ac: 2 80 cc 20 or $0x20,%ah
ffffffff804c10af: 3 41 f6 45 5d 01 testb $0x1,0x5d(%r13)
ffffffff804c10b4: 0 0f 44 44 24 50 cmove 0x50(%rsp),%eax
ffffffff804c10b9: 0 89 44 24 50 mov %eax,0x50(%rsp)
ffffffff804c10bd: 444 f6 44 24 50 14 testb $0x14,0x50(%rsp)
ffffffff804c10c2: 551 0f 84 e1 01 00 00 je ffffffff804c12a9 <tcp_ack+0x792>
ffffffff804c10c8: 1 f6 85 9c 04 00 00 01 testb $0x1,0x49c(%rbp)
ffffffff804c10cf: 2 48 8b 9d 60 03 00 00 mov 0x360(%rbp),%rbx
ffffffff804c10d6: 462 74 17 je ffffffff804c10ef <tcp_ack+0x5d8>
ffffffff804c10d8: 0 83 bd 98 04 00 00 00 cmpl $0x0,0x498(%rbp)
ffffffff804c10df: 0 74 0e je ffffffff804c10ef <tcp_ack+0x5d8>
ffffffff804c10e1: 451 8b 74 24 50 mov 0x50(%rsp),%esi
ffffffff804c10e5: 43 48 89 ef mov %rbp,%rdi
ffffffff804c10e8: 0 e8 ea e8 ff ff callq ffffffff804bf9d7 <tcp_ack_saw_tstamp>
ffffffff804c10ed: 66 eb 47 jmp ffffffff804c1136 <tcp_ack+0x61f>
ffffffff804c10ef: 0 83 7c 24 4c 00 cmpl $0x0,0x4c(%rsp)
ffffffff804c10f4: 0 78 40 js ffffffff804c1136 <tcp_ack+0x61f>
ffffffff804c10f6: 0 f6 44 24 50 08 testb $0x8,0x50(%rsp)
ffffffff804c10fb: 0 75 39 jne ffffffff804c1136 <tcp_ack+0x61f>
ffffffff804c10fd: 0 8b 74 24 4c mov 0x4c(%rsp),%esi
ffffffff804c1101: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1104: 0 e8 b5 e7 ff ff callq ffffffff804bf8be <tcp_rtt_estimator>
ffffffff804c1109: 0 8b 85 60 04 00 00 mov 0x460(%rbp),%eax
ffffffff804c110f: 0 c6 85 7b 03 00 00 00 movb $0x0,0x37b(%rbp)
ffffffff804c1116: 0 c1 e8 03 shr $0x3,%eax
ffffffff804c1119: 0 03 85 6c 04 00 00 add 0x46c(%rbp),%eax
ffffffff804c111f: 0 3d 30 75 00 00 cmp $0x7530,%eax
ffffffff804c1124: 0 89 85 58 03 00 00 mov %eax,0x358(%rbp)
ffffffff804c112a: 0 76 0a jbe ffffffff804c1136 <tcp_ack+0x61f>
ffffffff804c112c: 0 c7 85 58 03 00 00 30 movl $0x7530,0x358(%rbp)
ffffffff804c1133: 0 75 00 00
ffffffff804c1136: 732 83 bd 74 04 00 00 00 cmpl $0x0,0x474(%rbp)
ffffffff804c113d: 1833 75 0f jne ffffffff804c114e <tcp_ack+0x637>
ffffffff804c113f: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c1144: 493 48 89 ef mov %rbp,%rdi
ffffffff804c1147: 0 e8 07 d7 ff ff callq ffffffff804be853 <inet_csk_clear_xmit_timer>
ffffffff804c114c: 0 eb 18 jmp ffffffff804c1166 <tcp_ack+0x64f>
ffffffff804c114e: 0 8b 95 58 03 00 00 mov 0x358(%rbp),%edx
ffffffff804c1154: 0 b9 30 75 00 00 mov $0x7530,%ecx
ffffffff804c1159: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c115e: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1161: 0 e8 7d e4 ff ff callq ffffffff804bf5e3 <inet_csk_reset_xmit_timer>
ffffffff804c1166: 881 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c116c: 845 c0 e8 04 shr $0x4,%al
ffffffff804c116f: 1 75 63 jne ffffffff804c11d4 <tcp_ack+0x6bd>
ffffffff804c1171: 0 83 7c 24 38 00 cmpl $0x0,0x38(%rsp)
ffffffff804c1176: 0 7e 29 jle ffffffff804c11a1 <tcp_ack+0x68a>
ffffffff804c1178: 0 8b 44 24 38 mov 0x38(%rsp),%eax
ffffffff804c117c: 0 8b 95 d0 04 00 00 mov 0x4d0(%rbp),%edx
ffffffff804c1182: 0 ff c8 dec %eax
ffffffff804c1184: 0 39 d0 cmp %edx,%eax
ffffffff804c1186: 0 72 0c jb ffffffff804c1194 <tcp_ack+0x67d>
ffffffff804c1188: 0 c7 85 d0 04 00 00 00 movl $0x0,0x4d0(%rbp)
ffffffff804c118f: 0 00 00 00
ffffffff804c1192: 0 eb 0d jmp ffffffff804c11a1 <tcp_ack+0x68a>
ffffffff804c1194: 0 8d 42 01 lea 0x1(%rdx),%eax
ffffffff804c1197: 0 2b 44 24 38 sub 0x38(%rsp),%eax
ffffffff804c119b: 0 89 85 d0 04 00 00 mov %eax,0x4d0(%rbp)
ffffffff804c11a1: 0 8b 74 24 38 mov 0x38(%rsp),%esi
ffffffff804c11a5: 0 48 89 ef mov %rbp,%rdi
ffffffff804c11a8: 0 e8 2d dd ff ff callq ffffffff804beeda <tcp_check_reno_reordering>
ffffffff804c11ad: 0 8b 85 cc 04 00 00 mov 0x4cc(%rbp),%eax
ffffffff804c11b3: 0 03 85 d0 04 00 00 add 0x4d0(%rbp),%eax
ffffffff804c11b9: 0 3b 85 74 04 00 00 cmp 0x474(%rbp),%eax
ffffffff804c11bf: 0 76 5e jbe ffffffff804c121f <tcp_ack+0x708>
ffffffff804c11c1: 0 be b0 06 00 00 mov $0x6b0,%esi
ffffffff804c11c6: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c11cd: 0 e8 e3 4f d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c11d2: 0 eb 4b jmp ffffffff804c121f <tcp_ack+0x708>
ffffffff804c11d4: 414 8b 44 24 20 mov 0x20(%rsp),%eax
ffffffff804c11d8: 1591 39 44 24 34 cmp %eax,0x34(%rsp)
ffffffff804c11dc: 2 73 14 jae ffffffff804c11f2 <tcp_ack+0x6db>
ffffffff804c11de: 0 8b b5 d4 04 00 00 mov 0x4d4(%rbp),%esi
ffffffff804c11e4: 0 2b 74 24 34 sub 0x34(%rsp),%esi
ffffffff804c11e8: 0 31 d2 xor %edx,%edx
ffffffff804c11ea: 0 48 89 ef mov %rbp,%rdi
ffffffff804c11ed: 0 e8 9c db ff ff callq ffffffff804bed8e <tcp_update_reordering>
ffffffff804c11f2: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c11f8: 865 c0 e8 04 shr $0x4,%al
ffffffff804c11fb: 3 a8 02 test $0x2,%al
ffffffff804c11fd: 0 8b 85 60 05 00 00 mov 0x560(%rbp),%eax
ffffffff804c1203: 453 74 06 je ffffffff804c120b <tcp_ack+0x6f4>
ffffffff804c1205: 8 2b 44 24 38 sub 0x38(%rsp),%eax
ffffffff804c1209: 0 eb 0e jmp ffffffff804c1219 <tcp_ack+0x702>
ffffffff804c120b: 0 8b 95 d0 04 00 00 mov 0x4d0(%rbp),%edx
ffffffff804c1211: 0 29 54 24 40 sub %edx,0x40(%rsp)
ffffffff804c1215: 0 2b 44 24 40 sub 0x40(%rsp),%eax
ffffffff804c1219: 423 89 85 60 05 00 00 mov %eax,0x560(%rbp)
ffffffff804c121f: 492 8b 85 d4 04 00 00 mov 0x4d4(%rbp),%eax
ffffffff804c1225: 489 39 44 24 38 cmp %eax,0x38(%rsp)
ffffffff804c1229: 0 8b 54 24 38 mov 0x38(%rsp),%edx
ffffffff804c122d: 0 0f 47 d0 cmova %eax,%edx
ffffffff804c1230: 438 29 d0 sub %edx,%eax
ffffffff804c1232: 0 89 85 d4 04 00 00 mov %eax,0x4d4(%rbp)
ffffffff804c1238: 1 48 83 7b 58 00 cmpq $0x0,0x58(%rbx)
ffffffff804c123d: 446 74 6a je ffffffff804c12a9 <tcp_ack+0x792>
ffffffff804c123f: 0 f6 44 24 50 08 testb $0x8,0x50(%rsp)
ffffffff804c1244: 3 75 54 jne ffffffff804c129a <tcp_ack+0x783>
ffffffff804c1246: 441 f6 43 10 02 testb $0x2,0x10(%rbx)
ffffffff804c124a: 8 74 3f je ffffffff804c128b <tcp_ack+0x774>
ffffffff804c124c: 0 e8 33 e4 ff ff callq ffffffff804bf684 <net_invalid_timestamp>
ffffffff804c1251: 0 48 39 44 24 08 cmp %rax,0x8(%rsp)
ffffffff804c1256: 0 74 33 je ffffffff804c128b <tcp_ack+0x774>
ffffffff804c1258: 0 e8 17 8b d8 ff callq ffffffff80249d74 <ktime_get_real>
ffffffff804c125d: 0 48 89 c7 mov %rax,%rdi
ffffffff804c1260: 0 48 2b 7c 24 08 sub 0x8(%rsp),%rdi
ffffffff804c1265: 0 e8 e3 8e d7 ff callq ffffffff8023a14d <ns_to_timeval>
ffffffff804c126a: 0 48 89 44 24 60 mov %rax,0x60(%rsp)
ffffffff804c126f: 0 48 89 44 24 70 mov %rax,0x70(%rsp)
ffffffff804c1274: 0 48 69 c0 40 42 0f 00 imul $0xf4240,%rax,%rax
ffffffff804c127b: 0 48 89 54 24 78 mov %rdx,0x78(%rsp)
ffffffff804c1280: 0 48 89 54 24 68 mov %rdx,0x68(%rsp)
ffffffff804c1285: 0 03 44 24 78 add 0x78(%rsp),%eax
ffffffff804c1289: 0 eb 12 jmp ffffffff804c129d <tcp_ack+0x786>
ffffffff804c128b: 89 45 85 e4 test %r12d,%r12d
ffffffff804c128e: 414 7e 0a jle ffffffff804c129a <tcp_ack+0x783>
ffffffff804c1290: 0 49 63 fc movslq %r12d,%rdi
ffffffff804c1293: 65 e8 a8 8b d7 ff callq ffffffff80239e40 <jiffies_to_usecs>
ffffffff804c1298: 0 eb 03 jmp ffffffff804c129d <tcp_ack+0x786>
ffffffff804c129a: 0 83 c8 ff or $0xffffffffffffffff,%eax
ffffffff804c129d: 1136 89 c2 mov %eax,%edx
ffffffff804c129f: 7 8b 74 24 38 mov 0x38(%rsp),%esi
ffffffff804c12a3: 444 48 89 ef mov %rbp,%rdi
ffffffff804c12a6: 1 ff 53 58 callq *0x58(%rbx)
ffffffff804c12a9: 305 83 bd d0 04 00 00 00 cmpl $0x0,0x4d0(%rbp)
ffffffff804c12b0: 518 79 11 jns ffffffff804c12c3 <tcp_ack+0x7ac>
ffffffff804c12b2: 0 be ac 0b 00 00 mov $0xbac,%esi
ffffffff804c12b7: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c12be: 0 e8 f2 4e d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c12c3: 415 83 bd cc 04 00 00 00 cmpl $0x0,0x4cc(%rbp)
ffffffff804c12ca: 2204 79 11 jns ffffffff804c12dd <tcp_ack+0x7c6>
ffffffff804c12cc: 0 be ad 0b 00 00 mov $0xbad,%esi
ffffffff804c12d1: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c12d8: 0 e8 d8 4e d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c12dd: 0 83 bd 78 04 00 00 00 cmpl $0x0,0x478(%rbp)
ffffffff804c12e4: 1747 79 11 jns ffffffff804c12f7 <tcp_ack+0x7e0>
ffffffff804c12e6: 0 be ae 0b 00 00 mov $0xbae,%esi
ffffffff804c12eb: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c12f2: 0 e8 be 4e d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c12f7: 0 83 bd 74 04 00 00 00 cmpl $0x0,0x474(%rbp)
ffffffff804c12fe: 878 0f 85 86 00 00 00 jne ffffffff804c138a <tcp_ack+0x873>
ffffffff804c1304: 4721 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c130a: 968 c0 e8 04 shr $0x4,%al
ffffffff804c130d: 2 74 7b je ffffffff804c138a <tcp_ack+0x873>
ffffffff804c130f: 171 8b b5 cc 04 00 00 mov 0x4cc(%rbp),%esi
ffffffff804c1315: 282 85 f6 test %esi,%esi
ffffffff804c1317: 0 74 1f je ffffffff804c1338 <tcp_ack+0x821>
ffffffff804c1319: 0 0f b6 95 78 03 00 00 movzbl 0x378(%rbp),%edx
ffffffff804c1320: 0 48 c7 c7 b2 d9 6a 80 mov $0xffffffff806ad9b2,%rdi
ffffffff804c1327: 0 31 c0 xor %eax,%eax
ffffffff804c1329: 0 e8 46 5a d7 ff callq ffffffff80236d74 <printk>
ffffffff804c132e: 0 c7 85 cc 04 00 00 00 movl $0x0,0x4cc(%rbp)
ffffffff804c1335: 0 00 00 00
ffffffff804c1338: 198 8b b5 d0 04 00 00 mov 0x4d0(%rbp),%esi
ffffffff804c133e: 257 85 f6 test %esi,%esi
ffffffff804c1340: 0 74 1f je ffffffff804c1361 <tcp_ack+0x84a>
ffffffff804c1342: 0 0f b6 95 78 03 00 00 movzbl 0x378(%rbp),%edx
ffffffff804c1349: 0 48 c7 c7 c3 d9 6a 80 mov $0xffffffff806ad9c3,%rdi
ffffffff804c1350: 0 31 c0 xor %eax,%eax
ffffffff804c1352: 0 e8 1d 5a d7 ff callq ffffffff80236d74 <printk>
ffffffff804c1357: 0 c7 85 d0 04 00 00 00 movl $0x0,0x4d0(%rbp)
ffffffff804c135e: 0 00 00 00
ffffffff804c1361: 2524 8b b5 78 04 00 00 mov 0x478(%rbp),%esi
ffffffff804c1367: 1825 85 f6 test %esi,%esi
ffffffff804c1369: 0 74 1f je ffffffff804c138a <tcp_ack+0x873>
ffffffff804c136b: 0 0f b6 95 78 03 00 00 movzbl 0x378(%rbp),%edx
ffffffff804c1372: 0 48 c7 c7 d4 d9 6a 80 mov $0xffffffff806ad9d4,%rdi
ffffffff804c1379: 0 31 c0 xor %eax,%eax
ffffffff804c137b: 0 e8 f4 59 d7 ff callq ffffffff80236d74 <printk>
ffffffff804c1380: 0 c7 85 78 04 00 00 00 movl $0x0,0x478(%rbp)
ffffffff804c1387: 0 00 00 00
ffffffff804c138a: 46 44 8b 64 24 50 mov 0x50(%rsp),%r12d
ffffffff804c138f: 7369 31 c9 xor %ecx,%ecx
ffffffff804c1391: 348 44 0b 64 24 5c or 0x5c(%rsp),%r12d
ffffffff804c1396: 0 80 bd 5e 04 00 00 00 cmpb $0x0,0x45e(%rbp)
ffffffff804c139d: 96 0f 84 26 02 00 00 je ffffffff804c15c9 <tcp_ack+0xab2>
ffffffff804c13a3: 0 8b 85 cc 04 00 00 mov 0x4cc(%rbp),%eax
ffffffff804c13a9: 0 03 85 d0 04 00 00 add 0x4d0(%rbp),%eax
ffffffff804c13af: 0 3b 85 74 04 00 00 cmp 0x474(%rbp),%eax
ffffffff804c13b5: 0 76 11 jbe ffffffff804c13c8 <tcp_ack+0x8b1>
ffffffff804c13b7: 0 be 58 0c 00 00 mov $0xc58,%esi
ffffffff804c13bc: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c13c3: 0 e8 ed 4d d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c13c8: 0 44 89 e3 mov %r12d,%ebx
ffffffff804c13cb: 0 83 e3 04 and $0x4,%ebx
ffffffff804c13ce: 0 74 07 je ffffffff804c13d7 <tcp_ack+0x8c0>
ffffffff804c13d0: 0 c6 85 79 03 00 00 00 movb $0x0,0x379(%rbp)
ffffffff804c13d7: 0 41 f7 c4 00 10 00 00 test $0x1000,%r12d
ffffffff804c13de: 0 75 0f jne ffffffff804c13ef <tcp_ack+0x8d8>
ffffffff804c13e0: 0 80 bd 5e 04 00 00 01 cmpb $0x1,0x45e(%rbp)
ffffffff804c13e7: 0 76 10 jbe ffffffff804c13f9 <tcp_ack+0x8e2>
ffffffff804c13e9: 0 41 f6 c4 08 test $0x8,%r12b
ffffffff804c13ed: 0 74 0a je ffffffff804c13f9 <tcp_ack+0x8e2>
ffffffff804c13ef: 0 c7 85 78 05 00 00 00 movl $0x0,0x578(%rbp)
ffffffff804c13f6: 0 00 00 00
ffffffff804c13f9: 0 8b 85 58 04 00 00 mov 0x458(%rbp),%eax
ffffffff804c13ff: 0 39 85 00 04 00 00 cmp %eax,0x400(%rbp)
ffffffff804c1405: 0 78 12 js ffffffff804c1419 <tcp_ack+0x902>
ffffffff804c1407: 0 31 f6 xor %esi,%esi
ffffffff804c1409: 0 80 bd 5e 04 00 00 01 cmpb $0x1,0x45e(%rbp)
ffffffff804c1410: 0 40 0f 95 c6 setne %sil
ffffffff804c1414: 0 83 c6 02 add $0x2,%esi
ffffffff804c1417: 0 eb 37 jmp ffffffff804c1450 <tcp_ack+0x939>
ffffffff804c1419: 0 48 89 ef mov %rbp,%rdi
ffffffff804c141c: 0 e8 e0 da ff ff callq ffffffff804bef01 <tcp_is_sackfrto>
ffffffff804c1421: 0 85 c0 test %eax,%eax
ffffffff804c1423: 0 75 3b jne ffffffff804c1460 <tcp_ack+0x949>
ffffffff804c1425: 0 41 f7 c4 34 04 00 00 test $0x434,%r12d
ffffffff804c142c: 0 75 0a jne ffffffff804c1438 <tcp_ack+0x921>
ffffffff804c142e: 0 41 f6 c4 17 test $0x17,%r12b
ffffffff804c1432: 0 0f 85 8c 01 00 00 jne ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c1438: 0 85 db test %ebx,%ebx
ffffffff804c143a: 0 0f 85 8d 00 00 00 jne ffffffff804c14cd <tcp_ack+0x9b6>
ffffffff804c1440: 0 31 f6 xor %esi,%esi
ffffffff804c1442: 0 80 bd 5e 04 00 00 01 cmpb $0x1,0x45e(%rbp)
ffffffff804c1449: 0 40 0f 95 c6 setne %sil
ffffffff804c144d: 0 8d 34 76 lea (%rsi,%rsi,2),%esi
ffffffff804c1450: 0 44 89 e2 mov %r12d,%edx
ffffffff804c1453: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1456: 0 e8 b8 e7 ff ff callq ffffffff804bfc13 <tcp_enter_frto_loss>
ffffffff804c145b: 0 e9 64 01 00 00 jmpq ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c1460: 0 85 db test %ebx,%ebx
ffffffff804c1462: 0 75 37 jne ffffffff804c149b <tcp_ack+0x984>
ffffffff804c1464: 0 80 bd 5e 04 00 00 01 cmpb $0x1,0x45e(%rbp)
ffffffff804c146b: 0 75 2e jne ffffffff804c149b <tcp_ack+0x984>
ffffffff804c146d: 0 8b 85 78 04 00 00 mov 0x478(%rbp),%eax
ffffffff804c1473: 0 03 85 74 04 00 00 add 0x474(%rbp),%eax
ffffffff804c1479: 0 2b 85 d0 04 00 00 sub 0x4d0(%rbp),%eax
ffffffff804c147f: 0 8b 95 ac 04 00 00 mov 0x4ac(%rbp),%edx
ffffffff804c1485: 0 2b 85 cc 04 00 00 sub 0x4cc(%rbp),%eax
ffffffff804c148b: 0 39 d0 cmp %edx,%eax
ffffffff804c148d: 0 0f 47 c2 cmova %edx,%eax
ffffffff804c1490: 0 89 85 ac 04 00 00 mov %eax,0x4ac(%rbp)
ffffffff804c1496: 0 e9 29 01 00 00 jmpq ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c149b: 0 80 bd 5e 04 00 00 01 cmpb $0x1,0x45e(%rbp)
ffffffff804c14a2: 0 76 29 jbe ffffffff804c14cd <tcp_ack+0x9b6>
ffffffff804c14a4: 0 41 f6 c4 34 test $0x34,%r12b
ffffffff804c14a8: 0 74 0f je ffffffff804c14b9 <tcp_ack+0x9a2>
ffffffff804c14aa: 0 44 89 e0 mov %r12d,%eax
ffffffff804c14ad: 0 25 20 02 00 00 and $0x220,%eax
ffffffff804c14b2: 0 83 f8 20 cmp $0x20,%eax
ffffffff804c14b5: 0 75 16 jne ffffffff804c14cd <tcp_ack+0x9b6>
ffffffff804c14b7: 0 eb 0a jmp ffffffff804c14c3 <tcp_ack+0x9ac>
ffffffff804c14b9: 0 41 f6 c4 17 test $0x17,%r12b
ffffffff804c14bd: 0 0f 85 01 01 00 00 jne ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c14c3: 0 44 89 e2 mov %r12d,%edx
ffffffff804c14c6: 0 be 03 00 00 00 mov $0x3,%esi
ffffffff804c14cb: 0 eb 86 jmp ffffffff804c1453 <tcp_ack+0x93c>
ffffffff804c14cd: 0 80 bd 5e 04 00 00 01 cmpb $0x1,0x45e(%rbp)
ffffffff804c14d4: 0 75 45 jne ffffffff804c151b <tcp_ack+0xa04>
ffffffff804c14d6: 0 8b 85 78 04 00 00 mov 0x478(%rbp),%eax
ffffffff804c14dc: 0 03 85 74 04 00 00 add 0x474(%rbp),%eax
ffffffff804c14e2: 0 48 89 ef mov %rbp,%rdi
ffffffff804c14e5: 0 c6 85 5e 04 00 00 02 movb $0x2,0x45e(%rbp)
ffffffff804c14ec: 0 83 c0 02 add $0x2,%eax
ffffffff804c14ef: 0 2b 85 cc 04 00 00 sub 0x4cc(%rbp),%eax
ffffffff804c14f5: 0 2b 85 d0 04 00 00 sub 0x4d0(%rbp),%eax
ffffffff804c14fb: 0 89 85 ac 04 00 00 mov %eax,0x4ac(%rbp)
ffffffff804c1501: 0 e8 0a 3e 00 00 callq ffffffff804c5310 <tcp_may_send_now>
ffffffff804c1506: 0 85 c0 test %eax,%eax
ffffffff804c1508: 0 0f 85 b6 00 00 00 jne ffffffff804c15c4 <tcp_ack+0xaad>
ffffffff804c150e: 0 44 89 e2 mov %r12d,%edx
ffffffff804c1511: 0 be 02 00 00 00 mov $0x2,%esi
ffffffff804c1516: 0 e9 38 ff ff ff jmpq ffffffff804c1453 <tcp_ack+0x93c>
ffffffff804c151b: 0 8b 05 3f 6f 3f 00 mov 0x3f6f3f(%rip),%eax # ffffffff808b8460 <sysctl_tcp_frto_response>
ffffffff804c1521: 0 83 f8 01 cmp $0x1,%eax
ffffffff804c1524: 0 74 1a je ffffffff804c1540 <tcp_ack+0xa29>
ffffffff804c1526: 0 83 f8 02 cmp $0x2,%eax
ffffffff804c1529: 0 75 5d jne ffffffff804c1588 <tcp_ack+0xa71>
ffffffff804c152b: 0 41 f6 c4 40 test $0x40,%r12b
ffffffff804c152f: 0 75 57 jne ffffffff804c1588 <tcp_ack+0xa71>
ffffffff804c1531: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c1536: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1539: 0 e8 5a db ff ff callq ffffffff804bf098 <tcp_undo_cwr>
ffffffff804c153e: 0 eb 50 jmp ffffffff804c1590 <tcp_ack+0xa79>
ffffffff804c1540: 0 8b 85 ac 04 00 00 mov 0x4ac(%rbp),%eax
ffffffff804c1546: 0 8b 95 a8 04 00 00 mov 0x4a8(%rbp),%edx
ffffffff804c154c: 0 c7 85 b0 04 00 00 00 movl $0x0,0x4b0(%rbp)
ffffffff804c1553: 0 00 00 00
ffffffff804c1556: 0 c7 85 dc 04 00 00 00 movl $0x0,0x4dc(%rbp)
ffffffff804c155d: 0 00 00 00
ffffffff804c1560: 0 39 c2 cmp %eax,%edx
ffffffff804c1562: 0 0f 46 c2 cmovbe %edx,%eax
ffffffff804c1565: 0 89 85 ac 04 00 00 mov %eax,0x4ac(%rbp)
ffffffff804c156b: 0 8a 85 7e 04 00 00 mov 0x47e(%rbp),%al
ffffffff804c1571: 0 a8 01 test $0x1,%al
ffffffff804c1573: 0 74 09 je ffffffff804c157e <tcp_ack+0xa67>
ffffffff804c1575: 0 83 c8 02 or $0x2,%eax
ffffffff804c1578: 0 88 85 7e 04 00 00 mov %al,0x47e(%rbp)
ffffffff804c157e: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1581: 0 e8 27 da ff ff callq ffffffff804befad <tcp_moderate_cwnd>
ffffffff804c1586: 0 eb 08 jmp ffffffff804c1590 <tcp_ack+0xa79>
ffffffff804c1588: 0 48 89 ef mov %rbp,%rdi
ffffffff804c158b: 0 e8 78 dd ff ff callq ffffffff804bf308 <tcp_ratehalving_spur_to_response>
ffffffff804c1590: 0 c6 85 5e 04 00 00 00 movb $0x0,0x45e(%rbp)
ffffffff804c1597: 0 c7 85 78 05 00 00 00 movl $0x0,0x578(%rbp)
ffffffff804c159e: 0 00 00 00
ffffffff804c15a1: 0 31 c9 xor %ecx,%ecx
ffffffff804c15a3: 0 48 8b 05 0e 01 5f 00 mov 0x5f010e(%rip),%rax # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c15aa: 0 65 8b 14 25 24 00 00 mov %gs:0x24,%edx
ffffffff804c15b1: 0 00
ffffffff804c15b2: 0 89 d2 mov %edx,%edx
ffffffff804c15b4: 0 48 f7 d0 not %rax
ffffffff804c15b7: 0 48 8b 04 d0 mov (%rax,%rdx,8),%rax
ffffffff804c15bb: 0 48 ff 80 28 02 00 00 incq 0x228(%rax)
ffffffff804c15c2: 0 eb 05 jmp ffffffff804c15c9 <tcp_ack+0xab2>
ffffffff804c15c4: 0 b9 01 00 00 00 mov $0x1,%ecx
ffffffff804c15c9: 466 8b 95 00 04 00 00 mov 0x400(%rbp),%edx
ffffffff804c15cf: 5645 39 95 58 04 00 00 cmp %edx,0x458(%rbp)
ffffffff804c15d5: 176 79 0a jns ffffffff804c15e1 <tcp_ack+0xaca>
ffffffff804c15d7: 24 c7 85 58 04 00 00 00 movl $0x0,0x458(%rbp)
ffffffff804c15de: 0 00 00 00
ffffffff804c15e1: 620 8b 54 24 2c mov 0x2c(%rsp),%edx
ffffffff804c15e5: 639 03 54 24 30 add 0x30(%rsp),%edx
ffffffff804c15e9: 2 44 89 e3 mov %r12d,%ebx
ffffffff804c15ec: 283 2b 54 24 28 sub 0x28(%rsp),%edx
ffffffff804c15f0: 154 2b 54 24 24 sub 0x24(%rsp),%edx
ffffffff804c15f4: 0 83 e3 17 and $0x17,%ebx
ffffffff804c15f7: 266 89 5c 24 54 mov %ebx,0x54(%rsp)
ffffffff804c15fb: 168 74 13 je ffffffff804c1610 <tcp_ack+0xaf9>
ffffffff804c15fd: 0 41 f6 c4 60 test $0x60,%r12b
ffffffff804c1601: 6575 75 0d jne ffffffff804c1610 <tcp_ack+0xaf9>
ffffffff804c1603: 20 80 bd 78 03 00 00 00 cmpb $0x0,0x378(%rbp)
ffffffff804c160a: 1417 0f 84 3a 09 00 00 je ffffffff804c1f4a <tcp_ack+0x1433>
ffffffff804c1610: 0 44 89 e0 mov %r12d,%eax
ffffffff804c1613: 0 c1 e8 02 shr $0x2,%eax
ffffffff804c1616: 0 88 c3 mov %al,%bl
ffffffff804c1618: 0 80 e3 01 and $0x1,%bl
ffffffff804c161b: 0 41 88 de mov %bl,%r14b
ffffffff804c161e: 0 74 36 je ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c1620: 0 85 c9 test %ecx,%ecx
ffffffff804c1622: 0 75 32 jne ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c1624: 0 41 f6 c4 40 test $0x40,%r12b
ffffffff804c1628: 0 74 0e je ffffffff804c1638 <tcp_ack+0xb21>
ffffffff804c162a: 0 8b 85 a8 04 00 00 mov 0x4a8(%rbp),%eax
ffffffff804c1630: 0 39 85 ac 04 00 00 cmp %eax,0x4ac(%rbp)
ffffffff804c1636: 0 73 1e jae ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c1638: 0 0f b6 8d 78 03 00 00 movzbl 0x378(%rbp),%ecx
ffffffff804c163f: 0 b8 0c 00 00 00 mov $0xc,%eax
ffffffff804c1644: 0 d3 f8 sar %cl,%eax
ffffffff804c1646: 0 a8 01 test $0x1,%al
ffffffff804c1648: 0 75 0c jne ffffffff804c1656 <tcp_ack+0xb3f>
ffffffff804c164a: 0 8b 74 24 1c mov 0x1c(%rsp),%esi
ffffffff804c164e: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1651: 0 e8 6b dc ff ff callq ffffffff804bf2c1 <tcp_cong_avoid>
ffffffff804c1656: 0 31 db xor %ebx,%ebx
ffffffff804c1658: 0 41 f7 c4 17 04 00 00 test $0x417,%r12d
ffffffff804c165f: 0 44 8b bd 74 04 00 00 mov 0x474(%rbp),%r15d
ffffffff804c1666: 0 0f 94 c3 sete %bl
ffffffff804c1669: 0 41 bd 01 00 00 00 mov $0x1,%r13d
ffffffff804c166f: 0 85 db test %ebx,%ebx
ffffffff804c1671: 0 75 21 jne ffffffff804c1694 <tcp_ack+0xb7d>
ffffffff804c1673: 0 45 30 ed xor %r13b,%r13b
ffffffff804c1676: 0 41 f6 c4 20 test $0x20,%r12b
ffffffff804c167a: 0 74 18 je ffffffff804c1694 <tcp_ack+0xb7d>
ffffffff804c167c: 0 48 89 ef mov %rbp,%rdi
ffffffff804c167f: 0 45 31 ed xor %r13d,%r13d
ffffffff804c1682: 0 e8 cf d8 ff ff callq ffffffff804bef56 <tcp_fackets_out>
ffffffff804c1687: 0 0f b6 95 7f 04 00 00 movzbl 0x47f(%rbp),%edx
ffffffff804c168e: 0 39 d0 cmp %edx,%eax
ffffffff804c1690: 0 41 0f 9f c5 setg %r13b
ffffffff804c1694: 0 83 bd 74 04 00 00 00 cmpl $0x0,0x474(%rbp)
ffffffff804c169b: 0 75 24 jne ffffffff804c16c1 <tcp_ack+0xbaa>
ffffffff804c169d: 0 83 bd d0 04 00 00 00 cmpl $0x0,0x4d0(%rbp)
ffffffff804c16a4: 0 74 1b je ffffffff804c16c1 <tcp_ack+0xbaa>
ffffffff804c16a6: 0 be 16 0a 00 00 mov $0xa16,%esi
ffffffff804c16ab: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c16b2: 0 e8 fe 4a d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c16b7: 0 c7 85 d0 04 00 00 00 movl $0x0,0x4d0(%rbp)
ffffffff804c16be: 0 00 00 00
ffffffff804c16c1: 0 83 bd d0 04 00 00 00 cmpl $0x0,0x4d0(%rbp)
ffffffff804c16c8: 0 75 24 jne ffffffff804c16ee <tcp_ack+0xbd7>
ffffffff804c16ca: 0 83 bd d4 04 00 00 00 cmpl $0x0,0x4d4(%rbp)
ffffffff804c16d1: 0 74 1b je ffffffff804c16ee <tcp_ack+0xbd7>
ffffffff804c16d3: 0 be 18 0a 00 00 mov $0xa18,%esi
ffffffff804c16d8: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c16df: 0 e8 d1 4a d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c16e4: 0 c7 85 d4 04 00 00 00 movl $0x0,0x4d4(%rbp)
ffffffff804c16eb: 0 00 00 00
ffffffff804c16ee: 0 44 89 e0 mov %r12d,%eax
ffffffff804c16f1: 0 83 e0 40 and $0x40,%eax
ffffffff804c16f4: 0 89 44 24 58 mov %eax,0x58(%rsp)
ffffffff804c16f8: 0 74 0a je ffffffff804c1704 <tcp_ack+0xbed>
ffffffff804c16fa: 0 c7 85 6c 05 00 00 00 movl $0x0,0x56c(%rbp)
ffffffff804c1701: 0 00 00 00
ffffffff804c1704: 0 41 f7 c4 00 20 00 00 test $0x2000,%r12d
ffffffff804c170b: 0 0f 84 50 08 00 00 je ffffffff804c1f61 <tcp_ack+0x144a>
ffffffff804c1711: 0 48 8b 15 a0 ff 5e 00 mov 0x5effa0(%rip),%rdx # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c1718: 0 48 89 ef mov %rbp,%rdi
ffffffff804c171b: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c1720: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax
ffffffff804c1727: 0 00
ffffffff804c1728: 0 89 c0 mov %eax,%eax
ffffffff804c172a: 0 48 f7 d2 not %rdx
ffffffff804c172d: 0 48 8b 04 c2 mov (%rdx,%rax,8),%rax
ffffffff804c1731: 0 48 ff 80 00 01 00 00 incq 0x100(%rax)
ffffffff804c1738: 0 e8 df e2 ff ff callq ffffffff804bfa1c <tcp_enter_loss>
ffffffff804c173d: 0 48 8b b5 c0 00 00 00 mov 0xc0(%rbp),%rsi
ffffffff804c1744: 0 fe 85 79 03 00 00 incb 0x379(%rbp)
ffffffff804c174a: 0 48 8d 85 c0 00 00 00 lea 0xc0(%rbp),%rax
ffffffff804c1751: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1754: 0 48 39 c6 cmp %rax,%rsi
ffffffff804c1757: 0 b8 00 00 00 00 mov $0x0,%eax
ffffffff804c175c: 0 48 0f 44 f0 cmove %rax,%rsi
ffffffff804c1760: 0 e8 2d 4b 00 00 callq ffffffff804c6292 <tcp_retransmit_skb>
ffffffff804c1765: 0 8b 95 58 03 00 00 mov 0x358(%rbp),%edx
ffffffff804c176b: 0 b9 30 75 00 00 mov $0x7530,%ecx
ffffffff804c1770: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c1775: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1778: 0 e8 66 de ff ff callq ffffffff804bf5e3 <inet_csk_reset_xmit_timer>
ffffffff804c177d: 0 e9 dd 06 00 00 jmpq ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1782: 0 45 84 e4 test %r12b,%r12b
ffffffff804c1785: 0 79 51 jns ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c1787: 0 8b 95 70 05 00 00 mov 0x570(%rbp),%edx
ffffffff804c178d: 0 39 95 00 04 00 00 cmp %edx,0x400(%rbp)
ffffffff804c1793: 0 79 43 jns ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c1795: 0 80 bd 78 03 00 00 00 cmpb $0x0,0x378(%rbp)
ffffffff804c179c: 0 74 3a je ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c179e: 0 0f b6 85 7f 04 00 00 movzbl 0x47f(%rbp),%eax
ffffffff804c17a5: 0 8b b5 d4 04 00 00 mov 0x4d4(%rbp),%esi
ffffffff804c17ab: 0 39 c6 cmp %eax,%esi
ffffffff804c17ad: 0 76 29 jbe ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c17af: 0 29 c6 sub %eax,%esi
ffffffff804c17b1: 0 48 89 ef mov %rbp,%rdi
ffffffff804c17b4: 0 e8 58 e6 ff ff callq ffffffff804bfe11 <tcp_mark_head_lost>
ffffffff804c17b9: 0 48 8b 05 f8 fe 5e 00 mov 0x5efef8(%rip),%rax # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c17c0: 0 65 8b 14 25 24 00 00 mov %gs:0x24,%edx
ffffffff804c17c7: 0 00
ffffffff804c17c8: 0 89 d2 mov %edx,%edx
ffffffff804c17ca: 0 48 f7 d0 not %rax
ffffffff804c17cd: 0 48 8b 04 d0 mov (%rax,%rdx,8),%rax
ffffffff804c17d1: 0 48 ff 80 48 01 00 00 incq 0x148(%rax)
ffffffff804c17d8: 0 8b 85 cc 04 00 00 mov 0x4cc(%rbp),%eax
ffffffff804c17de: 0 03 85 d0 04 00 00 add 0x4d0(%rbp),%eax
ffffffff804c17e4: 0 3b 85 74 04 00 00 cmp 0x474(%rbp),%eax
ffffffff804c17ea: 0 76 11 jbe ffffffff804c17fd <tcp_ack+0xce6>
ffffffff804c17ec: 0 be 2e 0a 00 00 mov $0xa2e,%esi
ffffffff804c17f1: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c17f8: 0 e8 b8 49 d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c17fd: 0 8a 85 78 03 00 00 mov 0x378(%rbp),%al
ffffffff804c1803: 0 84 c0 test %al,%al
ffffffff804c1805: 0 75 29 jne ffffffff804c1830 <tcp_ack+0xd19>
ffffffff804c1807: 0 83 bd 78 04 00 00 00 cmpl $0x0,0x478(%rbp)
ffffffff804c180e: 0 74 11 je ffffffff804c1821 <tcp_ack+0xd0a>
ffffffff804c1810: 0 be 33 0a 00 00 mov $0xa33,%esi
ffffffff804c1815: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c181c: 0 e8 94 49 d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c1821: 0 c7 85 74 05 00 00 00 movl $0x0,0x574(%rbp)
ffffffff804c1828: 0 00 00 00
ffffffff804c182b: 0 e9 c4 00 00 00 jmpq ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1830: 0 8b 8d 70 05 00 00 mov 0x570(%rbp),%ecx
ffffffff804c1836: 0 8b 95 00 04 00 00 mov 0x400(%rbp),%edx
ffffffff804c183c: 0 39 ca cmp %ecx,%edx
ffffffff804c183e: 0 0f 88 b0 00 00 00 js ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1844: 0 3c 02 cmp $0x2,%al
ffffffff804c1846: 0 74 31 je ffffffff804c1879 <tcp_ack+0xd62>
ffffffff804c1848: 0 77 0a ja ffffffff804c1854 <tcp_ack+0xd3d>
ffffffff804c184a: 0 fe c8 dec %al
ffffffff804c184c: 0 0f 85 a2 00 00 00 jne ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1852: 0 eb 33 jmp ffffffff804c1887 <tcp_ack+0xd70>
ffffffff804c1854: 0 3c 03 cmp $0x3,%al
ffffffff804c1856: 0 74 6f je ffffffff804c18c7 <tcp_ack+0xdb0>
ffffffff804c1858: 0 3c 04 cmp $0x4,%al
ffffffff804c185a: 0 0f 85 94 00 00 00 jne ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1860: 0 c6 85 79 03 00 00 00 movb $0x0,0x379(%rbp)
ffffffff804c1867: 0 48 89 ef mov %rbp,%rdi
ffffffff804c186a: 0 e8 fb d8 ff ff callq ffffffff804bf16a <tcp_try_undo_recovery>
ffffffff804c186f: 0 85 c0 test %eax,%eax
ffffffff804c1871: 0 0f 85 e8 05 00 00 jne ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1877: 0 eb 7b jmp ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c1879: 0 39 ca cmp %ecx,%edx
ffffffff804c187b: 0 74 77 je ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c187d: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1880: 0 e8 b8 d9 ff ff callq ffffffff804bf23d <tcp_complete_cwr>
ffffffff804c1885: 0 eb 34 jmp ffffffff804c18bb <tcp_ack+0xda4>
ffffffff804c1887: 0 48 89 ef mov %rbp,%rdi
ffffffff804c188a: 0 e8 63 d9 ff ff callq ffffffff804bf1f2 <tcp_try_undo_dsack>
ffffffff804c188f: 0 83 bd 78 05 00 00 00 cmpl $0x0,0x578(%rbp)
ffffffff804c1896: 0 74 19 je ffffffff804c18b1 <tcp_ack+0xd9a>
ffffffff804c1898: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c189e: 0 c0 e8 04 shr $0x4,%al
ffffffff804c18a1: 0 74 0e je ffffffff804c18b1 <tcp_ack+0xd9a>
ffffffff804c18a3: 0 8b 85 70 05 00 00 mov 0x570(%rbp),%eax
ffffffff804c18a9: 0 39 85 00 04 00 00 cmp %eax,0x400(%rbp)
ffffffff804c18af: 0 74 43 je ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c18b1: 0 c7 85 78 05 00 00 00 movl $0x0,0x578(%rbp)
ffffffff804c18b8: 0 00 00 00
ffffffff804c18bb: 0 31 f6 xor %esi,%esi
ffffffff804c18bd: 0 48 89 ef mov %rbp,%rdi
ffffffff804c18c0: 0 e8 b4 cf ff ff callq ffffffff804be879 <tcp_set_ca_state>
ffffffff804c18c5: 0 eb 2d jmp ffffffff804c18f4 <tcp_ack+0xddd>
ffffffff804c18c7: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c18cd: 0 c0 e8 04 shr $0x4,%al
ffffffff804c18d0: 0 75 0a jne ffffffff804c18dc <tcp_ack+0xdc5>
ffffffff804c18d2: 0 c7 85 d0 04 00 00 00 movl $0x0,0x4d0(%rbp)
ffffffff804c18d9: 0 00 00 00
ffffffff804c18dc: 0 48 89 ef mov %rbp,%rdi
ffffffff804c18df: 0 e8 86 d8 ff ff callq ffffffff804bf16a <tcp_try_undo_recovery>
ffffffff804c18e4: 0 85 c0 test %eax,%eax
ffffffff804c18e6: 0 0f 85 73 05 00 00 jne ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c18ec: 0 48 89 ef mov %rbp,%rdi
ffffffff804c18ef: 0 e8 49 d9 ff ff callq ffffffff804bf23d <tcp_complete_cwr>
ffffffff804c18f4: 0 8a 85 78 03 00 00 mov 0x378(%rbp),%al
ffffffff804c18fa: 0 3c 03 cmp $0x3,%al
ffffffff804c18fc: 0 74 0d je ffffffff804c190b <tcp_ack+0xdf4>
ffffffff804c18fe: 0 3c 04 cmp $0x4,%al
ffffffff804c1900: 0 0f 85 b8 01 00 00 jne ffffffff804c1abe <tcp_ack+0xfa7>
ffffffff804c1906: 0 e9 c4 00 00 00 jmpq ffffffff804c19cf <tcp_ack+0xeb8>
ffffffff804c190b: 0 41 f7 c4 00 04 00 00 test $0x400,%r12d
ffffffff804c1912: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1918: 0 75 1e jne ffffffff804c1938 <tcp_ack+0xe21>
ffffffff804c191a: 0 c0 e8 04 shr $0x4,%al
ffffffff804c191d: 0 0f 85 fd 03 00 00 jne ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c1923: 0 85 db test %ebx,%ebx
ffffffff804c1925: 0 0f 84 f5 03 00 00 je ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c192b: 0 48 89 ef mov %rbp,%rdi
ffffffff804c192e: 0 e8 54 dd ff ff callq ffffffff804bf687 <tcp_add_reno_sack>
ffffffff804c1933: 0 e9 e8 03 00 00 jmpq ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c1938: 0 c0 e8 04 shr $0x4,%al
ffffffff804c193b: 0 41 bd 01 00 00 00 mov $0x1,%r13d
ffffffff804c1941: 0 74 18 je ffffffff804c195b <tcp_ack+0xe44>
ffffffff804c1943: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1946: 0 45 31 ed xor %r13d,%r13d
ffffffff804c1949: 0 e8 08 d6 ff ff callq ffffffff804bef56 <tcp_fackets_out>
ffffffff804c194e: 0 0f b6 95 7f 04 00 00 movzbl 0x47f(%rbp),%edx
ffffffff804c1955: 0 39 d0 cmp %edx,%eax
ffffffff804c1957: 0 41 0f 9f c5 setg %r13b
ffffffff804c195b: 0 48 89 ef mov %rbp,%rdi
ffffffff804c195e: 0 e8 c9 d7 ff ff callq ffffffff804bf12c <tcp_may_undo>
ffffffff804c1963: 0 85 c0 test %eax,%eax
ffffffff804c1965: 0 0f 84 b5 03 00 00 je ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c196b: 0 83 bd 78 04 00 00 00 cmpl $0x0,0x478(%rbp)
ffffffff804c1972: 0 75 0a jne ffffffff804c197e <tcp_ack+0xe67>
ffffffff804c1974: 0 c7 85 74 05 00 00 00 movl $0x0,0x574(%rbp)
ffffffff804c197b: 0 00 00 00
ffffffff804c197e: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1981: 0 45 31 ed xor %r13d,%r13d
ffffffff804c1984: 0 e8 cd d5 ff ff callq ffffffff804bef56 <tcp_fackets_out>
ffffffff804c1989: 0 44 29 7c 24 14 sub %r15d,0x14(%rsp)
ffffffff804c198e: 0 ba 01 00 00 00 mov $0x1,%edx
ffffffff804c1993: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1996: 0 8b 74 24 14 mov 0x14(%rsp),%esi
ffffffff804c199a: 0 01 c6 add %eax,%esi
ffffffff804c199c: 0 e8 ed d3 ff ff callq ffffffff804bed8e <tcp_update_reordering>
ffffffff804c19a1: 0 31 f6 xor %esi,%esi
ffffffff804c19a3: 0 48 89 ef mov %rbp,%rdi
ffffffff804c19a6: 0 e8 ed d6 ff ff callq ffffffff804bf098 <tcp_undo_cwr>
ffffffff804c19ab: 0 48 8b 15 06 fd 5e 00 mov 0x5efd06(%rip),%rdx # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c19b2: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax
ffffffff804c19b9: 0 00
ffffffff804c19ba: 0 89 c0 mov %eax,%eax
ffffffff804c19bc: 0 48 f7 d2 not %rdx
ffffffff804c19bf: 0 48 8b 04 c2 mov (%rdx,%rax,8),%rax
ffffffff804c19c3: 0 48 ff 80 30 01 00 00 incq 0x130(%rax)
ffffffff804c19ca: 0 e9 51 03 00 00 jmpq ffffffff804c1d20 <tcp_ack+0x1209>
ffffffff804c19cf: 0 45 84 f6 test %r14b,%r14b
ffffffff804c19d2: 0 74 07 je ffffffff804c19db <tcp_ack+0xec4>
ffffffff804c19d4: 0 c6 85 79 03 00 00 00 movb $0x0,0x379(%rbp)
ffffffff804c19db: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c19e1: 0 c0 e8 04 shr $0x4,%al
ffffffff804c19e4: 0 75 13 jne ffffffff804c19f9 <tcp_ack+0xee2>
ffffffff804c19e6: 0 41 f7 c4 00 04 00 00 test $0x400,%r12d
ffffffff804c19ed: 0 74 0a je ffffffff804c19f9 <tcp_ack+0xee2>
ffffffff804c19ef: 0 c7 85 d0 04 00 00 00 movl $0x0,0x4d0(%rbp)
ffffffff804c19f6: 0 00 00 00
ffffffff804c19f9: 0 48 89 ef mov %rbp,%rdi
ffffffff804c19fc: 0 e8 2b d7 ff ff callq ffffffff804bf12c <tcp_may_undo>
ffffffff804c1a01: 0 85 c0 test %eax,%eax
ffffffff804c1a03: 0 0f 84 6e 05 00 00 je ffffffff804c1f77 <tcp_ack+0x1460>
ffffffff804c1a09: 0 48 8b 95 c0 00 00 00 mov 0xc0(%rbp),%rdx
ffffffff804c1a10: 0 48 8d 8d c0 00 00 00 lea 0xc0(%rbp),%rcx
ffffffff804c1a17: 0 eb 10 jmp ffffffff804c1a29 <tcp_ack+0xf12>
ffffffff804c1a19: 0 48 3b 95 d8 01 00 00 cmp 0x1d8(%rbp),%rdx
ffffffff804c1a20: 0 74 12 je ffffffff804c1a34 <tcp_ack+0xf1d>
ffffffff804c1a22: 0 80 62 5d fb andb $0xfb,0x5d(%rdx)
ffffffff804c1a26: 0 48 8b 12 mov (%rdx),%rdx
ffffffff804c1a29: 0 48 8b 02 mov (%rdx),%rax
ffffffff804c1a2c: 0 48 39 ca cmp %rcx,%rdx
ffffffff804c1a2f: 0 0f 18 08 prefetcht0 (%rax)
ffffffff804c1a32: 0 75 e5 jne ffffffff804c1a19 <tcp_ack+0xf02>
ffffffff804c1a34: 0 48 c7 85 e0 04 00 00 movq $0x0,0x4e0(%rbp)
ffffffff804c1a3b: 0 00 00 00 00
ffffffff804c1a3f: 0 48 c7 85 e8 04 00 00 movq $0x0,0x4e8(%rbp)
ffffffff804c1a46: 0 00 00 00 00
ffffffff804c1a4a: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c1a4f: 0 48 c7 85 f0 04 00 00 movq $0x0,0x4f0(%rbp)
ffffffff804c1a56: 0 00 00 00 00
ffffffff804c1a5a: 0 c7 85 cc 04 00 00 00 movl $0x0,0x4cc(%rbp)
ffffffff804c1a61: 0 00 00 00
ffffffff804c1a64: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1a67: 0 e8 2c d6 ff ff callq ffffffff804bf098 <tcp_undo_cwr>
ffffffff804c1a6c: 0 48 8b 15 45 fc 5e 00 mov 0x5efc45(%rip),%rdx # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c1a73: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax
ffffffff804c1a7a: 0 00
ffffffff804c1a7b: 0 89 c0 mov %eax,%eax
ffffffff804c1a7d: 0 48 f7 d2 not %rdx
ffffffff804c1a80: 0 48 8b 04 c2 mov (%rdx,%rax,8),%rax
ffffffff804c1a84: 0 48 ff 80 40 01 00 00 incq 0x140(%rax)
ffffffff804c1a8b: 0 c6 85 79 03 00 00 00 movb $0x0,0x379(%rbp)
ffffffff804c1a92: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1a98: 0 c7 85 78 05 00 00 00 movl $0x0,0x578(%rbp)
ffffffff804c1a9f: 0 00 00 00
ffffffff804c1aa2: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1aa5: 0 74 0a je ffffffff804c1ab1 <tcp_ack+0xf9a>
ffffffff804c1aa7: 0 31 f6 xor %esi,%esi
ffffffff804c1aa9: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1aac: 0 e8 c8 cd ff ff callq ffffffff804be879 <tcp_set_ca_state>
ffffffff804c1ab1: 0 80 bd 78 03 00 00 00 cmpb $0x0,0x378(%rbp)
ffffffff804c1ab8: 0 0f 85 a1 03 00 00 jne ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1abe: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1ac4: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1ac7: 0 75 1f jne ffffffff804c1ae8 <tcp_ack+0xfd1>
ffffffff804c1ac9: 0 41 f7 c4 00 04 00 00 test $0x400,%r12d
ffffffff804c1ad0: 0 74 0a je ffffffff804c1adc <tcp_ack+0xfc5>
ffffffff804c1ad2: 0 c7 85 d0 04 00 00 00 movl $0x0,0x4d0(%rbp)
ffffffff804c1ad9: 0 00 00 00
ffffffff804c1adc: 0 85 db test %ebx,%ebx
ffffffff804c1ade: 0 74 08 je ffffffff804c1ae8 <tcp_ack+0xfd1>
ffffffff804c1ae0: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1ae3: 0 e8 9f db ff ff callq ffffffff804bf687 <tcp_add_reno_sack>
ffffffff804c1ae8: 0 80 bd 78 03 00 00 01 cmpb $0x1,0x378(%rbp)
ffffffff804c1aef: 0 75 08 jne ffffffff804c1af9 <tcp_ack+0xfe2>
ffffffff804c1af1: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1af4: 0 e8 f9 d6 ff ff callq ffffffff804bf1f2 <tcp_try_undo_dsack>
ffffffff804c1af9: 0 80 bd 5e 04 00 00 00 cmpb $0x0,0x45e(%rbp)
ffffffff804c1b00: 0 0f 85 90 00 00 00 jne ffffffff804c1b96 <tcp_ack+0x107f>
ffffffff804c1b06: 0 83 bd cc 04 00 00 00 cmpl $0x0,0x4cc(%rbp)
ffffffff804c1b0d: 0 0f 85 79 04 00 00 jne ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b13: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1b19: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1b1c: 0 a8 02 test $0x2,%al
ffffffff804c1b1e: 0 74 08 je ffffffff804c1b28 <tcp_ack+0x1011>
ffffffff804c1b20: 0 8b 95 d4 04 00 00 mov 0x4d4(%rbp),%edx
ffffffff804c1b26: 0 eb 08 jmp ffffffff804c1b30 <tcp_ack+0x1019>
ffffffff804c1b28: 0 8b 95 d0 04 00 00 mov 0x4d0(%rbp),%edx
ffffffff804c1b2e: 0 ff c2 inc %edx
ffffffff804c1b30: 0 0f b6 85 7f 04 00 00 movzbl 0x47f(%rbp),%eax
ffffffff804c1b37: 0 39 c2 cmp %eax,%edx
ffffffff804c1b39: 0 0f 8f 4d 04 00 00 jg ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b3f: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1b45: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1b48: 0 a8 02 test $0x2,%al
ffffffff804c1b4a: 0 74 10 je ffffffff804c1b5c <tcp_ack+0x1045>
ffffffff804c1b4c: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1b4f: 0 e8 1d d4 ff ff callq ffffffff804bef71 <tcp_head_timedout>
ffffffff804c1b54: 0 85 c0 test %eax,%eax
ffffffff804c1b56: 0 0f 85 30 04 00 00 jne ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b5c: 0 0f b6 85 7f 04 00 00 movzbl 0x47f(%rbp),%eax
ffffffff804c1b63: 0 8b 95 74 04 00 00 mov 0x474(%rbp),%edx
ffffffff804c1b69: 0 39 c2 cmp %eax,%edx
ffffffff804c1b6b: 0 77 29 ja ffffffff804c1b96 <tcp_ack+0x107f>
ffffffff804c1b6d: 0 89 d0 mov %edx,%eax
ffffffff804c1b6f: 0 d1 e8 shr %eax
ffffffff804c1b71: 0 39 05 c1 68 3f 00 cmp %eax,0x3f68c1(%rip) # ffffffff808b8438 <sysctl_tcp_reordering>
ffffffff804c1b77: 0 0f 43 05 ba 68 3f 00 cmovae 0x3f68ba(%rip),%eax # ffffffff808b8438 <sysctl_tcp_reordering>
ffffffff804c1b7e: 0 39 85 d0 04 00 00 cmp %eax,0x4d0(%rbp)
ffffffff804c1b84: 0 72 10 jb ffffffff804c1b96 <tcp_ack+0x107f>
ffffffff804c1b86: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1b89: 0 e8 82 37 00 00 callq ffffffff804c5310 <tcp_may_send_now>
ffffffff804c1b8e: 0 85 c0 test %eax,%eax
ffffffff804c1b90: 0 0f 84 f6 03 00 00 je ffffffff804c1f8c <tcp_ack+0x1475>
ffffffff804c1b96: 0 8b 85 cc 04 00 00 mov 0x4cc(%rbp),%eax
ffffffff804c1b9c: 0 03 85 d0 04 00 00 add 0x4d0(%rbp),%eax
ffffffff804c1ba2: 0 3b 85 74 04 00 00 cmp 0x474(%rbp),%eax
ffffffff804c1ba8: 0 76 11 jbe ffffffff804c1bbb <tcp_ack+0x10a4>
ffffffff804c1baa: 0 be d7 09 00 00 mov $0x9d7,%esi
ffffffff804c1baf: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c1bb6: 0 e8 fa 45 d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c1bbb: 0 80 bd 5e 04 00 00 00 cmpb $0x0,0x45e(%rbp)
ffffffff804c1bc2: 0 75 13 jne ffffffff804c1bd7 <tcp_ack+0x10c0>
ffffffff804c1bc4: 0 83 bd 78 04 00 00 00 cmpl $0x0,0x478(%rbp)
ffffffff804c1bcb: 0 75 0a jne ffffffff804c1bd7 <tcp_ack+0x10c0>
ffffffff804c1bcd: 0 c7 85 74 05 00 00 00 movl $0x0,0x574(%rbp)
ffffffff804c1bd4: 0 00 00 00
ffffffff804c1bd7: 0 83 7c 24 58 00 cmpl $0x0,0x58(%rsp)
ffffffff804c1bdc: 0 74 0d je ffffffff804c1beb <tcp_ack+0x10d4>
ffffffff804c1bde: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c1be3: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1be6: 0 e8 cf d0 ff ff callq ffffffff804becba <tcp_enter_cwr>
ffffffff804c1beb: 0 80 bd 78 03 00 00 02 cmpb $0x2,0x378(%rbp)
ffffffff804c1bf2: 0 74 15 je ffffffff804c1c09 <tcp_ack+0x10f2>
ffffffff804c1bf4: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1bf7: 0 e8 71 d6 ff ff callq ffffffff804bf26d <tcp_try_keep_open>
ffffffff804c1bfc: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1bff: 0 e8 a9 d3 ff ff callq ffffffff804befad <tcp_moderate_cwnd>
ffffffff804c1c04: 0 e9 56 02 00 00 jmpq ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1c09: 0 44 89 e6 mov %r12d,%esi
ffffffff804c1c0c: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1c0f: 0 e8 d9 d3 ff ff callq ffffffff804befed <tcp_cwnd_down>
ffffffff804c1c14: 0 e9 46 02 00 00 jmpq ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1c19: 0 8b 95 a4 03 00 00 mov 0x3a4(%rbp),%edx
ffffffff804c1c1f: 0 85 d2 test %edx,%edx
ffffffff804c1c21: 0 74 34 je ffffffff804c1c57 <tcp_ack+0x1140>
ffffffff804c1c23: 0 8b 85 b0 05 00 00 mov 0x5b0(%rbp),%eax
ffffffff804c1c29: 0 39 85 00 04 00 00 cmp %eax,0x400(%rbp)
ffffffff804c1c2f: 0 75 26 jne ffffffff804c1c57 <tcp_ack+0x1140>
ffffffff804c1c31: 0 ff 85 ac 04 00 00 incl 0x4ac(%rbp)
ffffffff804c1c37: 0 8d 42 ff lea -0x1(%rdx),%eax
ffffffff804c1c3a: 0 c7 85 a4 03 00 00 00 movl $0x0,0x3a4(%rbp)
ffffffff804c1c41: 0 00 00 00
ffffffff804c1c44: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1c47: 0 89 85 9c 03 00 00 mov %eax,0x39c(%rbp)
ffffffff804c1c4d: 0 e8 86 54 00 00 callq ffffffff804c70d8 <tcp_simple_retransmit>
ffffffff804c1c52: 0 e9 08 02 00 00 jmpq ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1c57: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1c5d: 0 48 8b 15 54 fa 5e 00 mov 0x5efa54(%rip),%rdx # ffffffff80ab16b8 <init_net+0xe8>
ffffffff804c1c64: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1c67: 0 48 f7 d2 not %rdx
ffffffff804c1c6a: 0 3c 01 cmp $0x1,%al
ffffffff804c1c6c: 0 19 c9 sbb %ecx,%ecx
ffffffff804c1c6e: 0 65 8b 04 25 24 00 00 mov %gs:0x24,%eax
ffffffff804c1c75: 0 00
ffffffff804c1c76: 0 89 c0 mov %eax,%eax
ffffffff804c1c78: 0 83 c1 1f add $0x1f,%ecx
ffffffff804c1c7b: 0 48 8b 04 c2 mov (%rdx,%rax,8),%rax
ffffffff804c1c7f: 0 48 63 c9 movslq %ecx,%rcx
ffffffff804c1c82: 0 48 ff 04 c8 incq (%rax,%rcx,8)
ffffffff804c1c86: 0 c7 85 6c 05 00 00 00 movl $0x0,0x56c(%rbp)
ffffffff804c1c8d: 0 00 00 00
ffffffff804c1c90: 0 8b 85 fc 03 00 00 mov 0x3fc(%rbp),%eax
ffffffff804c1c96: 0 80 bd 78 03 00 00 01 cmpb $0x1,0x378(%rbp)
ffffffff804c1c9d: 0 89 85 70 05 00 00 mov %eax,0x570(%rbp)
ffffffff804c1ca3: 0 8b 85 00 04 00 00 mov 0x400(%rbp),%eax
ffffffff804c1ca9: 0 89 85 78 05 00 00 mov %eax,0x578(%rbp)
ffffffff804c1caf: 0 8b 85 78 04 00 00 mov 0x478(%rbp),%eax
ffffffff804c1cb5: 0 89 85 7c 05 00 00 mov %eax,0x57c(%rbp)
ffffffff804c1cbb: 0 77 3b ja ffffffff804c1cf8 <tcp_ack+0x11e1>
ffffffff804c1cbd: 0 83 7c 24 58 00 cmpl $0x0,0x58(%rsp)
ffffffff804c1cc2: 0 75 0e jne ffffffff804c1cd2 <tcp_ack+0x11bb>
ffffffff804c1cc4: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1cc7: 0 e8 f0 cb ff ff callq ffffffff804be8bc <tcp_current_ssthresh>
ffffffff804c1ccc: 0 89 85 6c 05 00 00 mov %eax,0x56c(%rbp)
ffffffff804c1cd2: 0 48 8b 85 60 03 00 00 mov 0x360(%rbp),%rax
ffffffff804c1cd9: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1cdc: 0 ff 50 28 callq *0x28(%rax)
ffffffff804c1cdf: 0 89 85 a8 04 00 00 mov %eax,0x4a8(%rbp)
ffffffff804c1ce5: 0 8a 85 7e 04 00 00 mov 0x47e(%rbp),%al
ffffffff804c1ceb: 0 a8 01 test $0x1,%al
ffffffff804c1ced: 0 74 09 je ffffffff804c1cf8 <tcp_ack+0x11e1>
ffffffff804c1cef: 0 83 c8 02 or $0x2,%eax
ffffffff804c1cf2: 0 88 85 7e 04 00 00 mov %al,0x47e(%rbp)
ffffffff804c1cf8: 0 c7 85 dc 04 00 00 00 movl $0x0,0x4dc(%rbp)
ffffffff804c1cff: 0 00 00 00
ffffffff804c1d02: 0 c7 85 b0 04 00 00 00 movl $0x0,0x4b0(%rbp)
ffffffff804c1d09: 0 00 00 00
ffffffff804c1d0c: 0 be 03 00 00 00 mov $0x3,%esi
ffffffff804c1d11: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1d14: 0 bb 01 00 00 00 mov $0x1,%ebx
ffffffff804c1d19: 0 e8 5b cb ff ff callq ffffffff804be879 <tcp_set_ca_state>
ffffffff804c1d1e: 0 eb 02 jmp ffffffff804c1d22 <tcp_ack+0x120b>
ffffffff804c1d20: 0 31 db xor %ebx,%ebx
ffffffff804c1d22: 0 45 85 ed test %r13d,%r13d
ffffffff804c1d25: 0 75 21 jne ffffffff804c1d48 <tcp_ack+0x1231>
ffffffff804c1d27: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1d2d: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1d30: 0 a8 02 test $0x2,%al
ffffffff804c1d32: 0 0f 84 0b 01 00 00 je ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1d38: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1d3b: 0 e8 31 d2 ff ff callq ffffffff804bef71 <tcp_head_timedout>
ffffffff804c1d40: 0 85 c0 test %eax,%eax
ffffffff804c1d42: 0 0f 84 fb 00 00 00 je ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1d48: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1d4e: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1d51: 0 75 07 jne ffffffff804c1d5a <tcp_ack+0x1243>
ffffffff804c1d53: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c1d58: 0 eb 31 jmp ffffffff804c1d8b <tcp_ack+0x1274>
ffffffff804c1d5a: 0 a8 02 test $0x2,%al
ffffffff804c1d5c: 0 8a 85 7f 04 00 00 mov 0x47f(%rbp),%al
ffffffff804c1d62: 0 74 17 je ffffffff804c1d7b <tcp_ack+0x1264>
ffffffff804c1d64: 0 8b b5 d4 04 00 00 mov 0x4d4(%rbp),%esi
ffffffff804c1d6a: 0 0f b6 c0 movzbl %al,%eax
ffffffff804c1d6d: 0 29 c6 sub %eax,%esi
ffffffff804c1d6f: 0 b8 01 00 00 00 mov $0x1,%eax
ffffffff804c1d74: 0 85 f6 test %esi,%esi
ffffffff804c1d76: 0 0f 4e f0 cmovle %eax,%esi
ffffffff804c1d79: 0 eb 10 jmp ffffffff804c1d8b <tcp_ack+0x1274>
ffffffff804c1d7b: 0 8b b5 d0 04 00 00 mov 0x4d0(%rbp),%esi
ffffffff804c1d81: 0 0f b6 c0 movzbl %al,%eax
ffffffff804c1d84: 0 29 c6 sub %eax,%esi
ffffffff804c1d86: 0 39 f3 cmp %esi,%ebx
ffffffff804c1d88: 0 0f 4d f3 cmovge %ebx,%esi
ffffffff804c1d8b: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1d8e: 0 e8 7e e0 ff ff callq ffffffff804bfe11 <tcp_mark_head_lost>
ffffffff804c1d93: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1d99: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1d9c: 0 a8 02 test $0x2,%al
ffffffff804c1d9e: 0 0f 84 9f 00 00 00 je ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1da4: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1da7: 0 e8 c5 d1 ff ff callq ffffffff804bef71 <tcp_head_timedout>
ffffffff804c1dac: 0 85 c0 test %eax,%eax
ffffffff804c1dae: 0 0f 84 8f 00 00 00 je ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1db4: 0 48 8b 85 e8 04 00 00 mov 0x4e8(%rbp),%rax
ffffffff804c1dbb: 0 48 85 c0 test %rax,%rax
ffffffff804c1dbe: 0 48 89 c3 mov %rax,%rbx
ffffffff804c1dc1: 0 75 42 jne ffffffff804c1e05 <tcp_ack+0x12ee>
ffffffff804c1dc3: 0 48 8b 9d c0 00 00 00 mov 0xc0(%rbp),%rbx
ffffffff804c1dca: 0 48 8d 85 c0 00 00 00 lea 0xc0(%rbp),%rax
ffffffff804c1dd1: 0 48 39 c3 cmp %rax,%rbx
ffffffff804c1dd4: 0 75 2f jne ffffffff804c1e05 <tcp_ack+0x12ee>
ffffffff804c1dd6: 0 31 db xor %ebx,%ebx
ffffffff804c1dd8: 0 eb 2b jmp ffffffff804c1e05 <tcp_ack+0x12ee>
ffffffff804c1dda: 0 48 3b 9d d8 01 00 00 cmp 0x1d8(%rbp),%rbx
ffffffff804c1de1: 0 74 34 je ffffffff804c1e17 <tcp_ack+0x1300>
ffffffff804c1de3: 0 48 8b 05 96 7a 3f 00 mov 0x3f7a96(%rip),%rax # ffffffff808b9880 <jiffies>
ffffffff804c1dea: 0 2b 43 58 sub 0x58(%rbx),%eax
ffffffff804c1ded: 0 3b 85 58 03 00 00 cmp 0x358(%rbp),%eax
ffffffff804c1df3: 0 76 22 jbe ffffffff804c1e17 <tcp_ack+0x1300>
ffffffff804c1df5: 0 48 89 de mov %rbx,%rsi
ffffffff804c1df8: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1dfb: 0 e8 28 d0 ff ff callq ffffffff804bee28 <tcp_skb_mark_lost>
ffffffff804c1e00: 0 48 8b 1b mov (%rbx),%rbx
ffffffff804c1e03: 0 eb 07 jmp ffffffff804c1e0c <tcp_ack+0x12f5>
ffffffff804c1e05: 0 4c 8d ad c0 00 00 00 lea 0xc0(%rbp),%r13
ffffffff804c1e0c: 0 48 8b 03 mov (%rbx),%rax
ffffffff804c1e0f: 0 4c 39 eb cmp %r13,%rbx
ffffffff804c1e12: 0 0f 18 08 prefetcht0 (%rax)
ffffffff804c1e15: 0 75 c3 jne ffffffff804c1dda <tcp_ack+0x12c3>
ffffffff804c1e17: 0 8b 85 cc 04 00 00 mov 0x4cc(%rbp),%eax
ffffffff804c1e1d: 0 03 85 d0 04 00 00 add 0x4d0(%rbp),%eax
ffffffff804c1e23: 0 3b 85 74 04 00 00 cmp 0x474(%rbp),%eax
ffffffff804c1e29: 0 48 89 9d e8 04 00 00 mov %rbx,0x4e8(%rbp)
ffffffff804c1e30: 0 76 11 jbe ffffffff804c1e43 <tcp_ack+0x132c>
ffffffff804c1e32: 0 be e5 08 00 00 mov $0x8e5,%esi
ffffffff804c1e37: 0 48 c7 c7 9d d9 6a 80 mov $0xffffffff806ad99d,%rdi
ffffffff804c1e3e: 0 e8 72 43 d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804c1e43: 0 44 89 e6 mov %r12d,%esi
ffffffff804c1e46: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1e49: 0 e8 9f d1 ff ff callq ffffffff804befed <tcp_cwnd_down>
ffffffff804c1e4e: 0 e9 2c 01 00 00 jmpq ffffffff804c1f7f <tcp_ack+0x1468>
ffffffff804c1e53: 47 8b 74 24 1c mov 0x1c(%rsp),%esi
ffffffff804c1e57: 513 48 89 ef mov %rbp,%rdi
ffffffff804c1e5a: 0 e8 62 d4 ff ff callq ffffffff804bf2c1 <tcp_cong_avoid>
ffffffff804c1e5f: 427 41 80 e4 34 and $0x34,%r12b
ffffffff804c1e63: 1234 75 07 jne ffffffff804c1e6c <tcp_ack+0x1355>
ffffffff804c1e65: 0 83 7c 24 54 00 cmpl $0x0,0x54(%rsp)
ffffffff804c1e6a: 0 75 3c jne ffffffff804c1ea8 <tcp_ack+0x1391>
ffffffff804c1e6c: 0 48 8b 7d 78 mov 0x78(%rbp),%rdi
ffffffff804c1e70: 916 e8 8d c9 ff ff callq ffffffff804be802 <dst_confirm>
ffffffff804c1e75: 3 eb 31 jmp ffffffff804c1ea8 <tcp_ack+0x1391>
ffffffff804c1e77: 0 48 8b 95 d8 01 00 00 mov 0x1d8(%rbp),%rdx
ffffffff804c1e7e: 99 48 85 d2 test %rdx,%rdx
ffffffff804c1e81: 16 74 25 je ffffffff804c1ea8 <tcp_ack+0x1391>
ffffffff804c1e83: 0 8b 85 44 04 00 00 mov 0x444(%rbp),%eax
ffffffff804c1e89: 0 03 85 00 04 00 00 add 0x400(%rbp),%eax
ffffffff804c1e8f: 0 3b 42 54 cmp 0x54(%rdx),%eax
ffffffff804c1e92: 0 78 1e js ffffffff804c1eb2 <tcp_ack+0x139b>
ffffffff804c1e94: 0 c6 85 7b 03 00 00 00 movb $0x0,0x37b(%rbp)
ffffffff804c1e9b: 0 be 03 00 00 00 mov $0x3,%esi
ffffffff804c1ea0: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1ea3: 0 e8 ab c9 ff ff callq ffffffff804be853 <inet_csk_clear_xmit_timer>
ffffffff804c1ea8: 520 b8 01 00 00 00 mov $0x1,%eax
ffffffff804c1ead: 994 e9 ec 00 00 00 jmpq ffffffff804c1f9e <tcp_ack+0x1487>
ffffffff804c1eb2: 0 0f b6 8d 7b 03 00 00 movzbl 0x37b(%rbp),%ecx
ffffffff804c1eb9: 0 8b 95 58 03 00 00 mov 0x358(%rbp),%edx
ffffffff804c1ebf: 0 b8 30 75 00 00 mov $0x7530,%eax
ffffffff804c1ec4: 0 be 03 00 00 00 mov $0x3,%esi
ffffffff804c1ec9: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1ecc: 0 d3 e2 shl %cl,%edx
ffffffff804c1ece: 0 b9 30 75 00 00 mov $0x7530,%ecx
ffffffff804c1ed3: 0 81 fa 30 75 00 00 cmp $0x7530,%edx
ffffffff804c1ed9: 0 0f 47 d0 cmova %eax,%edx
ffffffff804c1edc: 0 89 d2 mov %edx,%edx
ffffffff804c1ede: 0 e8 00 d7 ff ff callq ffffffff804bf5e3 <inet_csk_reset_xmit_timer>
ffffffff804c1ee3: 0 eb c3 jmp ffffffff804c1ea8 <tcp_ack+0x1391>
ffffffff804c1ee5: 0 80 78 25 00 cmpb $0x0,0x25(%rax)
ffffffff804c1ee9: 0 74 1a je ffffffff804c1f05 <tcp_ack+0x13ee>
ffffffff804c1eeb: 0 8b 54 24 18 mov 0x18(%rsp),%edx
ffffffff804c1eef: 0 e8 cc e3 ff ff callq ffffffff804c02c0 <tcp_sacktag_write_queue>
ffffffff804c1ef4: 0 80 bd 78 03 00 00 00 cmpb $0x0,0x378(%rbp)
ffffffff804c1efb: 0 75 08 jne ffffffff804c1f05 <tcp_ack+0x13ee>
ffffffff804c1efd: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1f00: 0 e8 68 d3 ff ff callq ffffffff804bf26d <tcp_try_keep_open>
ffffffff804c1f05: 0 48 85 ed test %rbp,%rbp
ffffffff804c1f08: 0 74 2f je ffffffff804c1f39 <tcp_ack+0x1422>
ffffffff804c1f0a: 0 be 0a 00 00 00 mov $0xa,%esi
ffffffff804c1f0f: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1f12: 0 e8 f1 d5 ff ff callq ffffffff804bf508 <sock_flag>
ffffffff804c1f17: 0 85 c0 test %eax,%eax
ffffffff804c1f19: 0 74 1e je ffffffff804c1f39 <tcp_ack+0x1422>
ffffffff804c1f1b: 0 8b 8d fc 03 00 00 mov 0x3fc(%rbp),%ecx
ffffffff804c1f21: 0 8b 95 00 04 00 00 mov 0x400(%rbp),%edx
ffffffff804c1f27: 0 48 c7 c7 e5 d9 6a 80 mov $0xffffffff806ad9e5,%rdi
ffffffff804c1f2e: 0 8b 74 24 1c mov 0x1c(%rsp),%esi
ffffffff804c1f32: 0 31 c0 xor %eax,%eax
ffffffff804c1f34: 0 e8 3b 4e d7 ff callq ffffffff80236d74 <printk>
ffffffff804c1f39: 0 31 c0 xor %eax,%eax
ffffffff804c1f3b: 0 eb 61 jmp ffffffff804c1f9e <tcp_ack+0x1487>
ffffffff804c1f3d: 0 c7 44 24 44 00 00 00 movl $0x0,0x44(%rsp)
ffffffff804c1f44: 0 00
ffffffff804c1f45: 0 e9 c3 ef ff ff jmpq ffffffff804c0f0d <tcp_ack+0x3f6>
ffffffff804c1f4a: 54 41 f6 c4 04 test $0x4,%r12b
ffffffff804c1f4e: 424 0f 84 0b ff ff ff je ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1f54: 364 85 c9 test %ecx,%ecx
ffffffff804c1f56: 0 0f 84 f7 fe ff ff je ffffffff804c1e53 <tcp_ack+0x133c>
ffffffff804c1f5c: 0 e9 fe fe ff ff jmpq ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1f61: 0 8a 85 9c 04 00 00 mov 0x49c(%rbp),%al
ffffffff804c1f67: 0 c0 e8 04 shr $0x4,%al
ffffffff804c1f6a: 0 a8 02 test $0x2,%al
ffffffff804c1f6c: 0 0f 85 10 f8 ff ff jne ffffffff804c1782 <tcp_ack+0xc6b>
ffffffff804c1f72: 0 e9 61 f8 ff ff jmpq ffffffff804c17d8 <tcp_ack+0xcc1>
ffffffff804c1f77: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1f7a: 0 e8 2e d0 ff ff callq ffffffff804befad <tcp_moderate_cwnd>
ffffffff804c1f7f: 0 48 89 ef mov %rbp,%rdi
ffffffff804c1f82: 0 e8 f7 47 00 00 callq ffffffff804c677e <tcp_xmit_retransmit_queue>
ffffffff804c1f87: 0 e9 d3 fe ff ff jmpq ffffffff804c1e5f <tcp_ack+0x1348>
ffffffff804c1f8c: 0 80 bd 78 03 00 00 01 cmpb $0x1,0x378(%rbp)
ffffffff804c1f93: 0 0f 87 be fc ff ff ja ffffffff804c1c57 <tcp_ack+0x1140>
ffffffff804c1f99: 0 e9 7b fc ff ff jmpq ffffffff804c1c19 <tcp_ack+0x1102>
ffffffff804c1f9e: 493 48 81 c4 88 00 00 00 add $0x88,%rsp
ffffffff804c1fa5: 1288 5b pop %rbx
ffffffff804c1fa6: 0 5d pop %rbp
ffffffff804c1fa7: 446 41 5c pop %r12
ffffffff804c1fa9: 0 41 5d pop %r13
ffffffff804c1fab: 2 41 5e pop %r14
ffffffff804c1fad: 447 41 5f pop %r15
ffffffff804c1faf: 0 c3 retq

No really obvious single-instruction hotspots that I can see.

But I can see another problem: the function is too large and its flow
is not fall-through in any way. As you can see from the profile
distribution, it is broken into 25-30 separate code sequences.

The function consists of more than 1200 instructions and is 5200 bytes
in size. According to the profile above, only about 350 of those
instructions are executed; the remaining ~850 are never used by this
workload. So in theory this function should only take up ~1.5K of the
instruction cache (350 instructions at the average of ~4.3 bytes per
instruction).

But because execution is spread out over 25+ smaller pieces, it takes
up ~4K of the instruction cache instead (there's a single ~1.2K hole
in the middle, which I subtracted) - 2-3 times larger than it should
be.

So this code could make good use of the (brand-new ;-) branch-tracer
ftrace plugin and grow a few well-placed likely()/unlikely()
annotations - at least for this workload, I think.
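
As a reminder, likely()/unlikely() are the <linux/compiler.h> wrappers
around gcc's __builtin_expect(), which tells the compiler which arm of
a branch to lay out as the straight-line fall-through path. A minimal
standalone sketch (not actual kernel code):

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int queue_frame(int error)
{
        if (unlikely(error))    /* cold arm: gcc moves it out of line */
                return -1;

        /* hot arm: compiled as the straight-line fall-through */
        return 0;
}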

Ingo

2008-11-17 21:20:07

by Ingo Molnar

[permalink] [raw]
Subject: tcp_recvmsg(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> 100.000000 total
> ................
> 1.833688 tcp_recvmsg

hits (total: 183368)
.........
ffffffff804bd46e: 882 <tcp_recvmsg>:
ffffffff804bd46e: 882 41 57 push %r15
ffffffff804bd470: 15507 48 89 f7 mov %rsi,%rdi
ffffffff804bd473: 179 41 56 push %r14
ffffffff804bd475: 0 49 89 ce mov %rcx,%r14
ffffffff804bd478: 744 41 55 push %r13
ffffffff804bd47a: 165 41 54 push %r12
ffffffff804bd47c: 0 45 89 c4 mov %r8d,%r12d
ffffffff804bd47f: 692 55 push %rbp
ffffffff804bd480: 178 44 89 cd mov %r9d,%ebp
ffffffff804bd483: 3434 53 push %rbx
ffffffff804bd484: 685 48 89 f3 mov %rsi,%rbx
ffffffff804bd487: 11 48 83 ec 68 sub $0x68,%rsp
ffffffff804bd48b: 949 48 89 54 24 30 mov %rdx,0x30(%rsp)
ffffffff804bd490: 7 e8 e8 e8 ff ff callq ffffffff804bbd7d <lock_sock>
ffffffff804bd495: 1771 8a 43 02 mov 0x2(%rbx),%al
ffffffff804bd498: 6176 3c 0a cmp $0xa,%al
ffffffff804bd49a: 0 0f 84 3a 06 00 00 je ffffffff804bdada <tcp_recvmsg+0x66c>
ffffffff804bd4a0: 3121 31 c0 xor %eax,%eax
ffffffff804bd4a2: 195 45 85 e4 test %r12d,%r12d
ffffffff804bd4a5: 0 75 07 jne ffffffff804bd4ae <tcp_recvmsg+0x40>
ffffffff804bd4a7: 926 48 8b 83 68 01 00 00 mov 0x168(%rbx),%rax
ffffffff804bd4ae: 189 40 f6 c5 01 test $0x1,%bpl
ffffffff804bd4b2: 0 48 89 44 24 58 mov %rax,0x58(%rsp)
ffffffff804bd4b7: 819 0f 85 33 06 00 00 jne ffffffff804bdaf0 <tcp_recvmsg+0x682>
ffffffff804bd4bd: 216 89 e8 mov %ebp,%eax
ffffffff804bd4bf: 0 83 e0 02 and $0x2,%eax
ffffffff804bd4c2: 638 89 44 24 3c mov %eax,0x3c(%rsp)
ffffffff804bd4c6: 177 75 0e jne ffffffff804bd4d6 <tcp_recvmsg+0x68>
ffffffff804bd4c8: 0 48 8d 93 f4 03 00 00 lea 0x3f4(%rbx),%rdx
ffffffff804bd4cf: 661 48 89 54 24 40 mov %rdx,0x40(%rsp)
ffffffff804bd4d4: 195 eb 14 jmp ffffffff804bd4ea <tcp_recvmsg+0x7c>
ffffffff804bd4d6: 0 8b 83 f4 03 00 00 mov 0x3f4(%rbx),%eax
ffffffff804bd4dc: 0 48 8d 4c 24 60 lea 0x60(%rsp),%rcx
ffffffff804bd4e1: 0 48 89 4c 24 40 mov %rcx,0x40(%rsp)
ffffffff804bd4e6: 0 89 44 24 60 mov %eax,0x60(%rsp)
ffffffff804bd4ea: 867 89 ee mov %ebp,%esi
ffffffff804bd4ec: 210 44 89 f2 mov %r14d,%edx
ffffffff804bd4ef: 0 48 89 df mov %rbx,%rdi
ffffffff804bd4f2: 894 81 e6 00 01 00 00 and $0x100,%esi
ffffffff804bd4f8: 192 45 31 ff xor %r15d,%r15d
ffffffff804bd4fb: 0 e8 fc df ff ff callq ffffffff804bb4fc <sock_rcvlowat>
ffffffff804bd500: 853 89 44 24 4c mov %eax,0x4c(%rsp)
ffffffff804bd504: 1857 48 8d 83 a8 00 00 00 lea 0xa8(%rbx),%rax
ffffffff804bd50b: 0 89 e9 mov %ebp,%ecx
ffffffff804bd50d: 595 48 8d 93 10 04 00 00 lea 0x410(%rbx),%rdx
ffffffff804bd514: 263 83 e1 22 and $0x22,%ecx
ffffffff804bd517: 0 83 e5 20 and $0x20,%ebp
ffffffff804bd51a: 601 48 89 44 24 28 mov %rax,0x28(%rsp)
ffffffff804bd51f: 254 48 8d 83 f8 04 00 00 lea 0x4f8(%rbx),%rax
ffffffff804bd526: 2 48 c7 44 24 50 00 00 movq $0x0,0x50(%rsp)
ffffffff804bd52d: 0 00 00
ffffffff804bd52f: 578 48 89 54 24 20 mov %rdx,0x20(%rsp)
ffffffff804bd534: 290 89 4c 24 1c mov %ecx,0x1c(%rsp)
ffffffff804bd538: 1 48 89 44 24 10 mov %rax,0x10(%rsp)
ffffffff804bd53d: 593 89 6c 24 0c mov %ebp,0xc(%rsp)
ffffffff804bd541: 568 66 83 bb 7c 04 00 00 cmpw $0x0,0x47c(%rbx)
ffffffff804bd548: 0 00
ffffffff804bd549: 3956 74 55 je ffffffff804bd5a0 <tcp_recvmsg+0x132>
ffffffff804bd54b: 0 48 8b 54 24 40 mov 0x40(%rsp),%rdx
ffffffff804bd550: 0 8b 83 84 05 00 00 mov 0x584(%rbx),%eax
ffffffff804bd556: 0 3b 02 cmp (%rdx),%eax
ffffffff804bd558: 0 75 46 jne ffffffff804bd5a0 <tcp_recvmsg+0x132>
ffffffff804bd55a: 0 45 85 ff test %r15d,%r15d
ffffffff804bd55d: 0 0f 85 e6 04 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd563: 0 65 48 8b 3c 25 00 00 mov %gs:0x0,%rdi
ffffffff804bd56a: 0 00 00
ffffffff804bd56c: 0 e8 4c e1 ff ff callq ffffffff804bb6bd <signal_pending>
ffffffff804bd571: 0 85 c0 test %eax,%eax
ffffffff804bd573: 0 74 2b je ffffffff804bd5a0 <tcp_recvmsg+0x132>
ffffffff804bd575: 0 48 8b 54 24 58 mov 0x58(%rsp),%rdx
ffffffff804bd57a: 0 41 bf f5 ff ff ff mov $0xfffffff5,%r15d
ffffffff804bd580: 0 48 85 d2 test %rdx,%rdx
ffffffff804bd583: 0 0f 84 c0 04 00 00 je ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd589: 0 48 b8 ff ff ff ff ff mov $0x7fffffffffffffff,%rax
ffffffff804bd590: 0 ff ff 7f
ffffffff804bd593: 0 66 41 bf 00 fe mov $0xfe00,%r15w
ffffffff804bd598: 0 48 39 c2 cmp %rax,%rdx
ffffffff804bd59b: 0 e9 89 01 00 00 jmpq ffffffff804bd729 <tcp_recvmsg+0x2bb>
ffffffff804bd5a0: 597 48 8b ab a8 00 00 00 mov 0xa8(%rbx),%rbp
ffffffff804bd5a7: 4601 48 3b 6c 24 28 cmp 0x28(%rsp),%rbp
ffffffff804bd5ac: 1 b8 00 00 00 00 mov $0x0,%eax
ffffffff804bd5b1: 1769 48 0f 44 e8 cmove %rax,%rbp
ffffffff804bd5b5: 473 48 85 ed test %rbp,%rbp
ffffffff804bd5b8: 0 74 76 je ffffffff804bd630 <tcp_recvmsg+0x1c2>
ffffffff804bd5ba: 595 48 8b 4c 24 40 mov 0x40(%rsp),%rcx
ffffffff804bd5bf: 897 8b 55 50 mov 0x50(%rbp),%edx
ffffffff804bd5c2: 89 8b 31 mov (%rcx),%esi
ffffffff804bd5c4: 581 41 89 f5 mov %esi,%r13d
ffffffff804bd5c7: 301 41 29 d5 sub %edx,%r13d
ffffffff804bd5ca: 33 79 10 jns ffffffff804bd5dc <tcp_recvmsg+0x16e>
ffffffff804bd5cc: 0 48 c7 c7 48 d9 6a 80 mov $0xffffffff806ad948,%rdi
ffffffff804bd5d3: 0 31 c0 xor %eax,%eax
ffffffff804bd5d5: 0 e8 9a 97 d7 ff callq ffffffff80236d74 <printk>
ffffffff804bd5da: 0 eb 54 jmp ffffffff804bd630 <tcp_recvmsg+0x1c2>
ffffffff804bd5dc: 584 8b 85 b8 00 00 00 mov 0xb8(%rbp),%eax
ffffffff804bd5e2: 1061 48 8b 95 d0 00 00 00 mov 0xd0(%rbp),%rdx
ffffffff804bd5e9: 1 8a 54 02 0d mov 0xd(%rdx,%rax,1),%dl
ffffffff804bd5ed: 0 88 d0 mov %dl,%al
ffffffff804bd5ef: 876 83 e0 02 and $0x2,%eax
ffffffff804bd5f2: 0 3c 01 cmp $0x1,%al
ffffffff804bd5f4: 0 8b 45 68 mov 0x68(%rbp),%eax
ffffffff804bd5f7: 909 41 83 d5 ff adc $0xffffffffffffffff,%r13d
ffffffff804bd5fb: 0 41 39 c5 cmp %eax,%r13d
ffffffff804bd5fe: 0 0f 82 df 02 00 00 jb ffffffff804bd8e3 <tcp_recvmsg+0x475>
ffffffff804bd604: 0 80 e2 01 and $0x1,%dl
ffffffff804bd607: 0 0f 85 16 04 00 00 jne ffffffff804bda23 <tcp_recvmsg+0x5b5>
ffffffff804bd60d: 0 83 7c 24 3c 00 cmpl $0x0,0x3c(%rsp)
ffffffff804bd612: 0 75 11 jne ffffffff804bd625 <tcp_recvmsg+0x1b7>
ffffffff804bd614: 0 be 53 05 00 00 mov $0x553,%esi
ffffffff804bd619: 0 48 c7 c7 13 d9 6a 80 mov $0xffffffff806ad913,%rdi
ffffffff804bd620: 0 e8 90 8b d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804bd625: 0 48 8b 6d 00 mov 0x0(%rbp),%rbp
ffffffff804bd629: 0 48 3b 6c 24 28 cmp 0x28(%rsp),%rbp
ffffffff804bd62e: 0 75 85 jne ffffffff804bd5b5 <tcp_recvmsg+0x147>
ffffffff804bd630: 80 44 3b 7c 24 4c cmp 0x4c(%rsp),%r15d
ffffffff804bd635: 4164 7c 0b jl ffffffff804bd642 <tcp_recvmsg+0x1d4>
ffffffff804bd637: 0 48 83 7b 68 00 cmpq $0x0,0x68(%rbx)
ffffffff804bd63c: 0 0f 84 07 04 00 00 je ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd642: 1 45 85 ff test %r15d,%r15d
ffffffff804bd645: 3438 74 49 je ffffffff804bd690 <tcp_recvmsg+0x222>
ffffffff804bd647: 0 83 bb 44 01 00 00 00 cmpl $0x0,0x144(%rbx)
ffffffff804bd64e: 0 0f 85 f5 03 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd654: 0 8a 43 02 mov 0x2(%rbx),%al
ffffffff804bd657: 0 3c 07 cmp $0x7,%al
ffffffff804bd659: 0 0f 84 ea 03 00 00 je ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd65f: 0 f6 43 38 01 testb $0x1,0x38(%rbx)
ffffffff804bd663: 0 0f 85 e0 03 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd669: 0 48 83 7c 24 58 00 cmpq $0x0,0x58(%rsp)
ffffffff804bd66f: 0 0f 84 d4 03 00 00 je ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd675: 0 65 48 8b 3c 25 00 00 mov %gs:0x0,%rdi
ffffffff804bd67c: 0 00 00
ffffffff804bd67e: 0 e8 3a e0 ff ff callq ffffffff804bb6bd <signal_pending>
ffffffff804bd683: 0 85 c0 test %eax,%eax
ffffffff804bd685: 0 0f 84 ac 00 00 00 je ffffffff804bd737 <tcp_recvmsg+0x2c9>
ffffffff804bd68b: 0 e9 b9 03 00 00 jmpq ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd690: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804bd695: 4166 48 89 df mov %rbx,%rdi
ffffffff804bd698: 0 e8 7b de ff ff callq ffffffff804bb518 <sock_flag>
ffffffff804bd69d: 0 85 c0 test %eax,%eax
ffffffff804bd69f: 276 0f 85 a4 03 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6a5: 126 83 bb 44 01 00 00 00 cmpl $0x0,0x144(%rbx)
ffffffff804bd6ac: 0 74 10 je ffffffff804bd6be <tcp_recvmsg+0x250>
ffffffff804bd6ae: 0 48 89 df mov %rbx,%rdi
ffffffff804bd6b1: 0 e8 00 df ff ff callq ffffffff804bb5b6 <sock_error>
ffffffff804bd6b6: 0 41 89 c7 mov %eax,%r15d
ffffffff804bd6b9: 0 e9 8b 03 00 00 jmpq ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6be: 112 f6 43 38 01 testb $0x1,0x38(%rbx)
ffffffff804bd6c2: 3451 0f 85 81 03 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6c8: 497 8a 43 02 mov 0x2(%rbx),%al
ffffffff804bd6cb: 0 3c 07 cmp $0x7,%al
ffffffff804bd6cd: 113 75 20 jne ffffffff804bd6ef <tcp_recvmsg+0x281>
ffffffff804bd6cf: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804bd6d4: 0 48 89 df mov %rbx,%rdi
ffffffff804bd6d7: 0 e8 3c de ff ff callq ffffffff804bb518 <sock_flag>
ffffffff804bd6dc: 0 85 c0 test %eax,%eax
ffffffff804bd6de: 0 0f 85 65 03 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6e4: 0 41 bf 95 ff ff ff mov $0xffffff95,%r15d
ffffffff804bd6ea: 0 e9 5a 03 00 00 jmpq ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd6ef: 118 48 83 7c 24 58 00 cmpq $0x0,0x58(%rsp)
ffffffff804bd6f5: 398 75 0b jne ffffffff804bd702 <tcp_recvmsg+0x294>
ffffffff804bd6f7: 0 41 bf f5 ff ff ff mov $0xfffffff5,%r15d
ffffffff804bd6fd: 0 e9 47 03 00 00 jmpq ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd702: 0 65 48 8b 3c 25 00 00 mov %gs:0x0,%rdi
ffffffff804bd709: 0 00 00
ffffffff804bd70b: 2993 e8 ad df ff ff callq ffffffff804bb6bd <signal_pending>
ffffffff804bd710: 200 85 c0 test %eax,%eax
ffffffff804bd712: 0 74 23 je ffffffff804bd737 <tcp_recvmsg+0x2c9>
ffffffff804bd714: 0 48 b8 ff ff ff ff ff mov $0x7fffffffffffffff,%rax
ffffffff804bd71b: 0 ff ff 7f
ffffffff804bd71e: 0 48 39 44 24 58 cmp %rax,0x58(%rsp)
ffffffff804bd723: 0 41 bf 00 fe ff ff mov $0xfffffe00,%r15d
ffffffff804bd729: 0 b8 fc ff ff ff mov $0xfffffffc,%eax
ffffffff804bd72e: 0 44 0f 45 f8 cmovne %eax,%r15d
ffffffff804bd732: 0 e9 12 03 00 00 jmpq ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd737: 207 44 89 fe mov %r15d,%esi
ffffffff804bd73a: 198 48 89 df mov %rbx,%rdi
ffffffff804bd73d: 0 e8 cc e9 ff ff callq ffffffff804bc10e <tcp_cleanup_rbuf>
ffffffff804bd742: 227 83 3d 9b ad 3f 00 00 cmpl $0x0,0x3fad9b(%rip) # ffffffff808b84e4 <sysctl_tcp_low_latency>
ffffffff804bd749: 210 0f 85 81 00 00 00 jne ffffffff804bd7d0 <tcp_recvmsg+0x362>
ffffffff804bd74f: 0 48 8b ab 28 04 00 00 mov 0x428(%rbx),%rbp
ffffffff804bd756: 0 48 3b 6c 24 50 cmp 0x50(%rsp),%rbp
ffffffff804bd75b: 232 75 73 jne ffffffff804bd7d0 <tcp_recvmsg+0x362>
ffffffff804bd75d: 0 48 83 7c 24 50 00 cmpq $0x0,0x50(%rsp)
ffffffff804bd763: 7 75 27 jne ffffffff804bd78c <tcp_recvmsg+0x31e>
ffffffff804bd765: 229 83 7c 24 1c 00 cmpl $0x0,0x1c(%rsp)
ffffffff804bd76a: 30 75 20 jne ffffffff804bd78c <tcp_recvmsg+0x31e>
ffffffff804bd76c: 7 48 8b 54 24 30 mov 0x30(%rsp),%rdx
ffffffff804bd771: 191 65 48 8b 2c 25 00 00 mov %gs:0x0,%rbp
ffffffff804bd778: 0 00 00
ffffffff804bd77a: 12 48 89 ab 28 04 00 00 mov %rbp,0x428(%rbx)
ffffffff804bd781: 2617 48 8b 42 10 mov 0x10(%rdx),%rax
ffffffff804bd785: 670 48 89 83 30 04 00 00 mov %rax,0x430(%rbx)
ffffffff804bd78c: 11 8b 83 f4 03 00 00 mov 0x3f4(%rbx),%eax
ffffffff804bd792: 188 3b 83 f0 03 00 00 cmp 0x3f0(%rbx),%eax
ffffffff804bd798: 166 44 89 b3 3c 04 00 00 mov %r14d,0x43c(%rbx)
ffffffff804bd79f: 5 74 18 je ffffffff804bd7b9 <tcp_recvmsg+0x34b>
ffffffff804bd7a1: 0 83 7c 24 1c 00 cmpl $0x0,0x1c(%rsp)
ffffffff804bd7a6: 0 75 11 jne ffffffff804bd7b9 <tcp_recvmsg+0x34b>
ffffffff804bd7a8: 0 be 92 05 00 00 mov $0x592,%esi
ffffffff804bd7ad: 0 48 c7 c7 13 d9 6a 80 mov $0xffffffff806ad913,%rdi
ffffffff804bd7b4: 0 e8 fc 89 d7 ff callq ffffffff802361b5 <warn_on_slowpath>
ffffffff804bd7b9: 336 48 8b 4c 24 20 mov 0x20(%rsp),%rcx
ffffffff804bd7be: 302 48 39 8b 10 04 00 00 cmp %rcx,0x410(%rbx)
ffffffff804bd7c5: 1176 48 89 6c 24 50 mov %rbp,0x50(%rsp)
ffffffff804bd7ca: 244 0f 85 81 00 00 00 jne ffffffff804bd851 <tcp_recvmsg+0x3e3>
ffffffff804bd7d0: 135 44 3b 7c 24 4c cmp 0x4c(%rsp),%r15d
ffffffff804bd7d5: 112 7c 12 jl ffffffff804bd7e9 <tcp_recvmsg+0x37b>
ffffffff804bd7d7: 0 48 89 df mov %rbx,%rdi
ffffffff804bd7da: 0 e8 57 7f fc ff callq ffffffff80485736 <release_sock>
ffffffff804bd7df: 0 48 89 df mov %rbx,%rdi
ffffffff804bd7e2: 0 e8 96 e5 ff ff callq ffffffff804bbd7d <lock_sock>
ffffffff804bd7e7: 0 eb 0d jmp ffffffff804bd7f6 <tcp_recvmsg+0x388>
ffffffff804bd7e9: 152 48 8d 74 24 58 lea 0x58(%rsp),%rsi
ffffffff804bd7ee: 563 48 89 df mov %rbx,%rdi
ffffffff804bd7f1: 59 e8 83 99 fc ff callq ffffffff80487179 <sk_wait_data>
ffffffff804bd7f6: 86 48 83 7c 24 50 00 cmpq $0x0,0x50(%rsp)
ffffffff804bd7fc: 8550 0f 84 8a 00 00 00 je ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd802: 4038 44 89 f1 mov %r14d,%ecx
ffffffff804bd805: 900 2b 8b 3c 04 00 00 sub 0x43c(%rbx),%ecx
ffffffff804bd80b: 5 74 28 je ffffffff804bd835 <tcp_recvmsg+0x3c7>
ffffffff804bd80d: 0 48 8b 05 ac 3e 5f 00 mov 0x5f3eac(%rip),%rax # ffffffff80ab16c0 <init_net+0xf0>
ffffffff804bd814: 1 41 01 cf add %ecx,%r15d
ffffffff804bd817: 0 65 8b 14 25 24 00 00 mov %gs:0x24,%edx
ffffffff804bd81e: 0 00
ffffffff804bd81f: 0 89 d2 mov %edx,%edx
ffffffff804bd821: 0 48 f7 d0 not %rax
ffffffff804bd824: 0 48 8b 04 d0 mov (%rax,%rdx,8),%rax
ffffffff804bd828: 0 48 63 d1 movslq %ecx,%rdx
ffffffff804bd82b: 0 49 29 d6 sub %rdx,%r14
ffffffff804bd82e: 0 48 01 90 b8 00 00 00 add %rdx,0xb8(%rax)
ffffffff804bd835: 4 8b 83 f0 03 00 00 mov 0x3f0(%rbx),%eax
ffffffff804bd83b: 373 3b 83 f4 03 00 00 cmp 0x3f4(%rbx),%eax
ffffffff804bd841: 3604 75 49 jne ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd843: 0 48 8b 44 24 20 mov 0x20(%rsp),%rax
ffffffff804bd848: 971 48 39 83 10 04 00 00 cmp %rax,0x410(%rbx)
ffffffff804bd84f: 11 74 3b je ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd851: 6 48 89 df mov %rbx,%rdi
ffffffff804bd854: 267 e8 94 e6 ff ff callq ffffffff804bbeed <tcp_prequeue_process>
ffffffff804bd859: 0 44 89 f1 mov %r14d,%ecx
ffffffff804bd85c: 879 2b 8b 3c 04 00 00 sub 0x43c(%rbx),%ecx
ffffffff804bd862: 256 74 28 je ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd864: 0 48 8b 05 55 3e 5f 00 mov 0x5f3e55(%rip),%rax # ffffffff80ab16c0 <init_net+0xf0>
ffffffff804bd86b: 116 41 01 cf add %ecx,%r15d
ffffffff804bd86e: 17 65 8b 14 25 24 00 00 mov %gs:0x24,%edx
ffffffff804bd875: 0 00
ffffffff804bd876: 0 89 d2 mov %edx,%edx
ffffffff804bd878: 1 48 f7 d0 not %rax
ffffffff804bd87b: 5 48 8b 04 d0 mov (%rax,%rdx,8),%rax
ffffffff804bd87f: 0 48 63 d1 movslq %ecx,%rdx
ffffffff804bd882: 6 49 29 d6 sub %rdx,%r14
ffffffff804bd885: 7 48 01 90 c0 00 00 00 add %rdx,0xc0(%rax)
ffffffff804bd88c: 11 83 7c 24 3c 00 cmpl $0x0,0x3c(%rsp)
ffffffff804bd891: 438 0f 84 a9 01 00 00 je ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bd897: 0 8b 44 24 60 mov 0x60(%rsp),%eax
ffffffff804bd89b: 0 3b 83 f4 03 00 00 cmp 0x3f4(%rbx),%eax
ffffffff804bd8a1: 0 0f 84 99 01 00 00 je ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bd8a7: 0 e8 19 ad fd ff callq ffffffff804985c5 <net_ratelimit>
ffffffff804bd8ac: 0 85 c0 test %eax,%eax
ffffffff804bd8ae: 0 74 24 je ffffffff804bd8d4 <tcp_recvmsg+0x466>
ffffffff804bd8b0: 0 65 48 8b 34 25 00 00 mov %gs:0x0,%rsi
ffffffff804bd8b7: 0 00 00
ffffffff804bd8b9: 0 8b 96 70 01 00 00 mov 0x170(%rsi),%edx
ffffffff804bd8bf: 0 48 c7 c7 6a d9 6a 80 mov $0xffffffff806ad96a,%rdi
ffffffff804bd8c6: 0 48 81 c6 68 03 00 00 add $0x368,%rsi
ffffffff804bd8cd: 0 31 c0 xor %eax,%eax
ffffffff804bd8cf: 0 e8 a0 94 d7 ff callq ffffffff80236d74 <printk>
ffffffff804bd8d4: 0 8b 83 f4 03 00 00 mov 0x3f4(%rbx),%eax
ffffffff804bd8da: 0 89 44 24 60 mov %eax,0x60(%rsp)
ffffffff804bd8de: 0 e9 5d 01 00 00 jmpq ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bd8e3: 4077 44 29 e8 sub %r13d,%eax
ffffffff804bd8e6: 6031 4d 89 f4 mov %r14,%r12
ffffffff804bd8e9: 0 4c 39 f0 cmp %r14,%rax
ffffffff804bd8ec: 0 4c 0f 46 e0 cmovbe %rax,%r12
ffffffff804bd8f0: 934 66 83 bb 7c 04 00 00 cmpw $0x0,0x47c(%rbx)
ffffffff804bd8f7: 0 00
ffffffff804bd8f8: 0 74 38 je ffffffff804bd932 <tcp_recvmsg+0x4c4>
ffffffff804bd8fa: 0 8b 83 84 05 00 00 mov 0x584(%rbx),%eax
ffffffff804bd900: 0 29 f0 sub %esi,%eax
ffffffff804bd902: 0 89 c2 mov %eax,%edx
ffffffff804bd904: 0 4c 39 e2 cmp %r12,%rdx
ffffffff804bd907: 0 73 29 jae ffffffff804bd932 <tcp_recvmsg+0x4c4>
ffffffff804bd909: 0 85 c0 test %eax,%eax
ffffffff804bd90b: 0 74 05 je ffffffff804bd912 <tcp_recvmsg+0x4a4>
ffffffff804bd90d: 0 49 89 d4 mov %rdx,%r12
ffffffff804bd910: 0 eb 20 jmp ffffffff804bd932 <tcp_recvmsg+0x4c4>
ffffffff804bd912: 0 be 02 00 00 00 mov $0x2,%esi
ffffffff804bd917: 0 48 89 df mov %rbx,%rdi
ffffffff804bd91a: 0 e8 f9 db ff ff callq ffffffff804bb518 <sock_flag>
ffffffff804bd91f: 0 85 c0 test %eax,%eax
ffffffff804bd921: 0 75 0f jne ffffffff804bd932 <tcp_recvmsg+0x4c4>
ffffffff804bd923: 0 48 8b 54 24 40 mov 0x40(%rsp),%rdx
ffffffff804bd928: 0 41 ff c5 inc %r13d
ffffffff804bd92b: 0 ff 02 incl (%rdx)
ffffffff804bd92d: 0 49 ff cc dec %r12
ffffffff804bd930: 0 74 4c je ffffffff804bd97e <tcp_recvmsg+0x510>
ffffffff804bd932: 906 83 7c 24 0c 00 cmpl $0x0,0xc(%rsp)
ffffffff804bd937: 6039 75 2f jne ffffffff804bd968 <tcp_recvmsg+0x4fa>
ffffffff804bd939: 48 48 8b 4c 24 30 mov 0x30(%rsp),%rcx
ffffffff804bd93e: 1412 44 89 ee mov %r13d,%esi
ffffffff804bd941: 6648 48 89 ef mov %rbp,%rdi
ffffffff804bd944: 0 48 8b 51 10 mov 0x10(%rcx),%rdx
ffffffff804bd948: 1524 44 89 e1 mov %r12d,%ecx
ffffffff804bd94b: 167 e8 c5 d3 fc ff callq ffffffff8048ad15 <skb_copy_datagram_iovec>
ffffffff804bd950: 0 85 c0 test %eax,%eax
ffffffff804bd952: 1038 74 14 je ffffffff804bd968 <tcp_recvmsg+0x4fa>
ffffffff804bd954: 0 45 85 ff test %r15d,%r15d
ffffffff804bd957: 0 0f 85 ec 00 00 00 jne ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd95d: 0 41 bf f2 ff ff ff mov $0xfffffff2,%r15d
ffffffff804bd963: 0 e9 e1 00 00 00 jmpq ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bd968: 28 48 8b 54 24 40 mov 0x40(%rsp),%rdx
ffffffff804bd96d: 5713 48 89 df mov %rbx,%rdi
ffffffff804bd970: 241 45 01 e7 add %r12d,%r15d
ffffffff804bd973: 27 4d 29 e6 sub %r12,%r14
ffffffff804bd976: 626 44 01 22 add %r12d,(%rdx)
ffffffff804bd979: 221 e8 fe 11 00 00 callq ffffffff804beb7c <tcp_rcv_space_adjust>
ffffffff804bd97e: 1425 66 83 bb 7c 04 00 00 cmpw $0x0,0x47c(%rbx)
ffffffff804bd985: 0 00
ffffffff804bd986: 3430 74 63 je ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd988: 0 8b 8b f4 03 00 00 mov 0x3f4(%rbx),%ecx
ffffffff804bd98e: 0 39 8b 84 05 00 00 cmp %ecx,0x584(%rbx)
ffffffff804bd994: 0 79 55 jns ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd996: 0 48 8b 44 24 10 mov 0x10(%rsp),%rax
ffffffff804bd99b: 0 48 39 83 f8 04 00 00 cmp %rax,0x4f8(%rbx)
ffffffff804bd9a2: 0 66 c7 83 7c 04 00 00 movw $0x0,0x47c(%rbx)
ffffffff804bd9a9: 0 00 00
ffffffff804bd9ab: 0 75 3e jne ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd9ad: 0 83 bb c0 04 00 00 00 cmpl $0x0,0x4c0(%rbx)
ffffffff804bd9b4: 0 74 35 je ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd9b6: 0 8b 83 94 00 00 00 mov 0x94(%rbx),%eax
ffffffff804bd9bc: 0 3b 43 3c cmp 0x3c(%rbx),%eax
ffffffff804bd9bf: 0 7d 2a jge ffffffff804bd9eb <tcp_recvmsg+0x57d>
ffffffff804bd9c1: 0 0f b7 83 e8 03 00 00 movzwl 0x3e8(%rbx),%eax
ffffffff804bd9c8: 0 8a 8b 9d 04 00 00 mov 0x49d(%rbx),%cl
ffffffff804bd9ce: 0 8b 93 44 04 00 00 mov 0x444(%rbx),%edx
ffffffff804bd9d4: 0 83 e1 0f and $0xf,%ecx
ffffffff804bd9d7: 0 c1 e0 1a shl $0x1a,%eax
ffffffff804bd9da: 0 d3 ea shr %cl,%edx
ffffffff804bd9dc: 0 09 d0 or %edx,%eax
ffffffff804bd9de: 0 0d 00 00 10 00 or $0x100000,%eax
ffffffff804bd9e3: 0 0f c8 bswap %eax
ffffffff804bd9e5: 0 89 83 ec 03 00 00 mov %eax,0x3ec(%rbx)
ffffffff804bd9eb: 0 8b 55 68 mov 0x68(%rbp),%edx
ffffffff804bd9ee: 1655 44 89 e8 mov %r13d,%eax
ffffffff804bd9f1: 32 4c 01 e0 add %r12,%rax
ffffffff804bd9f4: 0 48 39 d0 cmp %rdx,%rax
ffffffff804bd9f7: 847 72 47 jb ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bd9f9: 0 8b 95 b8 00 00 00 mov 0xb8(%rbp),%edx
ffffffff804bd9ff: 80 48 8b 85 d0 00 00 00 mov 0xd0(%rbp),%rax
ffffffff804bda06: 441 f6 44 02 0d 01 testb $0x1,0xd(%rdx,%rax,1)
ffffffff804bda0b: 0 75 16 jne ffffffff804bda23 <tcp_recvmsg+0x5b5>
ffffffff804bda0d: 0 83 7c 24 3c 00 cmpl $0x0,0x3c(%rsp)
ffffffff804bda12: 453 75 2c jne ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bda14: 0 31 d2 xor %edx,%edx
ffffffff804bda16: 0 48 89 ee mov %rbp,%rsi
ffffffff804bda19: 477 48 89 df mov %rbx,%rdi
ffffffff804bda1c: 0 e8 0f e4 ff ff callq ffffffff804bbe30 <sk_eat_skb>
ffffffff804bda21: 562 eb 1d jmp ffffffff804bda40 <tcp_recvmsg+0x5d2>
ffffffff804bda23: 0 48 8b 54 24 40 mov 0x40(%rsp),%rdx
ffffffff804bda28: 0 ff 02 incl (%rdx)
ffffffff804bda2a: 0 83 7c 24 3c 00 cmpl $0x0,0x3c(%rsp)
ffffffff804bda2f: 0 75 18 jne ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bda31: 0 31 d2 xor %edx,%edx
ffffffff804bda33: 0 48 89 ee mov %rbp,%rsi
ffffffff804bda36: 0 48 89 df mov %rbx,%rdi
ffffffff804bda39: 0 e8 f2 e3 ff ff callq ffffffff804bbe30 <sk_eat_skb>
ffffffff804bda3e: 0 eb 09 jmp ffffffff804bda49 <tcp_recvmsg+0x5db>
ffffffff804bda40: 959 4d 85 f6 test %r14,%r14
ffffffff804bda43: 4766 0f 85 f8 fa ff ff jne ffffffff804bd541 <tcp_recvmsg+0xd3>
ffffffff804bda49: 217 48 83 7c 24 50 00 cmpq $0x0,0x50(%rsp)
ffffffff804bda4f: 2084 74 71 je ffffffff804bdac2 <tcp_recvmsg+0x654>
ffffffff804bda51: 40 48 8d 83 10 04 00 00 lea 0x410(%rbx),%rax
ffffffff804bda58: 448 48 39 83 10 04 00 00 cmp %rax,0x410(%rbx)
ffffffff804bda5f: 4 74 4c je ffffffff804bdaad <tcp_recvmsg+0x63f>
ffffffff804bda61: 0 31 c0 xor %eax,%eax
ffffffff804bda63: 0 45 85 ff test %r15d,%r15d
ffffffff804bda66: 0 48 89 df mov %rbx,%rdi
ffffffff804bda69: 0 41 0f 4f c6 cmovg %r14d,%eax
ffffffff804bda6d: 0 89 83 3c 04 00 00 mov %eax,0x43c(%rbx)
ffffffff804bda73: 0 e8 75 e4 ff ff callq ffffffff804bbeed <tcp_prequeue_process>
ffffffff804bda78: 0 45 85 ff test %r15d,%r15d
ffffffff804bda7b: 0 7e 30 jle ffffffff804bdaad <tcp_recvmsg+0x63f>
ffffffff804bda7d: 0 44 89 f1 mov %r14d,%ecx
ffffffff804bda80: 0 2b 8b 3c 04 00 00 sub 0x43c(%rbx),%ecx
ffffffff804bda86: 0 74 25 je ffffffff804bdaad <tcp_recvmsg+0x63f>
ffffffff804bda88: 0 48 8b 05 31 3c 5f 00 mov 0x5f3c31(%rip),%rax # ffffffff80ab16c0 <init_net+0xf0>
ffffffff804bda8f: 0 41 01 cf add %ecx,%r15d
ffffffff804bda92: 0 65 8b 14 25 24 00 00 mov %gs:0x24,%edx
ffffffff804bda99: 0 00
ffffffff804bda9a: 0 89 d2 mov %edx,%edx
ffffffff804bda9c: 0 48 f7 d0 not %rax
ffffffff804bda9f: 0 48 8b 14 d0 mov (%rax,%rdx,8),%rdx
ffffffff804bdaa3: 0 48 63 c1 movslq %ecx,%rax
ffffffff804bdaa6: 0 48 01 82 c0 00 00 00 add %rax,0xc0(%rdx)
ffffffff804bdaad: 214 48 c7 83 28 04 00 00 movq $0x0,0x428(%rbx)
ffffffff804bdab4: 0 00 00 00 00
ffffffff804bdab8: 1530 c7 83 3c 04 00 00 00 movl $0x0,0x43c(%rbx)
ffffffff804bdabf: 0 00 00 00
ffffffff804bdac2: 1135 48 89 df mov %rbx,%rdi
ffffffff804bdac5: 3909 44 89 fe mov %r15d,%esi
ffffffff804bdac8: 0 e8 41 e6 ff ff callq ffffffff804bc10e <tcp_cleanup_rbuf>
ffffffff804bdacd: 1724 48 89 df mov %rbx,%rdi
ffffffff804bdad0: 932 e8 61 7c fc ff callq ffffffff80485736 <release_sock>
ffffffff804bdad5: 4661 e9 12 01 00 00 jmpq ffffffff804bdbec <tcp_recvmsg+0x77e>
ffffffff804bdada: 0 41 bc 95 ff ff ff mov $0xffffff95,%r12d
ffffffff804bdae0: 0 48 89 df mov %rbx,%rdi
ffffffff804bdae3: 0 45 89 e7 mov %r12d,%r15d
ffffffff804bdae6: 0 e8 4b 7c fc ff callq ffffffff80485736 <release_sock>
ffffffff804bdaeb: 0 e9 fc 00 00 00 jmpq ffffffff804bdbec <tcp_recvmsg+0x77e>
ffffffff804bdaf0: 0 be 02 00 00 00 mov $0x2,%esi
ffffffff804bdaf5: 0 48 89 df mov %rbx,%rdi
ffffffff804bdaf8: 0 e8 1b da ff ff callq ffffffff804bb518 <sock_flag>
ffffffff804bdafd: 0 85 c0 test %eax,%eax
ffffffff804bdaff: 0 0f 85 d4 00 00 00 jne ffffffff804bdbd9 <tcp_recvmsg+0x76b>
ffffffff804bdb05: 0 8b 83 7c 04 00 00 mov 0x47c(%rbx),%eax
ffffffff804bdb0b: 0 66 85 c0 test %ax,%ax
ffffffff804bdb0e: 0 0f 84 c5 00 00 00 je ffffffff804bdbd9 <tcp_recvmsg+0x76b>
ffffffff804bdb14: 0 66 3d 00 04 cmp $0x400,%ax
ffffffff804bdb18: 0 0f 84 bb 00 00 00 je ffffffff804bdbd9 <tcp_recvmsg+0x76b>
ffffffff804bdb1e: 0 8a 43 02 mov 0x2(%rbx),%al
ffffffff804bdb21: 0 3c 07 cmp $0x7,%al
ffffffff804bdb23: 0 75 17 jne ffffffff804bdb3c <tcp_recvmsg+0x6ce>
ffffffff804bdb25: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804bdb2a: 0 48 89 df mov %rbx,%rdi
ffffffff804bdb2d: 0 41 bc 95 ff ff ff mov $0xffffff95,%r12d
ffffffff804bdb33: 0 e8 e0 d9 ff ff callq ffffffff804bb518 <sock_flag>
ffffffff804bdb38: 0 85 c0 test %eax,%eax
ffffffff804bdb3a: 0 74 a4 je ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdb3c: 0 8b 83 7c 04 00 00 mov 0x47c(%rbx),%eax
ffffffff804bdb42: 0 f6 c4 01 test $0x1,%ah
ffffffff804bdb45: 0 74 79 je ffffffff804bdbc0 <tcp_recvmsg+0x752>
ffffffff804bdb47: 0 40 f6 c5 02 test $0x2,%bpl
ffffffff804bdb4b: 0 88 44 24 67 mov %al,0x67(%rsp)
ffffffff804bdb4f: 0 75 09 jne ffffffff804bdb5a <tcp_recvmsg+0x6ec>
ffffffff804bdb51: 0 66 c7 83 7c 04 00 00 movw $0x400,0x47c(%rbx)
ffffffff804bdb58: 0 00 04
ffffffff804bdb5a: 0 48 8b 4c 24 30 mov 0x30(%rsp),%rcx
ffffffff804bdb5f: 0 45 89 f4 mov %r14d,%r12d
ffffffff804bdb62: 0 8b 51 30 mov 0x30(%rcx),%edx
ffffffff804bdb65: 0 89 d0 mov %edx,%eax
ffffffff804bdb67: 0 83 c8 01 or $0x1,%eax
ffffffff804bdb6a: 0 45 85 f6 test %r14d,%r14d
ffffffff804bdb6d: 0 89 41 30 mov %eax,0x30(%rcx)
ffffffff804bdb70: 0 7e 33 jle ffffffff804bdba5 <tcp_recvmsg+0x737>
ffffffff804bdb72: 0 40 80 e5 20 and $0x20,%bpl
ffffffff804bdb76: 0 41 bc 01 00 00 00 mov $0x1,%r12d
ffffffff804bdb7c: 0 0f 85 5e ff ff ff jne ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdb82: 0 48 8b 79 10 mov 0x10(%rcx),%rdi
ffffffff804bdb86: 0 48 8d 74 24 67 lea 0x67(%rsp),%rsi
ffffffff804bdb8b: 0 ba 01 00 00 00 mov $0x1,%edx
ffffffff804bdb90: 0 41 bc f2 ff ff ff mov $0xfffffff2,%r12d
ffffffff804bdb96: 0 e8 8a cb fc ff callq ffffffff8048a725 <memcpy_toiovec>
ffffffff804bdb9b: 0 85 c0 test %eax,%eax
ffffffff804bdb9d: 0 0f 85 3d ff ff ff jne ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdba3: 0 eb 10 jmp ffffffff804bdbb5 <tcp_recvmsg+0x747>
ffffffff804bdba5: 0 48 8b 44 24 30 mov 0x30(%rsp),%rax
ffffffff804bdbaa: 0 83 ca 21 or $0x21,%edx
ffffffff804bdbad: 0 89 50 30 mov %edx,0x30(%rax)
ffffffff804bdbb0: 0 e9 2b ff ff ff jmpq ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbb5: 0 41 bc 01 00 00 00 mov $0x1,%r12d
ffffffff804bdbbb: 0 e9 20 ff ff ff jmpq ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbc0: 0 8a 43 02 mov 0x2(%rbx),%al
ffffffff804bdbc3: 0 3c 07 cmp $0x7,%al
ffffffff804bdbc5: 0 74 1d je ffffffff804bdbe4 <tcp_recvmsg+0x776>
ffffffff804bdbc7: 0 f6 43 38 01 testb $0x1,0x38(%rbx)
ffffffff804bdbcb: 0 41 bc f5 ff ff ff mov $0xfffffff5,%r12d
ffffffff804bdbd1: 0 0f 84 09 ff ff ff je ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbd7: 0 eb 0b jmp ffffffff804bdbe4 <tcp_recvmsg+0x776>
ffffffff804bdbd9: 0 41 bc ea ff ff ff mov $0xffffffea,%r12d
ffffffff804bdbdf: 0 e9 fc fe ff ff jmpq ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbe4: 0 45 31 e4 xor %r12d,%r12d
ffffffff804bdbe7: 0 e9 f4 fe ff ff jmpq ffffffff804bdae0 <tcp_recvmsg+0x672>
ffffffff804bdbec: 1206 48 83 c4 68 add $0x68,%rsp
ffffffff804bdbf0: 498 44 89 f8 mov %r15d,%eax
ffffffff804bdbf3: 387 5b pop %rbx
ffffffff804bdbf4: 462 5d pop %rbp
ffffffff804bdbf5: 0 41 5c pop %r12
ffffffff804bdbf7: 485 41 5d pop %r13
ffffffff804bdbf9: 466 41 5e pop %r14
ffffffff804bdbfb: 0 41 5f pop %r15
ffffffff804bdbfd: 796 c3 retq

No real hotspots here either - but the code sequence is a bit too
fractured, so this function's icache footprint, too, is probably
double the size of what it could be.

A bit of overhead (8%) leaks in from a callsite:

ffffffff804bd46e: 882 41 57 push %r15
ffffffff804bd470: 15507 48 89 f7 mov %rsi,%rdi

(This function is also invoked via a dynamic function pointer, so I'm
just guessing that the common callsite would be sock_common_recvmsg().)
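
For reference, sock_common_recvmsg() in this era looks roughly like
the sketch below (net/core/sock.c, quoted from memory) - the
sk->sk_prot->recvmsg() indirect call is the dynamic dispatch in
question:

int sock_common_recvmsg(struct kiocb *iocb, struct socket *sock,
                        struct msghdr *msg, size_t size, int flags)
{
        struct sock *sk = sock->sk;
        int addr_len = 0;
        int err;

        /* indirect call - resolves to tcp_recvmsg() for TCP sockets */
        err = sk->sk_prot->recvmsg(iocb, sk, msg, size,
                                   flags & MSG_DONTWAIT,
                                   flags & ~MSG_DONTWAIT, &addr_len);
        if (err >= 0)
                msg->msg_namelen = addr_len;
        return err;
}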

Perhaps this sequence, about 7% of the total overhead of this
function, warrants a mention:

ffffffff804bd7e2: 0 e8 96 e5 ff ff callq ffffffff804bbd7d <lock_sock>
ffffffff804bd7e7: 0 eb 0d jmp ffffffff804bd7f6 <tcp_recvmsg+0x388>
ffffffff804bd7e9: 152 48 8d 74 24 58 lea 0x58(%rsp),%rsi
ffffffff804bd7ee: 563 48 89 df mov %rbx,%rdi
ffffffff804bd7f1: 59 e8 83 99 fc ff callq ffffffff80487179 <sk_wait_data>
ffffffff804bd7f6: 86 48 83 7c 24 50 00 cmpq $0x0,0x50(%rsp)
ffffffff804bd7fc: 8550 0f 84 8a 00 00 00 je ffffffff804bd88c <tcp_recvmsg+0x41e>
ffffffff804bd802: 4038 44 89 f1 mov %r14d,%ecx

That's most likely lock_sock[_nested]()'s overhead leaking over into
this function:

ffffffff804857cb: 9392 <lock_sock_nested>:
ffffffff804857cb: 9392 41 55 push %r13
ffffffff804857cd: 4112 41 54 push %r12
ffffffff804857cf: 2 55 push %rbp
ffffffff804857d0: 7 48 8d 6f 40 lea 0x40(%rdi),%rbp
ffffffff804857d4: 1515 53 push %rbx
ffffffff804857d5: 0 48 89 fb mov %rdi,%rbx
ffffffff804857d8: 4 48 89 ef mov %rbp,%rdi
ffffffff804857db: 1461 48 83 ec 38 sub $0x38,%rsp
ffffffff804857df: 8 e8 78 11 09 00 callq ffffffff8051695c <_spin_lock_bh>
ffffffff804857e4: 4827 83 7b 44 00 cmpl $0x0,0x44(%rbx)
ffffffff804857e8: 2937 74 6d je ffffffff80485857 <lock_sock_nested+0x8c>
ffffffff804857ea: 0 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx
ffffffff804857f1: 0 00 00
ffffffff804857f3: 0 fc cld
ffffffff804857f4: 0 31 c0 xor %eax,%eax
ffffffff804857f6: 0 48 89 e7 mov %rsp,%rdi
ffffffff804857f9: 0 b9 0a 00 00 00 mov $0xa,%ecx
ffffffff804857fe: 0 f3 ab rep stos %eax,%es:(%rdi)
ffffffff80485800: 0 48 8d 44 24 18 lea 0x18(%rsp),%rax
ffffffff80485805: 0 4c 8d 63 48 lea 0x48(%rbx),%r12
ffffffff80485809: 0 48 89 54 24 08 mov %rdx,0x8(%rsp)
ffffffff8048580e: 0 48 c7 44 24 10 80 78 movq $0xffffffff80247880,0x10(%rsp)
ffffffff80485815: 0 24 80
ffffffff80485817: 0 48 89 44 24 18 mov %rax,0x18(%rsp)
ffffffff8048581c: 0 48 89 44 24 20 mov %rax,0x20(%rsp)
ffffffff80485821: 0 ba 02 00 00 00 mov $0x2,%edx
ffffffff80485826: 0 48 89 e6 mov %rsp,%rsi
ffffffff80485829: 0 4c 89 e7 mov %r12,%rdi
ffffffff8048582c: 0 e8 fd 20 dc ff callq ffffffff8024792e <prepare_to_wait_exclusive>
ffffffff80485831: 0 48 89 ef mov %rbp,%rdi
ffffffff80485834: 0 e8 18 11 09 00 callq ffffffff80516951 <_spin_unlock_bh>
ffffffff80485839: 0 e8 52 f9 08 00 callq ffffffff80515190 <schedule>
ffffffff8048583e: 0 48 89 ef mov %rbp,%rdi
ffffffff80485841: 0 e8 16 11 09 00 callq ffffffff8051695c <_spin_lock_bh>
ffffffff80485846: 0 83 7b 44 00 cmpl $0x0,0x44(%rbx)
ffffffff8048584a: 0 75 d5 jne ffffffff80485821 <lock_sock_nested+0x56>
ffffffff8048584c: 0 48 89 e6 mov %rsp,%rsi
ffffffff8048584f: 0 4c 89 e7 mov %r12,%rdi
ffffffff80485852: 0 e8 7a 20 dc ff callq ffffffff802478d1 <finish_wait>
ffffffff80485857: 88 c7 43 44 01 00 00 00 movl $0x1,0x44(%rbx)
ffffffff8048585e: 3431 fe 43 40 incb 0x40(%rbx)
ffffffff80485861: 1568 e8 00 4e db ff callq ffffffff8023a666 <local_bh_enable>
ffffffff80485866: 1548 48 83 c4 38 add $0x38,%rsp
ffffffff8048586a: 61 5b pop %rbx
ffffffff8048586b: 1568 5d pop %rbp
ffffffff8048586c: 36 41 5c pop %r12
ffffffff8048586e: 0 41 5d pop %r13
ffffffff80485870: 2753 c3 retq

which is:

void lock_sock_nested(struct sock *sk, int subclass)
{
        might_sleep();
        spin_lock_bh(&sk->sk_lock.slock);
        if (sk->sk_lock.owned)
                __lock_sock(sk);
        sk->sk_lock.owned = 1;
        spin_unlock(&sk->sk_lock.slock);

that branch in the middle should perhaps be:

if (unlikely(sk->sk_lock.owned))

to make this function fall-through.
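
i.e., something like this (untested sketch against the 2.6.27 source
quoted above; the lockdep annotation and local_bh_enable() at the end
are unchanged):

void lock_sock_nested(struct sock *sk, int subclass)
{
        might_sleep();
        spin_lock_bh(&sk->sk_lock.slock);
        /* the contended/owned case is the rare slowpath: */
        if (unlikely(sk->sk_lock.owned))
                __lock_sock(sk);
        sk->sk_lock.owned = 1;
        spin_unlock(&sk->sk_lock.slock);
        mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
        local_bh_enable();
}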

Ingo

2008-11-17 21:27:30

by Ingo Molnar

[permalink] [raw]
Subject: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> 100.000000 total
> ................
> 1.717771 eth_type_trans

hits (total: 171777)
.........
ffffffff8049e215: 457 <eth_type_trans>:
ffffffff8049e215: 457 41 54 push %r12
ffffffff8049e217: 6514 55 push %rbp
ffffffff8049e218: 0 48 89 f5 mov %rsi,%rbp
ffffffff8049e21b: 0 53 push %rbx
ffffffff8049e21c: 441 48 8b 87 d8 00 00 00 mov 0xd8(%rdi),%rax
ffffffff8049e223: 5 48 89 fb mov %rdi,%rbx
ffffffff8049e226: 0 2b 87 d0 00 00 00 sub 0xd0(%rdi),%eax
ffffffff8049e22c: 493 48 89 73 20 mov %rsi,0x20(%rbx)
ffffffff8049e230: 2 be 0e 00 00 00 mov $0xe,%esi
ffffffff8049e235: 0 89 87 c0 00 00 00 mov %eax,0xc0(%rdi)
ffffffff8049e23b: 472 e8 2c 98 fe ff callq ffffffff80487a6c <skb_pull>
ffffffff8049e240: 501 44 8b a3 c0 00 00 00 mov 0xc0(%rbx),%r12d
ffffffff8049e247: 763 4c 03 a3 d0 00 00 00 add 0xd0(%rbx),%r12
ffffffff8049e24e: 0 41 f6 04 24 01 testb $0x1,(%r12)
ffffffff8049e253: 497 74 26 je ffffffff8049e27b <eth_type_trans+0x66>
ffffffff8049e255: 0 48 8d b5 38 02 00 00 lea 0x238(%rbp),%rsi
ffffffff8049e25c: 0 4c 89 e7 mov %r12,%rdi
ffffffff8049e25f: 0 e8 49 fc ff ff callq ffffffff8049dead <compare_ether_addr>
ffffffff8049e264: 0 85 c0 test %eax,%eax
ffffffff8049e266: 0 8a 43 7d mov 0x7d(%rbx),%al
ffffffff8049e269: 0 75 08 jne ffffffff8049e273 <eth_type_trans+0x5e>
ffffffff8049e26b: 0 83 e0 f8 and $0xfffffffffffffff8,%eax
ffffffff8049e26e: 0 83 c8 01 or $0x1,%eax
ffffffff8049e271: 0 eb 24 jmp ffffffff8049e297 <eth_type_trans+0x82>
ffffffff8049e273: 0 83 e0 f8 and $0xfffffffffffffff8,%eax
ffffffff8049e276: 0 83 c8 02 or $0x2,%eax
ffffffff8049e279: 0 eb 1c jmp ffffffff8049e297 <eth_type_trans+0x82>
ffffffff8049e27b: 82 48 8d b5 18 02 00 00 lea 0x218(%rbp),%rsi
ffffffff8049e282: 8782 4c 89 e7 mov %r12,%rdi
ffffffff8049e285: 1752 e8 23 fc ff ff callq ffffffff8049dead <compare_ether_addr>
ffffffff8049e28a: 0 85 c0 test %eax,%eax
ffffffff8049e28c: 757 74 0c je ffffffff8049e29a <eth_type_trans+0x85>
ffffffff8049e28e: 0 8a 43 7d mov 0x7d(%rbx),%al
ffffffff8049e291: 0 83 e0 f8 and $0xfffffffffffffff8,%eax
ffffffff8049e294: 0 83 c8 03 or $0x3,%eax
ffffffff8049e297: 0 88 43 7d mov %al,0x7d(%rbx)
ffffffff8049e29a: 107 66 41 8b 44 24 0c mov 0xc(%r12),%ax
ffffffff8049e2a0: 1031 0f b7 c8 movzwl %ax,%ecx
ffffffff8049e2a3: 518 66 c1 e8 08 shr $0x8,%ax
ffffffff8049e2a7: 0 89 ca mov %ecx,%edx
ffffffff8049e2a9: 0 c1 e2 08 shl $0x8,%edx
ffffffff8049e2ac: 484 09 d0 or %edx,%eax
ffffffff8049e2ae: 0 0f b7 c0 movzwl %ax,%eax
ffffffff8049e2b1: 0 3d ff 05 00 00 cmp $0x5ff,%eax
ffffffff8049e2b6: 468 7f 18 jg ffffffff8049e2d0 <eth_type_trans+0xbb>
ffffffff8049e2b8: 0 48 8b 83 d8 00 00 00 mov 0xd8(%rbx),%rax
ffffffff8049e2bf: 0 b9 00 01 00 00 mov $0x100,%ecx
ffffffff8049e2c4: 0 66 83 38 ff cmpw $0xffffffffffffffff,(%rax)
ffffffff8049e2c8: 0 b8 00 04 00 00 mov $0x400,%eax
ffffffff8049e2cd: 0 0f 45 c8 cmovne %eax,%ecx
ffffffff8049e2d0: 0 5b pop %rbx
ffffffff8049e2d1: 85064 5d pop %rbp
ffffffff8049e2d2: 63776 41 5c pop %r12
ffffffff8049e2d4: 1 89 c8 mov %ecx,%eax
ffffffff8049e2d6: 474 c3 retq

small function, big bang - 1.7% of the total overhead.

90% of this function's cost is in the closing sequence. My guess would
be that it originates from ffffffff8049e2ae (the branch after that is
not taken), which corresponds to this source code context:

(gdb) list *0xffffffff8049e2ae
0xffffffff8049e2ae is in eth_type_trans (net/ethernet/eth.c:199).
194         if (netdev_uses_dsa_tags(dev))
195                 return htons(ETH_P_DSA);
196         if (netdev_uses_trailer_tags(dev))
197                 return htons(ETH_P_TRAILER);
198
199         if (ntohs(eth->h_proto) >= 1536)
200                 return eth->h_proto;
201
202         rawp = skb->data;
203

eth->h_proto access.

Given that this workload does localhost networking, my guess would be
that eth->h_proto is bouncing around between 16 CPUs? At minimum this
read-mostly field should be separated from the bouncing bits.

Ingo

2008-11-17 21:35:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28



On Mon, 17 Nov 2008, Ingo Molnar wrote:
>
> this function _really_ hurts from a 16-bit op:
>
> ffffffff8048943e: 6503 66 c7 83 a8 00 00 00 movw $0x0,0xa8(%rbx)
> ffffffff80489445: 0 00 00
> ffffffff80489447: 174101 5b pop %rbx

I don't think that is it, actually. The 16-bit store just before it had a
zero count, even though anything that executes the second one will always
execute the first one too.

The fact is, x86 profiles are subtle at an instruction level, and you tend
to get profile hits _after_ the instruction that caused the cost because
an interrupt (even an NMI) is always delayed to the next instruction (the
one that didn't complete). And since the core will execute out-of-order,
you don't even know what that one is, since there could easily be
branches, but even in the absence of branches you have many instructions
executing together.

For example, in many situations the two 16-bit stores will happily execute
together, and what you see may simply be a cache miss on the line that was
stored to. The store buffer needs to resolve the read of the "pop" in
order to complete, so having a big count in between stores and a
subsequent load is not all that unlikely.

So doing per-instruction profiling is not useful unless you start looking
at what preceded the instruction, and because of the out-of-order nature,
you really almost have to look for cache misses or branch mispredicts.

One common reason for such a big count on an instruction that looks
perfectly simple is often that there is a branch to that instruction that
was mispredicted. Or that there was an instruction that was costly _long_
before, and that other instructions were in the shadow of that one
completing (ie they had actually completed first, but didn't retire until
the earlier instruction did).

So you really should never just look at the previous instruction or
anything as simplistic as that. The time of in-order execution is long
past.

Linus

2008-11-17 21:36:04

by Ingo Molnar

[permalink] [raw]
Subject: __inet_lookup_established(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> 100.000000 total
> ................
> 1.673249 __inet_lookup_established

hits (total: 167324)
.........
ffffffff804b9b12: 446 <__inet_lookup_established>:
ffffffff804b9b12: 446 41 57 push %r15
ffffffff804b9b14: 4810 89 d0 mov %edx,%eax
ffffffff804b9b16: 0 0f b7 c9 movzwl %cx,%ecx
ffffffff804b9b19: 0 41 56 push %r14
ffffffff804b9b1b: 456 41 55 push %r13
ffffffff804b9b1d: 0 41 54 push %r12
ffffffff804b9b1f: 0 55 push %rbp
ffffffff804b9b20: 427 53 push %rbx
ffffffff804b9b21: 4 48 89 f3 mov %rsi,%rbx
ffffffff804b9b24: 2 44 89 c6 mov %r8d,%esi
ffffffff804b9b27: 504 41 89 c8 mov %ecx,%r8d
ffffffff804b9b2a: 1 49 89 f7 mov %rsi,%r15
ffffffff804b9b2d: 1 48 83 ec 08 sub $0x8,%rsp
ffffffff804b9b31: 462 49 c1 e7 20 shl $0x20,%r15
ffffffff804b9b35: 0 48 89 3c 24 mov %rdi,(%rsp)
ffffffff804b9b39: 507 89 d7 mov %edx,%edi
ffffffff804b9b3b: 38 41 0f b7 d1 movzwl %r9w,%edx
ffffffff804b9b3f: 0 41 89 d6 mov %edx,%r14d
ffffffff804b9b42: 863 49 09 c7 or %rax,%r15
ffffffff804b9b45: 24 41 c1 e6 10 shl $0x10,%r14d
ffffffff804b9b49: 0 41 09 ce or %ecx,%r14d
ffffffff804b9b4c: 479 89 f9 mov %edi,%ecx
ffffffff804b9b4e: 8 48 8b 3c 24 mov (%rsp),%rdi
ffffffff804b9b52: 0 e8 cc f4 ff ff callq ffffffff804b9023 <inet_ehashfn>
ffffffff804b9b57: 413 48 89 df mov %rbx,%rdi
ffffffff804b9b5a: 122 41 89 c5 mov %eax,%r13d
ffffffff804b9b5d: 0 89 c6 mov %eax,%esi
ffffffff804b9b5f: 635 e8 3e f5 ff ff callq ffffffff804b90a2 <inet_ehash_bucket>
ffffffff804b9b64: 511 48 89 c5 mov %rax,%rbp
ffffffff804b9b67: 6 44 89 e8 mov %r13d,%eax
ffffffff804b9b6a: 0 23 43 14 and 0x14(%rbx),%eax
ffffffff804b9b6d: 497 4c 8d 24 85 00 00 00 lea 0x0(,%rax,4),%r12
ffffffff804b9b74: 0 00
ffffffff804b9b75: 1 4c 03 63 08 add 0x8(%rbx),%r12
ffffffff804b9b79: 0 48 8b 45 00 mov 0x0(%rbp),%rax
ffffffff804b9b7d: 470 0f 18 08 prefetcht0 (%rax)
ffffffff804b9b80: 0 4c 89 e7 mov %r12,%rdi
ffffffff804b9b83: 1089 e8 32 cd 05 00 callq ffffffff805168ba <_read_lock>
ffffffff804b9b88: 6752 48 8b 55 00 mov 0x0(%rbp),%rdx
ffffffff804b9b8c: 598 eb 2c jmp ffffffff804b9bba <__inet_lookup_established+0xa8>
ffffffff804b9b8e: 447 48 81 3c 24 d0 15 ab cmpq $0xffffffff80ab15d0,(%rsp)
ffffffff804b9b95: 0 80
ffffffff804b9b96: 1119 75 1f jne ffffffff804b9bb7 <__inet_lookup_established+0xa5>
ffffffff804b9b98: 21 4c 39 b8 30 02 00 00 cmp %r15,0x230(%rax)
ffffffff804b9b9f: 0 75 16 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5>
ffffffff804b9ba1: 492 44 39 b0 38 02 00 00 cmp %r14d,0x238(%rax)
ffffffff804b9ba8: 0 75 0d jne ffffffff804b9bb7 <__inet_lookup_established+0xa5>
ffffffff804b9baa: 0 8b 52 fc mov -0x4(%rdx),%edx
ffffffff804b9bad: 451 85 d2 test %edx,%edx
ffffffff804b9baf: 0 74 67 je ffffffff804b9c18 <__inet_lookup_established+0x106>
ffffffff804b9bb1: 0 3b 54 24 40 cmp 0x40(%rsp),%edx
ffffffff804b9bb5: 0 74 61 je ffffffff804b9c18 <__inet_lookup_established+0x106>
ffffffff804b9bb7: 0 48 89 ca mov %rcx,%rdx
ffffffff804b9bba: 402 48 85 d2 test %rdx,%rdx
ffffffff804b9bbd: 1006 74 12 je ffffffff804b9bd1 <__inet_lookup_established+0xbf>
ffffffff804b9bbf: 0 48 8d 42 f8 lea -0x8(%rdx),%rax
ffffffff804b9bc3: 821 48 8b 0a mov (%rdx),%rcx
ffffffff804b9bc6: 78 44 39 68 2c cmp %r13d,0x2c(%rax)
ffffffff804b9bca: 4 0f 18 09 prefetcht0 (%rcx)
ffffffff804b9bcd: 685 75 e8 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5>
ffffffff804b9bcf: 139502 eb bd jmp ffffffff804b9b8e <__inet_lookup_established+0x7c>
ffffffff804b9bd1: 0 48 8b 55 08 mov 0x8(%rbp),%rdx
ffffffff804b9bd5: 0 eb 26 jmp ffffffff804b9bfd <__inet_lookup_established+0xeb>
ffffffff804b9bd7: 0 48 81 3c 24 d0 15 ab cmpq $0xffffffff80ab15d0,(%rsp)
ffffffff804b9bde: 0 80
ffffffff804b9bdf: 0 75 19 jne ffffffff804b9bfa <__inet_lookup_established+0xe8>
ffffffff804b9be1: 0 4c 39 78 40 cmp %r15,0x40(%rax)
ffffffff804b9be5: 0 75 13 jne ffffffff804b9bfa <__inet_lookup_established+0xe8>
ffffffff804b9be7: 0 44 39 70 48 cmp %r14d,0x48(%rax)
ffffffff804b9beb: 0 75 0d jne ffffffff804b9bfa <__inet_lookup_established+0xe8>
ffffffff804b9bed: 0 8b 52 fc mov -0x4(%rdx),%edx
ffffffff804b9bf0: 0 85 d2 test %edx,%edx
ffffffff804b9bf2: 0 74 24 je ffffffff804b9c18 <__inet_lookup_established+0x106>
ffffffff804b9bf4: 0 3b 54 24 40 cmp 0x40(%rsp),%edx
ffffffff804b9bf8: 0 74 1e je ffffffff804b9c18 <__inet_lookup_established+0x106>
ffffffff804b9bfa: 0 48 89 ca mov %rcx,%rdx
ffffffff804b9bfd: 0 48 85 d2 test %rdx,%rdx
ffffffff804b9c00: 0 74 12 je ffffffff804b9c14 <__inet_lookup_established+0x102>
ffffffff804b9c02: 0 48 8d 42 f8 lea -0x8(%rdx),%rax
ffffffff804b9c06: 0 48 8b 0a mov (%rdx),%rcx
ffffffff804b9c09: 0 44 39 68 2c cmp %r13d,0x2c(%rax)
ffffffff804b9c0d: 0 0f 18 09 prefetcht0 (%rcx)
ffffffff804b9c10: 0 75 e8 jne ffffffff804b9bfa <__inet_lookup_established+0xe8>
ffffffff804b9c12: 0 eb c3 jmp ffffffff804b9bd7 <__inet_lookup_established+0xc5>
ffffffff804b9c14: 0 31 c0 xor %eax,%eax
ffffffff804b9c16: 0 eb 04 jmp ffffffff804b9c1c <__inet_lookup_established+0x10a>
ffffffff804b9c18: 441 f0 ff 40 28 lock incl 0x28(%rax)
ffffffff804b9c1c: 1442 f0 41 ff 04 24 lock incl (%r12)
ffffffff804b9c21: 476 41 5b pop %r11
ffffffff804b9c23: 1 5b pop %rbx
ffffffff804b9c24: 0 5d pop %rbp
ffffffff804b9c25: 475 41 5c pop %r12
ffffffff804b9c27: 0 41 5d pop %r13
ffffffff804b9c29: 1 41 5e pop %r14
ffffffff804b9c2b: 494 41 5f pop %r15
ffffffff804b9c2d: 0 c3 retq
ffffffff804b9c2e: 0 90 nop
ffffffff804b9c2f: 0 90 nop

80% of the overhead comes from cachemisses here:

ffffffff804b9bc6: 78 44 39 68 2c cmp %r13d,0x2c(%rax)
ffffffff804b9bca: 4 0f 18 09 prefetcht0 (%rcx)
ffffffff804b9bcd: 685 75 e8 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5>
ffffffff804b9bcf: 139502 eb bd jmp ffffffff804b9b8e <__inet_lookup_established+0x7c>

corresponding to:

(gdb) list *0xffffffff804b9bc6
0xffffffff804b9bc6 is in __inet_lookup_established (net/ipv4/inet_hashtables.c:237).
232         rwlock_t *lock = inet_ehash_lockp(hashinfo, hash);
233
234         prefetch(head->chain.first);
235         read_lock(lock);
236         sk_for_each(sk, node, &head->chain) {
237                 if (INET_MATCH(sk, net, hash, acookie,
238                                saddr, daddr, ports, dif))
239                         goto hit; /* You sunk my battleship! */
240         }
241

Seeing the first hard cachemiss on hash lookups is a familiar and
partly expected pattern - it is the first thing that touches
cache-cold data structures.

Seeing 1.4% of the total tbench overhead go into this single
cachemiss is a bit surprising to me though: tbench works via
long-lived connections (TCP establish costs are nowhere to be seen in
the profiles), so the socket hash should be relatively stable and
read-mostly on most CPUs in theory. The CPUs here have 2MB of L2 cache
per socket.

Could we be somehow dirtying these cachelines perhaps, causing
unnecessary cachemisses in hash lookups? Is the hash linkage portion
of the socket data structure frequently dirtied? Padding that to 64
bytes (or next to 64 bytes worth of read-mostly fields) could perhaps
give us a +1.7% tbench speedup.
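
A hypothetical sketch of that separation (illustrative only - this is
not the real struct sock_common layout, and it is unmeasured):

#include <linux/cache.h>
#include <linux/list.h>
#include <linux/types.h>
#include <asm/atomic.h>

struct sock_lookup_demo {
        /* read-mostly fields that the INET_MATCH hash walk touches: */
        unsigned int            hash;
        struct hlist_node       node;
        __be32                  daddr;
        __be32                  rcv_saddr;
        __u32                   ports;
        /* frequently-dirtied state pushed onto its own cacheline: */
        atomic_t                refcnt ____cacheline_aligned_in_smp;
};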

Ingo

2008-11-17 21:38:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: skb_release_head_state(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Linus Torvalds <[email protected]> wrote:

> On Mon, 17 Nov 2008, Ingo Molnar wrote:
> >
> > this function _really_ hurts from a 16-bit op:
> >
> > ffffffff8048943e: 6503 66 c7 83 a8 00 00 00 movw $0x0,0xa8(%rbx)
> > ffffffff80489445: 0 00 00
> > ffffffff80489447: 174101 5b pop %rbx
>
> I don't think that is it, actually. The 16-bit store just before it
> had a zero count, even though anything that executes the second one
> will always execute the first one too.

yeah - look at the followup bits that identify the likely real source
of that overhead:

>> _But_, the real overhead probably comes from:
>>
>> ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx
>>
>> which is the next line, the ttl field:
>>
>> 373 iph->ttl = ip_select_ttl(inet, &rt->u.dst);
>>
>> this shows that we are doing a hard cachemiss on the net-localhost
>> route dst structure cacheline. We do a plain load instruction from
>> it here and get a hefty cachemiss. (because 16 CPUs are banging on
>> that single route)
>>
>> And let's make sure we see this in perspective as well: that single
>> cachemiss is _1.0 percent_ of the total tbench cost. (!) We could
>> make the scheduler 10% slower straight away and it would have less
>> of a real-life effect than this single iph->ttl field setting.

2008-11-17 21:41:01

by Eric Dumazet

[permalink] [raw]
Subject: Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> 100.000000 total
>> ................
>> 1.717771 eth_type_trans
>
> hits (total: 171777)
> .........
> ffffffff8049e215: 457 <eth_type_trans>:
> ffffffff8049e215: 457 41 54 push %r12
> ffffffff8049e217: 6514 55 push %rbp
> ffffffff8049e218: 0 48 89 f5 mov %rsi,%rbp
> ffffffff8049e21b: 0 53 push %rbx
> ffffffff8049e21c: 441 48 8b 87 d8 00 00 00 mov 0xd8(%rdi),%rax
> ffffffff8049e223: 5 48 89 fb mov %rdi,%rbx
> ffffffff8049e226: 0 2b 87 d0 00 00 00 sub 0xd0(%rdi),%eax
> ffffffff8049e22c: 493 48 89 73 20 mov %rsi,0x20(%rbx)
> ffffffff8049e230: 2 be 0e 00 00 00 mov $0xe,%esi
> ffffffff8049e235: 0 89 87 c0 00 00 00 mov %eax,0xc0(%rdi)
> ffffffff8049e23b: 472 e8 2c 98 fe ff callq ffffffff80487a6c <skb_pull>
> ffffffff8049e240: 501 44 8b a3 c0 00 00 00 mov 0xc0(%rbx),%r12d
> ffffffff8049e247: 763 4c 03 a3 d0 00 00 00 add 0xd0(%rbx),%r12
> ffffffff8049e24e: 0 41 f6 04 24 01 testb $0x1,(%r12)
> ffffffff8049e253: 497 74 26 je ffffffff8049e27b <eth_type_trans+0x66>
> ffffffff8049e255: 0 48 8d b5 38 02 00 00 lea 0x238(%rbp),%rsi
> ffffffff8049e25c: 0 4c 89 e7 mov %r12,%rdi
> ffffffff8049e25f: 0 e8 49 fc ff ff callq ffffffff8049dead <compare_ether_addr>
> ffffffff8049e264: 0 85 c0 test %eax,%eax
> ffffffff8049e266: 0 8a 43 7d mov 0x7d(%rbx),%al
> ffffffff8049e269: 0 75 08 jne ffffffff8049e273 <eth_type_trans+0x5e>
> ffffffff8049e26b: 0 83 e0 f8 and $0xfffffffffffffff8,%eax
> ffffffff8049e26e: 0 83 c8 01 or $0x1,%eax
> ffffffff8049e271: 0 eb 24 jmp ffffffff8049e297 <eth_type_trans+0x82>
> ffffffff8049e273: 0 83 e0 f8 and $0xfffffffffffffff8,%eax
> ffffffff8049e276: 0 83 c8 02 or $0x2,%eax
> ffffffff8049e279: 0 eb 1c jmp ffffffff8049e297 <eth_type_trans+0x82>
> ffffffff8049e27b: 82 48 8d b5 18 02 00 00 lea 0x218(%rbp),%rsi
> ffffffff8049e282: 8782 4c 89 e7 mov %r12,%rdi
> ffffffff8049e285: 1752 e8 23 fc ff ff callq ffffffff8049dead <compare_ether_addr>
> ffffffff8049e28a: 0 85 c0 test %eax,%eax
> ffffffff8049e28c: 757 74 0c je ffffffff8049e29a <eth_type_trans+0x85>
> ffffffff8049e28e: 0 8a 43 7d mov 0x7d(%rbx),%al
> ffffffff8049e291: 0 83 e0 f8 and $0xfffffffffffffff8,%eax
> ffffffff8049e294: 0 83 c8 03 or $0x3,%eax
> ffffffff8049e297: 0 88 43 7d mov %al,0x7d(%rbx)
> ffffffff8049e29a: 107 66 41 8b 44 24 0c mov 0xc(%r12),%ax
> ffffffff8049e2a0: 1031 0f b7 c8 movzwl %ax,%ecx
> ffffffff8049e2a3: 518 66 c1 e8 08 shr $0x8,%ax
> ffffffff8049e2a7: 0 89 ca mov %ecx,%edx
> ffffffff8049e2a9: 0 c1 e2 08 shl $0x8,%edx
> ffffffff8049e2ac: 484 09 d0 or %edx,%eax
> ffffffff8049e2ae: 0 0f b7 c0 movzwl %ax,%eax
> ffffffff8049e2b1: 0 3d ff 05 00 00 cmp $0x5ff,%eax
> ffffffff8049e2b6: 468 7f 18 jg ffffffff8049e2d0 <eth_type_trans+0xbb>
> ffffffff8049e2b8: 0 48 8b 83 d8 00 00 00 mov 0xd8(%rbx),%rax
> ffffffff8049e2bf: 0 b9 00 01 00 00 mov $0x100,%ecx
> ffffffff8049e2c4: 0 66 83 38 ff cmpw $0xffffffffffffffff,(%rax)
> ffffffff8049e2c8: 0 b8 00 04 00 00 mov $0x400,%eax
> ffffffff8049e2cd: 0 0f 45 c8 cmovne %eax,%ecx
> ffffffff8049e2d0: 0 5b pop %rbx
> ffffffff8049e2d1: 85064 5d pop %rbp
> ffffffff8049e2d2: 63776 41 5c pop %r12
> ffffffff8049e2d4: 1 89 c8 mov %ecx,%eax
> ffffffff8049e2d6: 474 c3 retq
>
> small function, big bang - 1.7% of the total overhead.
>
> 90% of this function's cost is in the closing sequence. My guess would
> be that it originates from ffffffff8049e2ae (the branch after that is
> not taken), which corresponds to this source code context:
>
> (gdb) list *0xffffffff8049e2ae
> 0xffffffff8049e2ae is in eth_type_trans (net/ethernet/eth.c:199).
> 194         if (netdev_uses_dsa_tags(dev))
> 195                 return htons(ETH_P_DSA);
> 196         if (netdev_uses_trailer_tags(dev))
> 197                 return htons(ETH_P_TRAILER);
> 198
> 199         if (ntohs(eth->h_proto) >= 1536)
> 200                 return eth->h_proto;
> 201
> 202         rawp = skb->data;
> 203
>
> eth->h_proto access.
>
> Given that this workload does localhost networking, my guess would be
> that eth->h_proto is bouncing around between 16 CPUs? At minimum this
> read-mostly field should be separated from the bouncing bits.
>

"eth" is on the frame itself, so each cpu is handling a skb it owns.
If there is a cache line miss, then scheduler might have done a wrong schedule ?
(tbench server and tbench client on different cpus)

But seeing your disassembly, I can see compare_ether_addr() is not inlined.

This sucks.

/**
 * compare_ether_addr - Compare two Ethernet addresses
 * @addr1: Pointer to a six-byte array containing the Ethernet address
 * @addr2: Pointer to the other six-byte array containing the Ethernet address
 *
 * Compare two Ethernet addresses, returns 0 if equal
 */
static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2)
{
        const u16 *a = (const u16 *) addr1;
        const u16 *b = (const u16 *) addr2;

        BUILD_BUG_ON(ETH_ALEN != 6);
        return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;
}

On my machine/compiler it is inlined, and that makes a big difference.

c0420750 <eth_type_trans>: /* eth_type_trans total: 14417 0.4101 */
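
If gcc (at -Os, say) refuses to inline it, one untested way of forcing
the issue would be to mark it __always_inline:

/* hypothetical sketch - same body as above, only the annotation changes: */
static __always_inline unsigned compare_ether_addr(const u8 *addr1,
                                                   const u8 *addr2)
{
        const u16 *a = (const u16 *) addr1;
        const u16 *b = (const u16 *) addr2;

        BUILD_BUG_ON(ETH_ALEN != 6);
        return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;
}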

2008-11-17 21:52:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28



On Mon, 17 Nov 2008, Ingo Molnar wrote:
> ffffffff8049e2ae: 0 0f b7 c0 movzwl %ax,%eax
> ffffffff8049e2b1: 0 3d ff 05 00 00 cmp $0x5ff,%eax
> ffffffff8049e2b6: 468 7f 18 jg ffffffff8049e2d0 <eth_type_trans+0xbb>
> ffffffff8049e2b8: 0 48 8b 83 d8 00 00 00 mov 0xd8(%rbx),%rax
> ffffffff8049e2bf: 0 b9 00 01 00 00 mov $0x100,%ecx
> ffffffff8049e2c4: 0 66 83 38 ff cmpw $0xffffffffffffffff,(%rax)
> ffffffff8049e2c8: 0 b8 00 04 00 00 mov $0x400,%eax
> ffffffff8049e2cd: 0 0f 45 c8 cmovne %eax,%ecx
> ffffffff8049e2d0: 0 5b pop %rbx
> ffffffff8049e2d1: 85064 5d pop %rbp
> ffffffff8049e2d2: 63776 41 5c pop %r12
> ffffffff8049e2d4: 1 89 c8 mov %ecx,%eax
> ffffffff8049e2d6: 474 c3 retq
>
> small function, big bang - 1.7% of the total overhead.
>
> 90% of this function's cost is in the closing sequence. My guess would
> be that it originates from ffffffff8049e2ae (the branch after that is
> not taken), which corresponds to this source code context:

I would actually suspect that branch mispredicts may be an issue.

If that thing falls out of the branch prediction table (which it could
easily do), then a forward branch will be predicted as "not taken". And if
it then turns out that the _common_ case is the other way around, the
incorrectly predicted destination is often the one that shows up in
profiles.

Giving gcc likely()/unlikely() hints usually doesn't much help, I'm
afraid. It _can_ make a difference, but often not for -Os in particular.

Linus

2008-11-17 22:00:30

by Ingo Molnar

[permalink] [raw]
Subject: system_call() - Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> 100.000000 total
> ................
> 1.508888 system_call

that's an easy one:

ffffffff8020be00: 97321 <system_call>:
ffffffff8020be00: 97321 0f 01 f8 swapgs
ffffffff8020be03: 53089 66 66 66 90 xchg %ax,%ax
ffffffff8020be07: 1524 66 66 90 xchg %ax,%ax
ffffffff8020be0a: 0 66 66 90 xchg %ax,%ax
ffffffff8020be0d: 0 66 66 90 xchg %ax,%ax

ffffffff8020be10: 1511 <system_call_after_swapgs>:
ffffffff8020be10: 1511 65 48 89 24 25 18 00 mov %rsp,%gs:0x18
ffffffff8020be17: 0 00 00
ffffffff8020be19: 0 65 48 8b 24 25 10 00 mov %gs:0x10,%rsp
ffffffff8020be20: 0 00 00
ffffffff8020be22: 1490 fb sti

syscall entry instruction costs - unavoidable security checks, etc. -
hardware costs.

But looking at this profile made me notice this detail:

ENTRY(system_call_after_swapgs)

Combined with this alignment rule we have in
arch/x86/include/asm/linkage.h on 64-bit:

#ifdef CONFIG_X86_64
#define __ALIGN .p2align 4,,15
#define __ALIGN_STR ".p2align 4,,15"
#endif

while it inserts NOP sequences, that is still +13 bytes of excessive,
stupid alignment padding, straight in our syscall entry path.
(.p2align 4,,15 pads to the next 16-byte boundary with up to 15 bytes
of NOPs; system_call_after_swapgs would naturally start 3 bytes past
the boundary, so we eat 13 bytes of padding before it.)

system_call_after_swapgs is an utter slowpath in any case. The interim
fix is below - although it needs more thinking and probably should be
done via an ENTRY_UNALIGNED() method as well, for slowpath targets.

With that we get this much nicer entry sequence:

ffffffff8020be00: 544323 <system_call>:
ffffffff8020be00: 544323 0f 01 f8 swapgs

ffffffff8020be03: 197954 <system_call_after_swapgs>:
ffffffff8020be03: 197954 65 48 89 24 25 18 00 mov %rsp,%gs:0x18
ffffffff8020be0a: 0 00 00
ffffffff8020be0c: 6578 65 48 8b 24 25 10 00 mov %gs:0x10,%rsp
ffffffff8020be13: 0 00 00
ffffffff8020be15: 0 fb sti
ffffffff8020be16: 0 48 83 ec 50 sub $0x50,%rsp

And we should probably weaken the generic code alignment rules as well
on x86. I'll do some measurements of it.
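
(e.g. something like this - hypothetical and unmeasured - which only
aligns an entry point when that costs at most 3 padding bytes:)

#ifdef CONFIG_X86_64
#define __ALIGN .p2align 4,,3
#define __ALIGN_STR ".p2align 4,,3"
#endif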

Ingo

Index: linux/arch/x86/kernel/entry_64.S
===================================================================
--- linux.orig/arch/x86/kernel/entry_64.S
+++ linux/arch/x86/kernel/entry_64.S
@@ -315,7 +315,8 @@ ENTRY(system_call)
* after the swapgs, so that it can do the swapgs
* for the guest and jump here on syscall.
*/
-ENTRY(system_call_after_swapgs)
+.globl system_call_after_swapgs
+system_call_after_swapgs:

movq %rsp,%gs:pda_oldrsp
movq %gs:pda_kernelstack,%rsp

2008-11-17 22:09:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> 100.000000 total
> ................
> 1.469183 tcp_current_mss

hits (total: 146918)
.........
ffffffff804c5237: 526 <tcp_current_mss>:
ffffffff804c5237: 526 41 54 push %r12
ffffffff804c5239: 5929 55 push %rbp
ffffffff804c523a: 32 53 push %rbx
ffffffff804c523b: 294 48 89 fb mov %rdi,%rbx
ffffffff804c523e: 539 48 83 ec 30 sub $0x30,%rsp
ffffffff804c5242: 2590 85 f6 test %esi,%esi
ffffffff804c5244: 444 48 8b 4f 78 mov 0x78(%rdi),%rcx
ffffffff804c5248: 521 8b af 4c 04 00 00 mov 0x44c(%rdi),%ebp
ffffffff804c524e: 791 74 2a je ffffffff804c527a <tcp_current_mss+0x43>
ffffffff804c5250: 433 8b 87 00 01 00 00 mov 0x100(%rdi),%eax
ffffffff804c5256: 236 c1 e0 10 shl $0x10,%eax
ffffffff804c5259: 191 89 c2 mov %eax,%edx
ffffffff804c525b: 487 23 97 fc 00 00 00 and 0xfc(%rdi),%edx
ffffffff804c5261: 362 39 c2 cmp %eax,%edx
ffffffff804c5263: 342 75 15 jne ffffffff804c527a <tcp_current_mss+0x43>
ffffffff804c5265: 473 45 31 e4 xor %r12d,%r12d
ffffffff804c5268: 221 8b 87 00 04 00 00 mov 0x400(%rdi),%eax
ffffffff804c526e: 194 3b 87 80 04 00 00 cmp 0x480(%rdi),%eax
ffffffff804c5274: 445 41 0f 94 c4 sete %r12b
ffffffff804c5278: 261 eb 03 jmp ffffffff804c527d <tcp_current_mss+0x46>
ffffffff804c527a: 0 45 31 e4 xor %r12d,%r12d
ffffffff804c527d: 185 48 85 c9 test %rcx,%rcx
ffffffff804c5280: 686 74 15 je ffffffff804c5297 <tcp_current_mss+0x60>
ffffffff804c5282: 1806 8b 71 7c mov 0x7c(%rcx),%esi
ffffffff804c5285: 1 3b b3 5c 03 00 00 cmp 0x35c(%rbx),%esi
ffffffff804c528b: 21 74 0a je ffffffff804c5297 <tcp_current_mss+0x60>
ffffffff804c528d: 0 48 89 df mov %rbx,%rdi
ffffffff804c5290: 0 e8 8b fb ff ff callq ffffffff804c4e20 <tcp_sync_mss>
ffffffff804c5295: 0 89 c5 mov %eax,%ebp
ffffffff804c5297: 864 48 8d 4c 24 28 lea 0x28(%rsp),%rcx
ffffffff804c529c: 634 48 8d 54 24 10 lea 0x10(%rsp),%rdx
ffffffff804c52a1: 995 31 f6 xor %esi,%esi
ffffffff804c52a3: 0 48 89 df mov %rbx,%rdi
ffffffff804c52a6: 2 e8 f2 fe ff ff callq ffffffff804c519d <tcp_established_options>
ffffffff804c52ab: 859 8b 8b e8 03 00 00 mov 0x3e8(%rbx),%ecx
ffffffff804c52b1: 936 83 c0 14 add $0x14,%eax
ffffffff804c52b4: 6 0f b7 d1 movzwl %cx,%edx
ffffffff804c52b7: 0 39 d0 cmp %edx,%eax
ffffffff804c52b9: 911 74 04 je ffffffff804c52bf <tcp_current_mss+0x88>
ffffffff804c52bb: 0 29 d0 sub %edx,%eax
ffffffff804c52bd: 0 29 c5 sub %eax,%ebp
ffffffff804c52bf: 0 45 85 e4 test %r12d,%r12d
ffffffff804c52c2: 6894 89 e8 mov %ebp,%eax
ffffffff804c52c4: 0 74 38 je ffffffff804c52fe <tcp_current_mss+0xc7>
ffffffff804c52c6: 990 48 8b 83 68 03 00 00 mov 0x368(%rbx),%rax
ffffffff804c52cd: 642 8b b3 04 01 00 00 mov 0x104(%rbx),%esi
ffffffff804c52d3: 3 48 89 df mov %rbx,%rdi
ffffffff804c52d6: 240 66 2b 70 30 sub 0x30(%rax),%si
ffffffff804c52da: 588 66 2b b3 7e 03 00 00 sub 0x37e(%rbx),%si
ffffffff804c52e1: 2 66 29 ce sub %cx,%si
ffffffff804c52e4: 284 ff ce dec %esi
ffffffff804c52e6: 664 0f b7 f6 movzwl %si,%esi
ffffffff804c52e9: 2 e8 0a fb ff ff callq ffffffff804c4df8 <tcp_bound_to_half_wnd>
ffffffff804c52ee: 68 0f b7 d0 movzwl %ax,%edx
ffffffff804c52f1: 1870 89 c1 mov %eax,%ecx
ffffffff804c52f3: 0 89 d0 mov %edx,%eax
ffffffff804c52f5: 0 31 d2 xor %edx,%edx
ffffffff804c52f7: 2135 f7 f5 div %ebp
ffffffff804c52f9: 107010 89 c8 mov %ecx,%eax
ffffffff804c52fb: 1670 66 29 d0 sub %dx,%ax
ffffffff804c52fe: 0 66 89 83 ea 03 00 00 mov %ax,0x3ea(%rbx)
ffffffff804c5305: 4 48 83 c4 30 add $0x30,%rsp
ffffffff804c5309: 855 89 e8 mov %ebp,%eax
ffffffff804c530b: 0 5b pop %rbx
ffffffff804c530c: 797 5d pop %rbp
ffffffff804c530d: 0 41 5c pop %r12
ffffffff804c530f: 0 c3 retq

apparently this division causes 1.0% of tbench overhead:

ffffffff804c52f5: 0 31 d2 xor %edx,%edx
ffffffff804c52f7: 2135 f7 f5 div %ebp
ffffffff804c52f9: 107010 89 c8 mov %ecx,%eax

(gdb) list *0xffffffff804c52f7
0xffffffff804c52f7 is in tcp_current_mss (net/ipv4/tcp_output.c:1078).
1073                                   inet_csk(sk)->icsk_af_ops->net_header_len -
1074                                   inet_csk(sk)->icsk_ext_hdr_len -
1075                                   tp->tcp_header_len);
1076
1077                 xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
1078                 xmit_size_goal -= (xmit_size_goal % mss_now);
1079         }
1080         tp->xmit_size_goal = xmit_size_goal;
1081
1082         return mss_now;
(gdb)

it's this division:

if (doing_tso) {
[...]
xmit_size_goal -= (xmit_size_goal % mss_now);

Has no-one hit this before? Perhaps this is why switching loopback
networking to TSO had a performance impact for others?

It's still a bit weird ... how can a single division cause this much
overhead? tcp_bound_to_half_wnd() [which is called straight before
this sequence] seems low-overhead.
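
(one untested idea, if the div itself really is the cost: mss_now only
changes occasionally, so we could cache a reciprocal_value() of it in a
hypothetical tp->mss_recip field - recomputed wherever mss_now gets
updated - and turn the div into a multiply via lib/reciprocal_div.c:)

#include <linux/reciprocal_div.h>

        /* hypothetical cached field, updated whenever mss_now changes: */
        u32 mss_recip = reciprocal_value(mss_now);

        /* round down to a multiple of mss_now without a hardware div: */
        xmit_size_goal = reciprocal_divide(xmit_size_goal, mss_recip) * mss_now;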

Ingo

2008-11-17 22:11:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: system_call() - Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28



On Mon, 17 Nov 2008, Ingo Molnar wrote:
>
> syscall entry instruction costs - unavoidable security checks, etc. -
> hardware costs.

Yes. One thing to look out for on x86 is the system call _return_ path. It
doesn't show up in kernel profiles (it shows up as user costs), and we had
a bug where auditing essentially always caused us to use 'iret' instead of
'sysret' because it took us the long way around.

And profiling doesn't show it, but things like lmbench did: iret is
about five times slower than sysret.

But yes:

> -ENTRY(system_call_after_swapgs)
> +.globl system_call_after_swapgs
> +system_call_after_swapgs:

This definitely makes sense. We definitely do not want to align that
special case.

Linus

2008-11-17 22:15:30

by Ingo Molnar

[permalink] [raw]
Subject: tcp_transmit_skb() - Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> 100.000000 total
> ................
> 1.431553 tcp_transmit_skb

hits (total: 143155)
.........
ffffffff804c550e: 485 <tcp_transmit_skb>:
ffffffff804c550e: 485 41 57 push %r15
ffffffff804c5510: 5692 41 56 push %r14
ffffffff804c5512: 390 49 89 f6 mov %rsi,%r14
ffffffff804c5515: 0 41 55 push %r13
ffffffff804c5517: 69 41 54 push %r12
ffffffff804c5519: 388 41 89 d4 mov %edx,%r12d
ffffffff804c551c: 0 55 push %rbp
ffffffff804c551d: 66 48 89 fd mov %rdi,%rbp
ffffffff804c5520: 405 53 push %rbx
ffffffff804c5521: 0 89 cb mov %ecx,%ebx
ffffffff804c5523: 75 48 83 ec 38 sub $0x38,%rsp
ffffffff804c5527: 396 48 85 f6 test %rsi,%rsi
ffffffff804c552a: 51 74 15 je ffffffff804c5541 <tcp_transmit_skb+0x33>
ffffffff804c552c: 396 8b 96 c8 00 00 00 mov 0xc8(%rsi),%edx
ffffffff804c5532: 1 48 8b 86 d0 00 00 00 mov 0xd0(%rsi),%rax
ffffffff804c5539: 63 66 83 7c 02 08 00 cmpw $0x0,0x8(%rdx,%rax,1)
ffffffff804c553f: 417 75 04 jne ffffffff804c5545 <tcp_transmit_skb+0x37>
ffffffff804c5541: 0 0f 0b ud2a
ffffffff804c5543: 0 eb fe jmp ffffffff804c5543 <tcp_transmit_skb+0x35>
ffffffff804c5545: 3719 48 8b 87 60 03 00 00 mov 0x360(%rdi),%rax
ffffffff804c554c: 2873 f6 40 10 02 testb $0x2,0x10(%rax)
ffffffff804c5550: 1 74 09 je ffffffff804c555b <tcp_transmit_skb+0x4d>
ffffffff804c5552: 0 e8 1d 48 d8 ff callq ffffffff80249d74 <ktime_get_real>
ffffffff804c5557: 0 49 89 46 18 mov %rax,0x18(%r14)
ffffffff804c555b: 487 45 85 e4 test %r12d,%r12d
ffffffff804c555e: 456 74 33 je ffffffff804c5593 <tcp_transmit_skb+0x85>
ffffffff804c5560: 0 4c 89 f7 mov %r14,%rdi
ffffffff804c5563: 482 e8 28 f4 ff ff callq ffffffff804c4990 <skb_cloned>
ffffffff804c5568: 1469 85 c0 test %eax,%eax
ffffffff804c556a: 1085 74 0c je ffffffff804c5578 <tcp_transmit_skb+0x6a>
ffffffff804c556c: 0 89 de mov %ebx,%esi
ffffffff804c556e: 0 4c 89 f7 mov %r14,%rdi
ffffffff804c5571: 0 e8 47 41 fc ff callq ffffffff804896bd <pskb_copy>
ffffffff804c5576: 0 eb 0a jmp ffffffff804c5582 <tcp_transmit_skb+0x74>
ffffffff804c5578: 0 89 de mov %ebx,%esi
ffffffff804c557a: 906 4c 89 f7 mov %r14,%rdi
ffffffff804c557d: 0 e8 ab 35 fc ff callq ffffffff80488b2d <skb_clone>
ffffffff804c5582: 0 48 85 c0 test %rax,%rax
ffffffff804c5585: 7 49 89 c6 mov %rax,%r14
ffffffff804c5588: 576 bb 97 ff ff ff mov $0xffffff97,%ebx
ffffffff804c558d: 0 0f 84 59 05 00 00 je ffffffff804c5aec <tcp_transmit_skb+0x5de>
ffffffff804c5593: 0 49 8d 46 38 lea 0x38(%r14),%rax
ffffffff804c5597: 699 48 8d 54 24 10 lea 0x10(%rsp),%rdx
ffffffff804c559c: 1 fc cld
ffffffff804c559d: 452 48 89 04 24 mov %rax,(%rsp)
ffffffff804c55a1: 40 48 89 d7 mov %rdx,%rdi
ffffffff804c55a4: 1 31 c0 xor %eax,%eax
ffffffff804c55a6: 432 ab stos %eax,%es:(%rdi)
ffffffff804c55a7: 956 ab stos %eax,%es:(%rdi)
ffffffff804c55a8: 959 ab stos %eax,%es:(%rdi)
ffffffff804c55a9: 910 ab stos %eax,%es:(%rdi)
ffffffff804c55aa: 943 48 8b 0c 24 mov (%rsp),%rcx
ffffffff804c55ae: 455 f6 41 24 02 testb $0x2,0x24(%rcx)
ffffffff804c55b2: 0 0f 84 b7 00 00 00 je ffffffff804c566f <tcp_transmit_skb+0x161>
ffffffff804c55b8: 0 48 8b 85 b8 05 00 00 mov 0x5b8(%rbp),%rax
ffffffff804c55bf: 0 48 89 ee mov %rbp,%rsi
ffffffff804c55c2: 0 48 89 ef mov %rbp,%rdi
ffffffff804c55c5: 0 ff 10 callq *(%rax)
ffffffff804c55c7: 0 31 f6 xor %esi,%esi
ffffffff804c55c9: 0 48 85 c0 test %rax,%rax
ffffffff804c55cc: 0 48 89 44 24 28 mov %rax,0x28(%rsp)
ffffffff804c55d1: 0 74 08 je ffffffff804c55db <tcp_transmit_skb+0xcd>
ffffffff804c55d3: 0 80 4c 24 10 04 orb $0x4,0x10(%rsp)
ffffffff804c55d8: 0 40 b6 14 mov $0x14,%sil
ffffffff804c55db: 0 48 8b 55 78 mov 0x78(%rbp),%rdx
ffffffff804c55df: 0 0f b7 85 5c 04 00 00 movzwl 0x45c(%rbp),%eax
ffffffff804c55e6: 0 48 85 d2 test %rdx,%rdx
ffffffff804c55e9: 0 74 13 je ffffffff804c55fe <tcp_transmit_skb+0xf0>
ffffffff804c55eb: 0 8b 92 94 00 00 00 mov 0x94(%rdx),%edx
ffffffff804c55f1: 0 39 c2 cmp %eax,%edx
ffffffff804c55f3: 0 73 09 jae ffffffff804c55fe <tcp_transmit_skb+0xf0>
ffffffff804c55f5: 0 89 d0 mov %edx,%eax
ffffffff804c55f7: 0 66 89 95 5c 04 00 00 mov %dx,0x45c(%rbp)
ffffffff804c55fe: 0 83 3d 23 2e 3f 00 00 cmpl $0x0,0x3f2e23(%rip) # ffffffff808b8428 <sysctl_tcp_timestamps>
ffffffff804c5605: 0 66 89 44 24 14 mov %ax,0x14(%rsp)
ffffffff804c560a: 0 8d 4e 04 lea 0x4(%rsi),%ecx
ffffffff804c560d: 0 74 25 je ffffffff804c5634 <tcp_transmit_skb+0x126>
ffffffff804c560f: 0 48 83 7c 24 28 00 cmpq $0x0,0x28(%rsp)
ffffffff804c5615: 0 75 1d jne ffffffff804c5634 <tcp_transmit_skb+0x126>
ffffffff804c5617: 0 48 8b 14 24 mov (%rsp),%rdx
ffffffff804c561b: 0 80 4c 24 10 02 orb $0x2,0x10(%rsp)
ffffffff804c5620: 0 8d 4e 10 lea 0x10(%rsi),%ecx
ffffffff804c5623: 0 8b 42 20 mov 0x20(%rdx),%eax
ffffffff804c5626: 0 89 44 24 18 mov %eax,0x18(%rsp)
ffffffff804c562a: 0 8b 85 90 04 00 00 mov 0x490(%rbp),%eax
ffffffff804c5630: 0 89 44 24 1c mov %eax,0x1c(%rsp)
ffffffff804c5634: 0 83 3d f1 2d 3f 00 00 cmpl $0x0,0x3f2df1(%rip) # ffffffff808b842c <sysctl_tcp_window_scaling>
ffffffff804c563b: 0 74 15 je ffffffff804c5652 <tcp_transmit_skb+0x144>
ffffffff804c563d: 0 8a 85 9d 04 00 00 mov 0x49d(%rbp),%al
ffffffff804c5643: 0 8d 51 04 lea 0x4(%rcx),%edx
ffffffff804c5646: 0 c0 e8 04 shr $0x4,%al
ffffffff804c5649: 0 84 c0 test %al,%al
ffffffff804c564b: 0 88 44 24 11 mov %al,0x11(%rsp)
ffffffff804c564f: 0 0f 45 ca cmovne %edx,%ecx
ffffffff804c5652: 0 83 3d d7 2d 3f 00 00 cmpl $0x0,0x3f2dd7(%rip) # ffffffff808b8430 <sysctl_tcp_sack>
ffffffff804c5659: 0 74 26 je ffffffff804c5681 <tcp_transmit_skb+0x173>
ffffffff804c565b: 0 8a 44 24 10 mov 0x10(%rsp),%al
ffffffff804c565f: 0 83 c8 01 or $0x1,%eax
ffffffff804c5662: 0 a8 02 test $0x2,%al
ffffffff804c5664: 0 88 44 24 10 mov %al,0x10(%rsp)
ffffffff804c5668: 0 75 17 jne ffffffff804c5681 <tcp_transmit_skb+0x173>
ffffffff804c566a: 0 83 c1 04 add $0x4,%ecx
ffffffff804c566d: 0 eb 12 jmp ffffffff804c5681 <tcp_transmit_skb+0x173>
ffffffff804c566f: 502 48 8d 4c 24 28 lea 0x28(%rsp),%rcx
ffffffff804c5674: 638 4c 89 f6 mov %r14,%rsi
ffffffff804c5677: 0 48 89 ef mov %rbp,%rdi
ffffffff804c567a: 0 e8 1e fb ff ff callq ffffffff804c519d <tcp_established_options>
ffffffff804c567f: 468 89 c1 mov %eax,%ecx
ffffffff804c5681: 1605 8b 85 74 04 00 00 mov 0x474(%rbp),%eax
ffffffff804c5687: 307 03 85 78 04 00 00 add 0x478(%rbp),%eax
ffffffff804c568d: 0 44 8d 69 14 lea 0x14(%rcx),%r13d
ffffffff804c5691: 409 2b 85 d0 04 00 00 sub 0x4d0(%rbp),%eax
ffffffff804c5697: 89 3b 85 cc 04 00 00 cmp 0x4cc(%rbp),%eax
ffffffff804c569d: 0 75 0a jne ffffffff804c56a9 <tcp_transmit_skb+0x19b>
ffffffff804c569f: 415 31 f6 xor %esi,%esi
ffffffff804c56a1: 210 48 89 ef mov %rbp,%rdi
ffffffff804c56a4: 0 e8 b0 f3 ff ff callq ffffffff804c4a59 <tcp_ca_event>
ffffffff804c56a9: 1050 44 89 ee mov %r13d,%esi
ffffffff804c56ac: 1063 4c 89 f7 mov %r14,%rdi
ffffffff804c56af: 0 e8 00 34 fc ff callq ffffffff80488ab4 <skb_push>
ffffffff804c56b4: 0 4c 89 f7 mov %r14,%rdi
ffffffff804c56b7: 789 e8 4f f3 ff ff callq ffffffff804c4a0b <skb_reset_transport_header>
ffffffff804c56bc: 509 f0 ff 45 28 lock incl 0x28(%rbp)
ffffffff804c56c0: 494 49 89 6e 10 mov %rbp,0x10(%r14)
ffffffff804c56c4: 3510 49 c7 86 80 00 00 00 movq $0xffffffff80486679,0x80(%r14)
ffffffff804c56cb: 0 79 66 48 80
ffffffff804c56cf: 102 41 8b 86 e0 00 00 00 mov 0xe0(%r14),%eax
ffffffff804c56d6: 155 f0 01 85 98 00 00 00 lock add %eax,0x98(%rbp)
ffffffff804c56dd: 437 41 8b 9e b8 00 00 00 mov 0xb8(%r14),%ebx
ffffffff804c56e4: 219 8b 85 50 02 00 00 mov 0x250(%rbp),%eax
ffffffff804c56ea: 71 49 03 9e d0 00 00 00 add 0xd0(%r14),%rbx
ffffffff804c56f1: 735 66 89 03 mov %ax,(%rbx)
ffffffff804c56f4: 0 8b 85 38 02 00 00 mov 0x238(%rbp),%eax
ffffffff804c56fa: 75 66 89 43 02 mov %ax,0x2(%rbx)
ffffffff804c56fe: 720 48 8b 0c 24 mov (%rsp),%rcx
ffffffff804c5702: 5992 8b 41 18 mov 0x18(%rcx),%eax
ffffffff804c5705: 1460 0f c8 bswap %eax
ffffffff804c5707: 60 89 43 04 mov %eax,0x4(%rbx)
ffffffff804c570a: 69 8b 85 f0 03 00 00 mov 0x3f0(%rbp),%eax
ffffffff804c5710: 374 0f c8 bswap %eax
ffffffff804c5712: 43 89 43 08 mov %eax,0x8(%rbx)
ffffffff804c5715: 76 0f b6 51 24 movzbl 0x24(%rcx),%edx
ffffffff804c5719: 337 44 89 e8 mov %r13d,%eax
ffffffff804c571c: 36 c1 e8 02 shr $0x2,%eax
ffffffff804c571f: 76 c1 e0 0c shl $0xc,%eax
ffffffff804c5722: 476 09 d0 or %edx,%eax
ffffffff804c5724: 48 66 c1 c0 08 rol $0x8,%ax
ffffffff804c5728: 51 66 89 43 0c mov %ax,0xc(%rbx)
ffffffff804c572c: 370 0f b6 41 24 movzbl 0x24(%rcx),%eax
ffffffff804c5730: 137 89 c2 mov %eax,%edx
ffffffff804c5732: 118 83 e2 02 and $0x2,%edx
ffffffff804c5735: 377 74 1b je ffffffff804c5752 <tcp_transmit_skb+0x244>
ffffffff804c5737: 0 81 bd c0 04 00 00 ff cmpl $0xffff,0x4c0(%rbp)
ffffffff804c573e: 0 ff 00 00
ffffffff804c5741: 0 b8 ff ff 00 00 mov $0xffff,%eax
ffffffff804c5746: 0 0f 46 85 c0 04 00 00 cmovbe 0x4c0(%rbp),%eax
ffffffff804c574d: 0 e9 a0 00 00 00 jmpq ffffffff804c57f2 <tcp_transmit_skb+0x2e4>
ffffffff804c5752: 34 8b 85 f8 03 00 00 mov 0x3f8(%rbp),%eax
ffffffff804c5758: 5610 03 85 c0 04 00 00 add 0x4c0(%rbp),%eax
ffffffff804c575e: 44 41 89 d4 mov %edx,%r12d
ffffffff804c5761: 539 2b 85 f0 03 00 00 sub 0x3f0(%rbp),%eax
ffffffff804c5767: 1 48 89 ef mov %rbp,%rdi
ffffffff804c576a: 51 44 0f 49 e0 cmovns %eax,%r12d
ffffffff804c576e: 495 e8 7e f8 ff ff callq ffffffff804c4ff1 <__tcp_select_window>
ffffffff804c5773: 484 44 39 e0 cmp %r12d,%eax
ffffffff804c5776: 244 89 c2 mov %eax,%edx
ffffffff804c5778: 0 73 19 jae ffffffff804c5793 <tcp_transmit_skb+0x285>
ffffffff804c577a: 0 8a 8d 9d 04 00 00 mov 0x49d(%rbp),%cl
ffffffff804c5780: 0 b8 01 00 00 00 mov $0x1,%eax
ffffffff804c5785: 0 c0 e9 04 shr $0x4,%cl
ffffffff804c5788: 0 d3 e0 shl %cl,%eax
ffffffff804c578a: 0 42 8d 54 20 ff lea -0x1(%rax,%r12,1),%edx
ffffffff804c578f: 0 f7 d8 neg %eax
ffffffff804c5791: 0 21 c2 and %eax,%edx
ffffffff804c5793: 217 f6 85 9d 04 00 00 f0 testb $0xf0,0x49d(%rbp)
ffffffff804c579a: 2014 8b 85 f0 03 00 00 mov 0x3f0(%rbp),%eax
ffffffff804c57a0: 0 89 95 c0 04 00 00 mov %edx,0x4c0(%rbp)
ffffffff804c57a6: 490 89 85 f8 03 00 00 mov %eax,0x3f8(%rbp)
ffffffff804c57ac: 1 75 16 jne ffffffff804c57c4 <tcp_transmit_skb+0x2b6>
ffffffff804c57ae: 0 83 3d bb 2c 3f 00 00 cmpl $0x0,0x3f2cbb(%rip) # ffffffff808b8470 <sysctl_tcp_workaround_signed_windows>
ffffffff804c57b5: 0 74 0d je ffffffff804c57c4 <tcp_transmit_skb+0x2b6>
ffffffff804c57b7: 0 b8 ff 7f 00 00 mov $0x7fff,%eax
ffffffff804c57bc: 0 81 fa ff 7f 00 00 cmp $0x7fff,%edx
ffffffff804c57c2: 0 eb 12 jmp ffffffff804c57d6 <tcp_transmit_skb+0x2c8>
ffffffff804c57c4: 0 8a 8d 9d 04 00 00 mov 0x49d(%rbp),%cl
ffffffff804c57ca: 7025 b8 ff ff 00 00 mov $0xffff,%eax
ffffffff804c57cf: 0 c0 e9 04 shr $0x4,%cl
ffffffff804c57d2: 418 d3 e0 shl %cl,%eax
ffffffff804c57d4: 102 39 c2 cmp %eax,%edx
ffffffff804c57d6: 0 8a 8d 9d 04 00 00 mov 0x49d(%rbp),%cl
ffffffff804c57dc: 424 0f 46 c2 cmovbe %edx,%eax
ffffffff804c57df: 105 c0 e9 04 shr $0x4,%cl
ffffffff804c57e2: 9 d3 e8 shr %cl,%eax
ffffffff804c57e4: 389 85 c0 test %eax,%eax
ffffffff804c57e6: 76 75 0a jne ffffffff804c57f2 <tcp_transmit_skb+0x2e4>
ffffffff804c57e8: 0 c7 85 ec 03 00 00 00 movl $0x0,0x3ec(%rbp)
ffffffff804c57ef: 0 00 00 00
ffffffff804c57f2: 2 66 c1 c0 08 rol $0x8,%ax
ffffffff804c57f6: 1657 66 c7 43 10 00 00 movw $0x0,0x10(%rbx)
ffffffff804c57fc: 35 66 c7 43 12 00 00 movw $0x0,0x12(%rbx)
ffffffff804c5802: 4377 66 89 43 0e mov %ax,0xe(%rbx)
ffffffff804c5806: 954 8b 95 80 04 00 00 mov 0x480(%rbp),%edx
ffffffff804c580c: 31 39 95 00 04 00 00 cmp %edx,0x400(%rbp)
ffffffff804c5812: 186 74 27 je ffffffff804c583b <tcp_transmit_skb+0x32d>
ffffffff804c5814: 0 48 8b 34 24 mov (%rsp),%rsi
ffffffff804c5818: 0 8b 4e 18 mov 0x18(%rsi),%ecx
ffffffff804c581b: 0 89 d6 mov %edx,%esi
ffffffff804c581d: 0 8d 41 01 lea 0x1(%rcx),%eax
ffffffff804c5820: 0 29 c6 sub %eax,%esi
ffffffff804c5822: 0 81 fe fe ff 00 00 cmp $0xfffe,%esi
ffffffff804c5828: 0 77 11 ja ffffffff804c583b <tcp_transmit_skb+0x32d>
ffffffff804c582a: 0 89 d0 mov %edx,%eax
ffffffff804c582c: 0 80 4b 0d 20 orb $0x20,0xd(%rbx)
ffffffff804c5830: 0 66 29 c8 sub %cx,%ax
ffffffff804c5833: 0 66 c1 c0 08 rol $0x8,%ax
ffffffff804c5837: 0 66 89 43 12 mov %ax,0x12(%rbx)
ffffffff804c583b: 268 48 8d 7b 14 lea 0x14(%rbx),%rdi
ffffffff804c583f: 187 48 8d 4c 24 20 lea 0x20(%rsp),%rcx
ffffffff804c5844: 4006 48 8d 54 24 10 lea 0x10(%rsp),%rdx
ffffffff804c5849: 1117 48 89 ee mov %rbp,%rsi
ffffffff804c584c: 0 e8 a9 fb ff ff callq ffffffff804c53fa <tcp_options_write>
ffffffff804c5851: 1285 48 8b 04 24 mov (%rsp),%rax
ffffffff804c5855: 727 f6 40 24 02 testb $0x2,0x24(%rax)
ffffffff804c5859: 0 0f 85 8f 00 00 00 jne ffffffff804c58ee <tcp_transmit_skb+0x3e0>
ffffffff804c585f: 0 f6 85 7e 04 00 00 01 testb $0x1,0x47e(%rbp)
ffffffff804c5866: 456 0f 84 82 00 00 00 je ffffffff804c58ee <tcp_transmit_skb+0x3e0>
ffffffff804c586c: 0 45 39 6e 68 cmp %r13d,0x68(%r14)
ffffffff804c5870: 0 74 53 je ffffffff804c58c5 <tcp_transmit_skb+0x3b7>
ffffffff804c5872: 0 8b 95 fc 03 00 00 mov 0x3fc(%rbp),%edx
ffffffff804c5878: 0 39 50 18 cmp %edx,0x18(%rax)
ffffffff804c587b: 0 78 48 js ffffffff804c58c5 <tcp_transmit_skb+0x3b7>
ffffffff804c587d: 0 8a 85 7e 04 00 00 mov 0x47e(%rbp),%al
ffffffff804c5883: 0 80 8d 54 02 00 00 02 orb $0x2,0x254(%rbp)
ffffffff804c588a: 0 a8 02 test $0x2,%al
ffffffff804c588c: 0 74 3e je ffffffff804c58cc <tcp_transmit_skb+0x3be>
ffffffff804c588e: 0 83 e0 fd and $0xfffffffffffffffd,%eax
ffffffff804c5891: 0 88 85 7e 04 00 00 mov %al,0x47e(%rbp)
ffffffff804c5897: 0 41 8b 8e b8 00 00 00 mov 0xb8(%r14),%ecx
ffffffff804c589e: 0 49 8b 96 d0 00 00 00 mov 0xd0(%r14),%rdx
ffffffff804c58a5: 0 8a 44 11 0d mov 0xd(%rcx,%rdx,1),%al
ffffffff804c58a9: 0 83 c8 80 or $0xffffffffffffff80,%eax
ffffffff804c58ac: 0 88 44 0a 0d mov %al,0xd(%rdx,%rcx,1)
ffffffff804c58b0: 0 41 8b 86 c8 00 00 00 mov 0xc8(%r14),%eax
ffffffff804c58b7: 0 49 03 86 d0 00 00 00 add 0xd0(%r14),%rax
ffffffff804c58be: 0 66 83 48 0a 08 orw $0x8,0xa(%rax)
ffffffff804c58c3: 0 eb 07 jmp ffffffff804c58cc <tcp_transmit_skb+0x3be>
ffffffff804c58c5: 0 80 a5 54 02 00 00 fc andb $0xfc,0x254(%rbp)
ffffffff804c58cc: 0 f6 85 7e 04 00 00 04 testb $0x4,0x47e(%rbp)
ffffffff804c58d3: 0 74 19 je ffffffff804c58ee <tcp_transmit_skb+0x3e0>
ffffffff804c58d5: 0 41 8b 8e b8 00 00 00 mov 0xb8(%r14),%ecx
ffffffff804c58dc: 0 49 8b 96 d0 00 00 00 mov 0xd0(%r14),%rdx
ffffffff804c58e3: 0 8a 44 11 0d mov 0xd(%rcx,%rdx,1),%al
ffffffff804c58e7: 0 83 c8 40 or $0x40,%eax
ffffffff804c58ea: 0 88 44 0a 0d mov %al,0xd(%rdx,%rcx,1)
ffffffff804c58ee: 0 48 83 7c 24 28 00 cmpq $0x0,0x28(%rsp)
ffffffff804c58f4: 9425 74 26 je ffffffff804c591c <tcp_transmit_skb+0x40e>
ffffffff804c58f6: 0 48 8b 85 b8 05 00 00 mov 0x5b8(%rbp),%rax
ffffffff804c58fd: 0 81 a5 fc 00 00 00 ff andl $0xffff,0xfc(%rbp)
ffffffff804c5904: 0 ff 00 00
ffffffff804c5907: 0 4d 89 f0 mov %r14,%r8
ffffffff804c590a: 0 48 8b 74 24 28 mov 0x28(%rsp),%rsi
ffffffff804c590f: 0 48 8b 7c 24 20 mov 0x20(%rsp),%rdi
ffffffff804c5914: 0 31 c9 xor %ecx,%ecx
ffffffff804c5916: 0 48 89 ea mov %rbp,%rdx
ffffffff804c5919: 0 ff 50 08 callq *0x8(%rax)
ffffffff804c591c: 0 48 8b 85 68 03 00 00 mov 0x368(%rbp),%rax
ffffffff804c5923: 2344 41 8b 76 68 mov 0x68(%r14),%esi
ffffffff804c5927: 1 4c 89 f2 mov %r14,%rdx
ffffffff804c592a: 0 48 89 ef mov %rbp,%rdi
ffffffff804c592d: 486 ff 50 08 callq *0x8(%rax)
ffffffff804c5930: 44 48 8b 0c 24 mov (%rsp),%rcx
ffffffff804c5934: 836 f6 41 24 10 testb $0x10,0x24(%rcx)
ffffffff804c5938: 0 74 4f je ffffffff804c5989 <tcp_transmit_skb+0x47b>
ffffffff804c593a: 75 41 8b 96 c8 00 00 00 mov 0xc8(%r14),%edx
ffffffff804c5941: 8600 49 8b 86 d0 00 00 00 mov 0xd0(%r14),%rax
ffffffff804c5948: 1667 8b 44 10 08 mov 0x8(%rax,%rdx,1),%eax
ffffffff804c594c: 13 8a 95 81 03 00 00 mov 0x381(%rbp),%dl
ffffffff804c5952: 24 84 d2 test %dl,%dl
ffffffff804c5954: 429 74 25 je ffffffff804c597b <tcp_transmit_skb+0x46d>
ffffffff804c5956: 0 0f b7 c8 movzwl %ax,%ecx
ffffffff804c5959: 3 0f b6 c2 movzbl %dl,%eax
ffffffff804c595c: 0 39 c1 cmp %eax,%ecx
ffffffff804c595e: 0 72 13 jb ffffffff804c5973 <tcp_transmit_skb+0x465>
ffffffff804c5960: 0 c6 85 81 03 00 00 00 movb $0x0,0x381(%rbp)
ffffffff804c5967: 1 c7 85 84 03 00 00 0a movl $0xa,0x384(%rbp)
ffffffff804c596e: 0 00 00 00
ffffffff804c5971: 0 eb 08 jmp ffffffff804c597b <tcp_transmit_skb+0x46d>
ffffffff804c5973: 1 28 ca sub %cl,%dl
ffffffff804c5975: 0 88 95 81 03 00 00 mov %dl,0x381(%rbp)
ffffffff804c597b: 11 c6 85 80 03 00 00 00 movb $0x0,0x380(%rbp)
ffffffff804c5982: 4553 c6 85 83 03 00 00 00 movb $0x0,0x383(%rbp)
ffffffff804c5989: 714 45 39 6e 68 cmp %r13d,0x68(%r14)
ffffffff804c598d: 1 0f 84 e2 00 00 00 je ffffffff804c5a75 <tcp_transmit_skb+0x567>
ffffffff804c5993: 288 83 3d e6 2a 3f 00 00 cmpl $0x0,0x3f2ae6(%rip) # ffffffff808b8480 <sysctl_tcp_slow_start_after_idle>
ffffffff804c599a: 247 48 8b 05 df 3e 3f 00 mov 0x3f3edf(%rip),%rax # ffffffff808b9880 <jiffies>
ffffffff804c59a1: 711 41 89 c7 mov %eax,%r15d
ffffffff804c59a4: 0 0f 84 ad 00 00 00 je ffffffff804c5a57 <tcp_transmit_skb+0x549>
ffffffff804c59aa: 159 83 bd 74 04 00 00 00 cmpl $0x0,0x474(%rbp)
ffffffff804c59b1: 311 0f 85 a0 00 00 00 jne ffffffff804c5a57 <tcp_transmit_skb+0x549>
ffffffff804c59b7: 0 44 8b ad 0c 04 00 00 mov 0x40c(%rbp),%r13d
ffffffff804c59be: 183 44 29 e8 sub %r13d,%eax
ffffffff804c59c1: 475 3b 85 58 03 00 00 cmp 0x358(%rbp),%eax
ffffffff804c59c7: 54 0f 86 8a 00 00 00 jbe ffffffff804c5a57 <tcp_transmit_skb+0x549>
ffffffff804c59cd: 0 48 8b 75 78 mov 0x78(%rbp),%rsi
ffffffff804c59d1: 1 48 8b 05 a8 3e 3f 00 mov 0x3f3ea8(%rip),%rax # ffffffff808b9880 <jiffies>
ffffffff804c59d8: 0 48 89 ef mov %rbp,%rdi
ffffffff804c59db: 0 48 89 44 24 08 mov %rax,0x8(%rsp)
ffffffff804c59e0: 0 e8 9c 92 ff ff callq ffffffff804bec81 <tcp_init_cwnd>
ffffffff804c59e5: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c59ea: 0 48 89 ef mov %rbp,%rdi
ffffffff804c59ed: 0 41 89 c4 mov %eax,%r12d
ffffffff804c59f0: 0 8b 9d ac 04 00 00 mov 0x4ac(%rbp),%ebx
ffffffff804c59f6: 0 e8 5e f0 ff ff callq ffffffff804c4a59 <tcp_ca_event>
ffffffff804c59fb: 0 48 89 ef mov %rbp,%rdi
ffffffff804c59fe: 0 e8 6d f0 ff ff callq ffffffff804c4a70 <tcp_current_ssthresh>
ffffffff804c5a03: 0 89 85 a8 04 00 00 mov %eax,0x4a8(%rbp)
ffffffff804c5a09: 4 8b 85 58 03 00 00 mov 0x358(%rbp),%eax
ffffffff804c5a0f: 0 41 39 dc cmp %ebx,%r12d
ffffffff804c5a12: 0 8b 54 24 08 mov 0x8(%rsp),%edx
ffffffff804c5a16: 0 89 d9 mov %ebx,%ecx
ffffffff804c5a18: 0 41 0f 46 cc cmovbe %r12d,%ecx
ffffffff804c5a1c: 0 89 c6 mov %eax,%esi
ffffffff804c5a1e: 0 44 29 ea sub %r13d,%edx
ffffffff804c5a21: 0 f7 de neg %esi
ffffffff804c5a23: 0 29 c2 sub %eax,%edx
ffffffff804c5a25: 0 89 d8 mov %ebx,%eax
ffffffff804c5a27: 0 eb 02 jmp ffffffff804c5a2b <tcp_transmit_skb+0x51d>
ffffffff804c5a29: 0 d1 e8 shr %eax
ffffffff804c5a2b: 0 85 d2 test %edx,%edx
ffffffff804c5a2d: 1 7e 06 jle ffffffff804c5a35 <tcp_transmit_skb+0x527>
ffffffff804c5a2f: 0 01 f2 add %esi,%edx
ffffffff804c5a31: 0 39 c8 cmp %ecx,%eax
ffffffff804c5a33: 0 77 f4 ja ffffffff804c5a29 <tcp_transmit_skb+0x51b>
ffffffff804c5a35: 0 39 c8 cmp %ecx,%eax
ffffffff804c5a37: 1 0f 43 c8 cmovae %eax,%ecx
ffffffff804c5a3a: 0 89 8d ac 04 00 00 mov %ecx,0x4ac(%rbp)
ffffffff804c5a40: 0 48 8b 05 39 3e 3f 00 mov 0x3f3e39(%rip),%rax # ffffffff808b9880 <jiffies>
ffffffff804c5a47: 0 c7 85 b8 04 00 00 00 movl $0x0,0x4b8(%rbp)
ffffffff804c5a4e: 0 00 00 00
ffffffff804c5a51: 0 89 85 bc 04 00 00 mov %eax,0x4bc(%rbp)
ffffffff804c5a57: 173 44 89 bd 0c 04 00 00 mov %r15d,0x40c(%rbp)
ffffffff804c5a5e: 5224 44 2b bd 90 03 00 00 sub 0x390(%rbp),%r15d
ffffffff804c5a65: 478 44 3b bd 84 03 00 00 cmp 0x384(%rbp),%r15d
ffffffff804c5a6c: 0 73 07 jae ffffffff804c5a75 <tcp_transmit_skb+0x567>
ffffffff804c5a6e: 38 c6 85 82 03 00 00 01 movb $0x1,0x382(%rbp)
ffffffff804c5a75: 452 48 8b 14 24 mov (%rsp),%rdx
ffffffff804c5a79: 312 8b 42 1c mov 0x1c(%rdx),%eax
ffffffff804c5a7c: 33 39 85 fc 03 00 00 cmp %eax,0x3fc(%rbp)
ffffffff804c5a82: 4768 78 05 js ffffffff804c5a89 <tcp_transmit_skb+0x57b>
ffffffff804c5a84: 0 39 42 18 cmp %eax,0x18(%rdx)
ffffffff804c5a87: 20 75 37 jne ffffffff804c5ac0 <tcp_transmit_skb+0x5b2>
ffffffff804c5a89: 30 65 48 8b 04 25 10 00 mov %gs:0x10,%rax
ffffffff804c5a90: 0 00 00
ffffffff804c5a92: 1059 8b 80 48 e0 ff ff mov -0x1fb8(%rax),%eax
ffffffff804c5a98: 21 65 8b 14 25 24 00 00 mov %gs:0x24,%edx
ffffffff804c5a9f: 0 00
ffffffff804c5aa0: 14 89 d2 mov %edx,%edx
ffffffff804c5aa2: 471 30 c0 xor %al,%al
ffffffff804c5aa4: 3 66 83 f8 01 cmp $0x1,%ax
ffffffff804c5aa8: 21 48 19 c0 sbb %rax,%rax
ffffffff804c5aab: 433 83 e0 08 and $0x8,%eax
ffffffff804c5aae: 2 48 8b 80 98 16 ab 80 mov -0x7f54e968(%rax),%rax
ffffffff804c5ab5: 16 48 f7 d0 not %rax
ffffffff804c5ab8: 457 48 8b 04 d0 mov (%rax,%rdx,8),%rax
ffffffff804c5abc: 3 48 ff 40 58 incq 0x58(%rax)
ffffffff804c5ac0: 20 48 8b 85 68 03 00 00 mov 0x368(%rbp),%rax
ffffffff804c5ac7: 424 31 f6 xor %esi,%esi
ffffffff804c5ac9: 2 4c 89 f7 mov %r14,%rdi
ffffffff804c5acc: 20 ff 10 callq *(%rax)
ffffffff804c5ace: 0 85 c0 test %eax,%eax
ffffffff804c5ad0: 9596 89 c3 mov %eax,%ebx
ffffffff804c5ad2: 0 7e 18 jle ffffffff804c5aec <tcp_transmit_skb+0x5de>
ffffffff804c5ad4: 0 be 01 00 00 00 mov $0x1,%esi
ffffffff804c5ad9: 0 48 89 ef mov %rbp,%rdi
ffffffff804c5adc: 0 e8 d9 91 ff ff callq ffffffff804becba <tcp_enter_cwr>
ffffffff804c5ae1: 0 83 fb 02 cmp $0x2,%ebx
ffffffff804c5ae4: 0 b8 00 00 00 00 mov $0x0,%eax
ffffffff804c5ae9: 0 0f 44 d8 cmove %eax,%ebx
ffffffff804c5aec: 457 48 83 c4 38 add $0x38,%rsp
ffffffff804c5af0: 1473 89 d8 mov %ebx,%eax
ffffffff804c5af2: 0 5b pop %rbx
ffffffff804c5af3: 480 5d pop %rbp
ffffffff804c5af4: 0 41 5c pop %r12
ffffffff804c5af6: 0 41 5d pop %r13
ffffffff804c5af8: 449 41 5e pop %r14
ffffffff804c5afa: 0 41 5f pop %r15
ffffffff804c5afc: 0 c3 retq

looks like spread-out overhead with no particular bad spike. Just
called a lot.

Ingo

2008-11-17 22:15:47

by Eric Dumazet

[permalink] [raw]
Subject: Re: __inet_lookup_established(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> 100.000000 total
>> ................
>> 1.673249 __inet_lookup_established
>
> hits (total: 167324)
> .........
> ffffffff804b9b12: 446 <__inet_lookup_established>:
> ffffffff804b9b12: 446 41 57 push %r15
> ffffffff804b9b14: 4810 89 d0 mov %edx,%eax
> ffffffff804b9b16: 0 0f b7 c9 movzwl %cx,%ecx
> ffffffff804b9b19: 0 41 56 push %r14
> ffffffff804b9b1b: 456 41 55 push %r13
> ffffffff804b9b1d: 0 41 54 push %r12
> ffffffff804b9b1f: 0 55 push %rbp
> ffffffff804b9b20: 427 53 push %rbx
> ffffffff804b9b21: 4 48 89 f3 mov %rsi,%rbx
> ffffffff804b9b24: 2 44 89 c6 mov %r8d,%esi
> ffffffff804b9b27: 504 41 89 c8 mov %ecx,%r8d
> ffffffff804b9b2a: 1 49 89 f7 mov %rsi,%r15
> ffffffff804b9b2d: 1 48 83 ec 08 sub $0x8,%rsp
> ffffffff804b9b31: 462 49 c1 e7 20 shl $0x20,%r15
> ffffffff804b9b35: 0 48 89 3c 24 mov %rdi,(%rsp)
> ffffffff804b9b39: 507 89 d7 mov %edx,%edi
> ffffffff804b9b3b: 38 41 0f b7 d1 movzwl %r9w,%edx
> ffffffff804b9b3f: 0 41 89 d6 mov %edx,%r14d
> ffffffff804b9b42: 863 49 09 c7 or %rax,%r15
> ffffffff804b9b45: 24 41 c1 e6 10 shl $0x10,%r14d
> ffffffff804b9b49: 0 41 09 ce or %ecx,%r14d
> ffffffff804b9b4c: 479 89 f9 mov %edi,%ecx
> ffffffff804b9b4e: 8 48 8b 3c 24 mov (%rsp),%rdi
> ffffffff804b9b52: 0 e8 cc f4 ff ff callq ffffffff804b9023 <inet_ehashfn>
> ffffffff804b9b57: 413 48 89 df mov %rbx,%rdi
> ffffffff804b9b5a: 122 41 89 c5 mov %eax,%r13d
> ffffffff804b9b5d: 0 89 c6 mov %eax,%esi
> ffffffff804b9b5f: 635 e8 3e f5 ff ff callq ffffffff804b90a2 <inet_ehash_bucket>
> ffffffff804b9b64: 511 48 89 c5 mov %rax,%rbp
> ffffffff804b9b67: 6 44 89 e8 mov %r13d,%eax
> ffffffff804b9b6a: 0 23 43 14 and 0x14(%rbx),%eax
> ffffffff804b9b6d: 497 4c 8d 24 85 00 00 00 lea 0x0(,%rax,4),%r12
> ffffffff804b9b74: 0 00
> ffffffff804b9b75: 1 4c 03 63 08 add 0x8(%rbx),%r12
> ffffffff804b9b79: 0 48 8b 45 00 mov 0x0(%rbp),%rax
> ffffffff804b9b7d: 470 0f 18 08 prefetcht0 (%rax)
> ffffffff804b9b80: 0 4c 89 e7 mov %r12,%rdi
> ffffffff804b9b83: 1089 e8 32 cd 05 00 callq ffffffff805168ba <_read_lock>
> ffffffff804b9b88: 6752 48 8b 55 00 mov 0x0(%rbp),%rdx
> ffffffff804b9b8c: 598 eb 2c jmp ffffffff804b9bba <__inet_lookup_established+0xa8>
> ffffffff804b9b8e: 447 48 81 3c 24 d0 15 ab cmpq $0xffffffff80ab15d0,(%rsp)
> ffffffff804b9b95: 0 80
> ffffffff804b9b96: 1119 75 1f jne ffffffff804b9bb7 <__inet_lookup_established+0xa5>
> ffffffff804b9b98: 21 4c 39 b8 30 02 00 00 cmp %r15,0x230(%rax)
> ffffffff804b9b9f: 0 75 16 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5>
> ffffffff804b9ba1: 492 44 39 b0 38 02 00 00 cmp %r14d,0x238(%rax)
> ffffffff804b9ba8: 0 75 0d jne ffffffff804b9bb7 <__inet_lookup_established+0xa5>
> ffffffff804b9baa: 0 8b 52 fc mov -0x4(%rdx),%edx
> ffffffff804b9bad: 451 85 d2 test %edx,%edx
> ffffffff804b9baf: 0 74 67 je ffffffff804b9c18 <__inet_lookup_established+0x106>
> ffffffff804b9bb1: 0 3b 54 24 40 cmp 0x40(%rsp),%edx
> ffffffff804b9bb5: 0 74 61 je ffffffff804b9c18 <__inet_lookup_established+0x106>
> ffffffff804b9bb7: 0 48 89 ca mov %rcx,%rdx
> ffffffff804b9bba: 402 48 85 d2 test %rdx,%rdx
> ffffffff804b9bbd: 1006 74 12 je ffffffff804b9bd1 <__inet_lookup_established+0xbf>
> ffffffff804b9bbf: 0 48 8d 42 f8 lea -0x8(%rdx),%rax
> ffffffff804b9bc3: 821 48 8b 0a mov (%rdx),%rcx
> ffffffff804b9bc6: 78 44 39 68 2c cmp %r13d,0x2c(%rax)
> ffffffff804b9bca: 4 0f 18 09 prefetcht0 (%rcx)
> ffffffff804b9bcd: 685 75 e8 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5>
> ffffffff804b9bcf: 139502 eb bd jmp ffffffff804b9b8e <__inet_lookup_established+0x7c>
> ffffffff804b9bd1: 0 48 8b 55 08 mov 0x8(%rbp),%rdx
> ffffffff804b9bd5: 0 eb 26 jmp ffffffff804b9bfd <__inet_lookup_established+0xeb>
> ffffffff804b9bd7: 0 48 81 3c 24 d0 15 ab cmpq $0xffffffff80ab15d0,(%rsp)
> ffffffff804b9bde: 0 80
> ffffffff804b9bdf: 0 75 19 jne ffffffff804b9bfa <__inet_lookup_established+0xe8>
> ffffffff804b9be1: 0 4c 39 78 40 cmp %r15,0x40(%rax)
> ffffffff804b9be5: 0 75 13 jne ffffffff804b9bfa <__inet_lookup_established+0xe8>
> ffffffff804b9be7: 0 44 39 70 48 cmp %r14d,0x48(%rax)
> ffffffff804b9beb: 0 75 0d jne ffffffff804b9bfa <__inet_lookup_established+0xe8>
> ffffffff804b9bed: 0 8b 52 fc mov -0x4(%rdx),%edx
> ffffffff804b9bf0: 0 85 d2 test %edx,%edx
> ffffffff804b9bf2: 0 74 24 je ffffffff804b9c18 <__inet_lookup_established+0x106>
> ffffffff804b9bf4: 0 3b 54 24 40 cmp 0x40(%rsp),%edx
> ffffffff804b9bf8: 0 74 1e je ffffffff804b9c18 <__inet_lookup_established+0x106>
> ffffffff804b9bfa: 0 48 89 ca mov %rcx,%rdx
> ffffffff804b9bfd: 0 48 85 d2 test %rdx,%rdx
> ffffffff804b9c00: 0 74 12 je ffffffff804b9c14 <__inet_lookup_established+0x102>
> ffffffff804b9c02: 0 48 8d 42 f8 lea -0x8(%rdx),%rax
> ffffffff804b9c06: 0 48 8b 0a mov (%rdx),%rcx
> ffffffff804b9c09: 0 44 39 68 2c cmp %r13d,0x2c(%rax)
> ffffffff804b9c0d: 0 0f 18 09 prefetcht0 (%rcx)
> ffffffff804b9c10: 0 75 e8 jne ffffffff804b9bfa <__inet_lookup_established+0xe8>
> ffffffff804b9c12: 0 eb c3 jmp ffffffff804b9bd7 <__inet_lookup_established+0xc5>
> ffffffff804b9c14: 0 31 c0 xor %eax,%eax
> ffffffff804b9c16: 0 eb 04 jmp ffffffff804b9c1c <__inet_lookup_established+0x10a>
> ffffffff804b9c18: 441 f0 ff 40 28 lock incl 0x28(%rax)
> ffffffff804b9c1c: 1442 f0 41 ff 04 24 lock incl (%r12)
> ffffffff804b9c21: 476 41 5b pop %r11
> ffffffff804b9c23: 1 5b pop %rbx
> ffffffff804b9c24: 0 5d pop %rbp
> ffffffff804b9c25: 475 41 5c pop %r12
> ffffffff804b9c27: 0 41 5d pop %r13
> ffffffff804b9c29: 1 41 5e pop %r14
> ffffffff804b9c2b: 494 41 5f pop %r15
> ffffffff804b9c2d: 0 c3 retq
> ffffffff804b9c2e: 0 90 nop
> ffffffff804b9c2f: 0 90 nop
>
> 80% of the overhead comes from cachemisses here:
>
> ffffffff804b9bc6: 78 44 39 68 2c cmp %r13d,0x2c(%rax)
> ffffffff804b9bca: 4 0f 18 09 prefetcht0 (%rcx)
> ffffffff804b9bcd: 685 75 e8 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5>
> ffffffff804b9bcf: 139502 eb bd jmp ffffffff804b9b8e <__inet_lookup_established+0x7c>
>
> corresponding to:
>
> (gdb) list *0xffffffff804b9bc6
> 0xffffffff804b9bc6 is in __inet_lookup_established (net/ipv4/inet_hashtables.c:237).
> 232 rwlock_t *lock = inet_ehash_lockp(hashinfo, hash);
> 233
> 234 prefetch(head->chain.first);
> 235 read_lock(lock);
> 236 sk_for_each(sk, node, &head->chain) {
> 237 if (INET_MATCH(sk, net, hash, acookie,
> 238 saddr, daddr, ports, dif))
> 239 goto hit; /* You sunk my battleship! */
> 240 }
> 241
>
> Seeing the first hard cachemiss on hash lookups is a familiar and
> partly expected pattern - it is the first thing that touches
> cache-cold data structures.
>
> Seeing 1.4% of the total tbench overhead go into this single
> cachemiss is a bit surprising to me though: tbench works via
> long-lived connections (TCP establishment costs are nowhere to be
> seen in the profiles) so the socket hash should be relatively stable
> and read-mostly on most CPUs in theory. The CPUs here have 2MB of L2
> cache per socket.
>
> Could we be somehow dirtying these cachelines perhaps, causing
> unnecessary cachemisses in hash lookups? Is the hash linkage portion
> of the socket data structure frequently dirtied? Padding that to 64
> bytes (or next to 64 bytes worth of read-mostly fields) could perhaps
> give us a +1.7% tbench speedup.
>

I am not seeing this on net-next-2.6, of course, thanks to RCU

Could it be that several tbench sockets are hashed onto the same chain?

tbench uses 127.0.0.1 as both source and destination address for its sockets;
the server binds to port 7003.


static inline unsigned int inet_ehashfn(struct net *net,
					const __be32 laddr, const __u16 lport,
					const __be32 faddr, const __be16 fport)
{
	return jhash_3words((__force __u32) laddr,
			    (__force __u32) faddr,
			    ((__u32) lport) << 16 | (__force __u32)fport,
			    inet_ehash_secret + net_hash_mix(net));
}

Hum... should be OK, thanks to jhash.
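
For illustration, a minimal userspace sketch of that sanity check - hash a
batch of connections that differ only in source port and report the longest
chain. The mixer below is only a stand-in for the real jhash_3words(), and
the bucket/connection counts are arbitrary:

#include <stdio.h>
#include <stdint.h>

/* stand-in mixer, NOT the kernel's jhash_3words() */
static uint32_t mix32(uint32_t a, uint32_t b, uint32_t c)
{
	uint32_t h = a * 0x9e3779b1u;
	h ^= b; h *= 0x85ebca6bu;
	h ^= c; h *= 0xc2b2ae35u;
	h ^= h >> 16;
	return h;
}

int main(void)
{
	enum { BUCKETS = 65536, CONNS = 64 };
	static int chain[BUCKETS];
	uint32_t laddr = 0x7f000001, faddr = 0x7f000001; /* 127.0.0.1 */
	int max = 0;

	for (int i = 0; i < CONNS; i++) {
		uint16_t lport = 40000 + i, fport = 7003;
		uint32_t h = mix32(laddr, faddr,
				   ((uint32_t)lport << 16) | fport);
		chain[h & (BUCKETS - 1)]++;
	}
	for (int i = 0; i < BUCKETS; i++)
		if (chain[i] > max)
			max = chain[i];
	printf("longest chain: %d\n", max);
	return 0;
}

With any decent mixer the longest chain should stay at 1 or 2 here, which is
consistent with the "should be OK" conclusion above.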

Maybe it's the same problem as eth_type_trans():

You get a cache line miss because the socket we handle in the chain was previously
handled by another CPU (sk->refcnt having been dirtied by that other CPU).


ffffffff804b9bc6: 78 44 39 68 2c cmp %r13d,0x2c(%rax)
ffffffff804b9bca: 4 0f 18 09 prefetcht0 (%rcx)

ffffffff804b9bcd: 685 75 e8 jne ffffffff804b9bb7 <__inet_lookup_established+0xa5>
< "jne" stalls beccause CPU must bring to its cache 0x2c(%rax) to perform compare >

ffffffff804b9bcf: 139502 eb bd jmp ffffffff804b9b8e <__inet_lookup_established+0x7c>

Even if you pad/move refcnt somewhere else in sk, you'll still need to take a
reference on it, so it won't help very much.
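
For illustration, a hedged sketch of the padding idea - the struct and field
names are invented, not the real struct sock layout. The point is to keep the
fields the chain walk reads away from the refcount line, though as noted
above the lookup still ends by taking a reference, so this only moves the
miss rather than eliminating it:

#include <stdatomic.h>

/* hypothetical layout sketch, NOT the real struct sock */
struct sock_sketch {
	/* read-mostly: everything the established-hash walk compares */
	unsigned int hash;
	unsigned int ports;              /* sport << 16 | dport */
	unsigned int saddr, daddr;
	struct sock_sketch *chain_next;  /* hash chain linkage */

	/*
	 * frequently dirtied: bumped by whichever CPU takes a reference;
	 * forcing it onto its own 64-byte line keeps those writes from
	 * invalidating the lookup fields above
	 */
	_Atomic int refcnt __attribute__((aligned(64)));
};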

2008-11-17 22:17:16

by Eric Dumazet

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>> 100.000000 total
>> ................
>> 1.469183 tcp_current_mss
>
> hits (total: 146918)
> .........
> ffffffff804c5237: 526 <tcp_current_mss>:
> ffffffff804c5237: 526 41 54 push %r12
> ffffffff804c5239: 5929 55 push %rbp
> ffffffff804c523a: 32 53 push %rbx
> ffffffff804c523b: 294 48 89 fb mov %rdi,%rbx
> ffffffff804c523e: 539 48 83 ec 30 sub $0x30,%rsp
> ffffffff804c5242: 2590 85 f6 test %esi,%esi
> ffffffff804c5244: 444 48 8b 4f 78 mov 0x78(%rdi),%rcx
> ffffffff804c5248: 521 8b af 4c 04 00 00 mov 0x44c(%rdi),%ebp
> ffffffff804c524e: 791 74 2a je ffffffff804c527a <tcp_current_mss+0x43>
> ffffffff804c5250: 433 8b 87 00 01 00 00 mov 0x100(%rdi),%eax
> ffffffff804c5256: 236 c1 e0 10 shl $0x10,%eax
> ffffffff804c5259: 191 89 c2 mov %eax,%edx
> ffffffff804c525b: 487 23 97 fc 00 00 00 and 0xfc(%rdi),%edx
> ffffffff804c5261: 362 39 c2 cmp %eax,%edx
> ffffffff804c5263: 342 75 15 jne ffffffff804c527a <tcp_current_mss+0x43>
> ffffffff804c5265: 473 45 31 e4 xor %r12d,%r12d
> ffffffff804c5268: 221 8b 87 00 04 00 00 mov 0x400(%rdi),%eax
> ffffffff804c526e: 194 3b 87 80 04 00 00 cmp 0x480(%rdi),%eax
> ffffffff804c5274: 445 41 0f 94 c4 sete %r12b
> ffffffff804c5278: 261 eb 03 jmp ffffffff804c527d <tcp_current_mss+0x46>
> ffffffff804c527a: 0 45 31 e4 xor %r12d,%r12d
> ffffffff804c527d: 185 48 85 c9 test %rcx,%rcx
> ffffffff804c5280: 686 74 15 je ffffffff804c5297 <tcp_current_mss+0x60>
> ffffffff804c5282: 1806 8b 71 7c mov 0x7c(%rcx),%esi
> ffffffff804c5285: 1 3b b3 5c 03 00 00 cmp 0x35c(%rbx),%esi
> ffffffff804c528b: 21 74 0a je ffffffff804c5297 <tcp_current_mss+0x60>
> ffffffff804c528d: 0 48 89 df mov %rbx,%rdi
> ffffffff804c5290: 0 e8 8b fb ff ff callq ffffffff804c4e20 <tcp_sync_mss>
> ffffffff804c5295: 0 89 c5 mov %eax,%ebp
> ffffffff804c5297: 864 48 8d 4c 24 28 lea 0x28(%rsp),%rcx
> ffffffff804c529c: 634 48 8d 54 24 10 lea 0x10(%rsp),%rdx
> ffffffff804c52a1: 995 31 f6 xor %esi,%esi
> ffffffff804c52a3: 0 48 89 df mov %rbx,%rdi
> ffffffff804c52a6: 2 e8 f2 fe ff ff callq ffffffff804c519d <tcp_established_options>
> ffffffff804c52ab: 859 8b 8b e8 03 00 00 mov 0x3e8(%rbx),%ecx
> ffffffff804c52b1: 936 83 c0 14 add $0x14,%eax
> ffffffff804c52b4: 6 0f b7 d1 movzwl %cx,%edx
> ffffffff804c52b7: 0 39 d0 cmp %edx,%eax
> ffffffff804c52b9: 911 74 04 je ffffffff804c52bf <tcp_current_mss+0x88>
> ffffffff804c52bb: 0 29 d0 sub %edx,%eax
> ffffffff804c52bd: 0 29 c5 sub %eax,%ebp
> ffffffff804c52bf: 0 45 85 e4 test %r12d,%r12d
> ffffffff804c52c2: 6894 89 e8 mov %ebp,%eax
> ffffffff804c52c4: 0 74 38 je ffffffff804c52fe <tcp_current_mss+0xc7>
> ffffffff804c52c6: 990 48 8b 83 68 03 00 00 mov 0x368(%rbx),%rax
> ffffffff804c52cd: 642 8b b3 04 01 00 00 mov 0x104(%rbx),%esi
> ffffffff804c52d3: 3 48 89 df mov %rbx,%rdi
> ffffffff804c52d6: 240 66 2b 70 30 sub 0x30(%rax),%si
> ffffffff804c52da: 588 66 2b b3 7e 03 00 00 sub 0x37e(%rbx),%si
> ffffffff804c52e1: 2 66 29 ce sub %cx,%si
> ffffffff804c52e4: 284 ff ce dec %esi
> ffffffff804c52e6: 664 0f b7 f6 movzwl %si,%esi
> ffffffff804c52e9: 2 e8 0a fb ff ff callq ffffffff804c4df8 <tcp_bound_to_half_wnd>
> ffffffff804c52ee: 68 0f b7 d0 movzwl %ax,%edx
> ffffffff804c52f1: 1870 89 c1 mov %eax,%ecx
> ffffffff804c52f3: 0 89 d0 mov %edx,%eax
> ffffffff804c52f5: 0 31 d2 xor %edx,%edx
> ffffffff804c52f7: 2135 f7 f5 div %ebp
> ffffffff804c52f9: 107010 89 c8 mov %ecx,%eax
> ffffffff804c52fb: 1670 66 29 d0 sub %dx,%ax
> ffffffff804c52fe: 0 66 89 83 ea 03 00 00 mov %ax,0x3ea(%rbx)
> ffffffff804c5305: 4 48 83 c4 30 add $0x30,%rsp
> ffffffff804c5309: 855 89 e8 mov %ebp,%eax
> ffffffff804c530b: 0 5b pop %rbx
> ffffffff804c530c: 797 5d pop %rbp
> ffffffff804c530d: 0 41 5c pop %r12
> ffffffff804c530f: 0 c3 retq
>
> apparently this division causes 1.0% of tbench overhead:
>
> ffffffff804c52f5: 0 31 d2 xor %edx,%edx
> ffffffff804c52f7: 2135 f7 f5 div %ebp
> ffffffff804c52f9: 107010 89 c8 mov %ecx,%eax
>
> (gdb) list *0xffffffff804c52f7
> 0xffffffff804c52f7 is in tcp_current_mss (net/ipv4/tcp_output.c:1078).
> 1073 inet_csk(sk)->icsk_af_ops->net_header_len -
> 1074 inet_csk(sk)->icsk_ext_hdr_len -
> 1075 tp->tcp_header_len);
> 1076
> 1077 xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
> 1078 xmit_size_goal -= (xmit_size_goal % mss_now);
> 1079 }
> 1080 tp->xmit_size_goal = xmit_size_goal;
> 1081
> 1082 return mss_now;
> (gdb)
>
> it's this division:
>
> if (doing_tso) {
> [...]
> xmit_size_goal -= (xmit_size_goal % mss_now);
>
> Has no-one hit this before? Perhaps this is why switching loopback
> networking to TSO had a performance impact for others?

Yes, I mentioned it later. But apparently you don't read my mails, so
I will just stop now.

>
> It's still a bit weird ... how can a single division cause this much
> overhead? tcp_bound_to_half_wnd() [which is called straight before
> this sequence] seems low-overhead.
>
> Ingo
>
>
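
For illustration, one hedged way the division could be dodged - a sketch
under invented names (struct goal_cache and its fields), not the fix that
actually went in: cache the rounded goal and redo the modulo only when its
inputs change, which for a long-lived tbench connection is almost never.
It assumes a nonzero mss, as the caller guarantees in practice:

struct goal_cache {
	unsigned int last_goal;  /* last unrounded xmit_size_goal seen */
	unsigned int last_mss;   /* mss_now it was rounded against */
	unsigned int rounded;    /* cached result of the rounding */
};

static unsigned int round_xmit_goal(struct goal_cache *c,
				    unsigned int goal, unsigned int mss)
{
	if (goal != c->last_goal || mss != c->last_mss) {
		c->last_goal = goal;
		c->last_mss = mss;
		c->rounded = goal - (goal % mss);  /* the one expensive div */
	}
	return c->rounded;
}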

2008-11-17 22:20:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Ingo Molnar <[email protected]> wrote:

> 100.000000 total
> ................
> 1.385125 tcp_sendmsg

this too is spread out, no spikes i noticed.

The subsequent functions also seem to be spread out pretty
evenly, with no particular spikes visible.

Ingo

2008-11-17 22:27:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Eric Dumazet <[email protected]> wrote:

> Ingo Molnar wrote:
>> * Ingo Molnar <[email protected]> wrote:
>>
>>> 100.000000 total
>>> ................
>>> 1.469183 tcp_current_mss
>>
>> hits (total: 146918)
>> .........
>> ffffffff804c5237: 526 <tcp_current_mss>:
>> ffffffff804c5237: 526 41 54 push %r12
>> ffffffff804c5239: 5929 55 push %rbp
>> ffffffff804c523a: 32 53 push %rbx
>> ffffffff804c523b: 294 48 89 fb mov %rdi,%rbx
>> ffffffff804c523e: 539 48 83 ec 30 sub $0x30,%rsp
>> ffffffff804c5242: 2590 85 f6 test %esi,%esi
>> ffffffff804c5244: 444 48 8b 4f 78 mov 0x78(%rdi),%rcx
>> ffffffff804c5248: 521 8b af 4c 04 00 00 mov 0x44c(%rdi),%ebp
>> ffffffff804c524e: 791 74 2a je ffffffff804c527a <tcp_current_mss+0x43>
>> ffffffff804c5250: 433 8b 87 00 01 00 00 mov 0x100(%rdi),%eax
>> ffffffff804c5256: 236 c1 e0 10 shl $0x10,%eax
>> ffffffff804c5259: 191 89 c2 mov %eax,%edx
>> ffffffff804c525b: 487 23 97 fc 00 00 00 and 0xfc(%rdi),%edx
>> ffffffff804c5261: 362 39 c2 cmp %eax,%edx
>> ffffffff804c5263: 342 75 15 jne ffffffff804c527a <tcp_current_mss+0x43>
>> ffffffff804c5265: 473 45 31 e4 xor %r12d,%r12d
>> ffffffff804c5268: 221 8b 87 00 04 00 00 mov 0x400(%rdi),%eax
>> ffffffff804c526e: 194 3b 87 80 04 00 00 cmp 0x480(%rdi),%eax
>> ffffffff804c5274: 445 41 0f 94 c4 sete %r12b
>> ffffffff804c5278: 261 eb 03 jmp ffffffff804c527d <tcp_current_mss+0x46>
>> ffffffff804c527a: 0 45 31 e4 xor %r12d,%r12d
>> ffffffff804c527d: 185 48 85 c9 test %rcx,%rcx
>> ffffffff804c5280: 686 74 15 je ffffffff804c5297 <tcp_current_mss+0x60>
>> ffffffff804c5282: 1806 8b 71 7c mov 0x7c(%rcx),%esi
>> ffffffff804c5285: 1 3b b3 5c 03 00 00 cmp 0x35c(%rbx),%esi
>> ffffffff804c528b: 21 74 0a je ffffffff804c5297 <tcp_current_mss+0x60>
>> ffffffff804c528d: 0 48 89 df mov %rbx,%rdi
>> ffffffff804c5290: 0 e8 8b fb ff ff callq ffffffff804c4e20 <tcp_sync_mss>
>> ffffffff804c5295: 0 89 c5 mov %eax,%ebp
>> ffffffff804c5297: 864 48 8d 4c 24 28 lea 0x28(%rsp),%rcx
>> ffffffff804c529c: 634 48 8d 54 24 10 lea 0x10(%rsp),%rdx
>> ffffffff804c52a1: 995 31 f6 xor %esi,%esi
>> ffffffff804c52a3: 0 48 89 df mov %rbx,%rdi
>> ffffffff804c52a6: 2 e8 f2 fe ff ff callq ffffffff804c519d <tcp_established_options>
>> ffffffff804c52ab: 859 8b 8b e8 03 00 00 mov 0x3e8(%rbx),%ecx
>> ffffffff804c52b1: 936 83 c0 14 add $0x14,%eax
>> ffffffff804c52b4: 6 0f b7 d1 movzwl %cx,%edx
>> ffffffff804c52b7: 0 39 d0 cmp %edx,%eax
>> ffffffff804c52b9: 911 74 04 je ffffffff804c52bf <tcp_current_mss+0x88>
>> ffffffff804c52bb: 0 29 d0 sub %edx,%eax
>> ffffffff804c52bd: 0 29 c5 sub %eax,%ebp
>> ffffffff804c52bf: 0 45 85 e4 test %r12d,%r12d
>> ffffffff804c52c2: 6894 89 e8 mov %ebp,%eax
>> ffffffff804c52c4: 0 74 38 je ffffffff804c52fe <tcp_current_mss+0xc7>
>> ffffffff804c52c6: 990 48 8b 83 68 03 00 00 mov 0x368(%rbx),%rax
>> ffffffff804c52cd: 642 8b b3 04 01 00 00 mov 0x104(%rbx),%esi
>> ffffffff804c52d3: 3 48 89 df mov %rbx,%rdi
>> ffffffff804c52d6: 240 66 2b 70 30 sub 0x30(%rax),%si
>> ffffffff804c52da: 588 66 2b b3 7e 03 00 00 sub 0x37e(%rbx),%si
>> ffffffff804c52e1: 2 66 29 ce sub %cx,%si
>> ffffffff804c52e4: 284 ff ce dec %esi
>> ffffffff804c52e6: 664 0f b7 f6 movzwl %si,%esi
>> ffffffff804c52e9: 2 e8 0a fb ff ff callq ffffffff804c4df8 <tcp_bound_to_half_wnd>
>> ffffffff804c52ee: 68 0f b7 d0 movzwl %ax,%edx
>> ffffffff804c52f1: 1870 89 c1 mov %eax,%ecx
>> ffffffff804c52f3: 0 89 d0 mov %edx,%eax
>> ffffffff804c52f5: 0 31 d2 xor %edx,%edx
>> ffffffff804c52f7: 2135 f7 f5 div %ebp
>> ffffffff804c52f9: 107010 89 c8 mov %ecx,%eax
>> ffffffff804c52fb: 1670 66 29 d0 sub %dx,%ax
>> ffffffff804c52fe: 0 66 89 83 ea 03 00 00 mov %ax,0x3ea(%rbx)
>> ffffffff804c5305: 4 48 83 c4 30 add $0x30,%rsp
>> ffffffff804c5309: 855 89 e8 mov %ebp,%eax
>> ffffffff804c530b: 0 5b pop %rbx
>> ffffffff804c530c: 797 5d pop %rbp
>> ffffffff804c530d: 0 41 5c pop %r12
>> ffffffff804c530f: 0 c3 retq
>>
>> apparently this division causes 1.0% of tbench overhead:
>>
>> ffffffff804c52f5: 0 31 d2 xor %edx,%edx
>> ffffffff804c52f7: 2135 f7 f5 div %ebp
>> ffffffff804c52f9: 107010 89 c8 mov %ecx,%eax
>>
>> (gdb) list *0xffffffff804c52f7
>> 0xffffffff804c52f7 is in tcp_current_mss (net/ipv4/tcp_output.c:1078).
>> 1073 inet_csk(sk)->icsk_af_ops->net_header_len -
>> 1074 inet_csk(sk)->icsk_ext_hdr_len -
>> 1075 tp->tcp_header_len);
>> 1076
>> 1077 xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
>> 1078 xmit_size_goal -= (xmit_size_goal % mss_now);
>> 1079 }
>> 1080 tp->xmit_size_goal = xmit_size_goal;
>> 1081
>> 1082 return mss_now;
>> (gdb)
>>
>> it's this division:
>>
>> if (doing_tso) {
>> [...]
>> xmit_size_goal -= (xmit_size_goal % mss_now);
>>
>> Has no-one hit this before? Perhaps this is why switching loopback
>> networking to TSO had a performance impact for others?
>
> Yes, I mentioned it later. [...]

i see - i just caught up with some of my inbox from today.

> [...] But apparently you don't read my mails, so I will just stop
> now.

Sorry, i spent my time looking at the profile output.

Ingo

2008-11-17 22:39:43

by Eric Dumazet

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Ingo Molnar wrote:
> * Eric Dumazet <[email protected]> wrote:
>
>> Ingo Molnar wrote:

>>> it's this division:
>>>
>>> if (doing_tso) {
>>> [...]
>>> xmit_size_goal -= (xmit_size_goal % mss_now);
>>>
>>> Has no-one hit this before? Perhaps this is why switching loopback
>>> networking to TSO had a performance impact for others?
>> Yes, I mentioned it later. [...]
>
> i see - i just caught up with some of my inbox from today.
>
>> [...] But apparently you dont read my mails, so I will just stop
>> now.
>
> Sorry, i spent my time looking at the profile output.
>

No problem Ingo, I am very glad you take so much time to profile the kernel ;)

I had too many problems with profilers on my dev machine lately :(

2008-11-17 22:48:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Mon, 17 Nov 2008 17:11:35 +0100
>
> > Ouch, +4% from a oneliner networking change? That's a _huge_ speedup
> > compared to the things we were after in scheduler land.
>
> The scheduler has accounted for at least 10% of the tbench
> regressions at this point, what are you talking about?

yeah, you are probably right when it comes to task migration policy
impact - that can have effects in that range. (and that, you have to
accept, is a fundamentally hard and fragile job to get right, as it
involves observing the past and predicting the future out of it - at
1.3 million events per second)

So above i was just talking about straight scheduling code overhead.
(that cannot have been +10% of the total - as the whole scheduler only
takes 7% total - TLB flush and FPU restore overhead included. Even the
hrtimer bits were about 1% of the total.)

Ingo

2008-11-17 23:42:58

by Eric Dumazet

[permalink] [raw]
Subject: Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 25d62e6..94af6a7 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -128,7 +128,7 @@ static inline void random_ether_addr(u8 *addr)
*
* Compare two ethernet addresses, returns 0 if equal
*/
-static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2)
+static __always_inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2)
{
const u16 *a = (const u16 *) addr1;
const u16 *b = (const u16 *) addr2;
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index b9d85af..30b60b2 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -162,7 +162,12 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)

skb->dev = dev;
skb_reset_mac_header(skb);
- skb_pull(skb, ETH_HLEN);
+ /*
+ * Hand coded skb_pull(skb, ETH_HLEN) to avoid a function call
+ */
+ if (likely(skb->len >= ETH_HLEN))
+ __skb_pull(skb, ETH_HLEN);
+
eth = eth_hdr(skb);

if (is_multicast_ether_addr(eth->h_dest)) {


Attachments:
eth_type_trans_speedup.patch (1.03 kB)

2008-11-18 00:02:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28



On Tue, 18 Nov 2008, Eric Dumazet wrote:
> > *
> > * Compare two ethernet addresses, returns 0 if equal
> > */
> > static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2)
> > {
> > const u16 *a = (const u16 *) addr1;
> > const u16 *b = (const u16 *) addr2;
> >
> > BUILD_BUG_ON(ETH_ALEN != 6);
> > return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;

Btw, at least on some Intel CPUs, it would be faster to do this as a
32-bit xor and a 16-bit xor. And if we can know that there are always 2
bytes at the end (because of how the thing was allocated), it's faster
still to do it as a 64-bit xor and a mask.

And that's true even if the addresses are only 2-byte aligned.

The code that gcc generates for "memcmp()" for a constant-size small data
thing is sadly crap. It always generates a "rep cmpsb", even if the size
is something really trivial like 4 bytes, and even if you compare for
exact equality rather than a smaller/greater-than. Gaah.

Linus
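
For concreteness, a hedged userspace sketch of the 32-bit + 16-bit variant
described above; the function name is invented, and it assumes an
architecture with cheap unaligned loads (the memcpy() calls compile down to
plain loads on x86):

#include <stdint.h>
#include <string.h>

/* returns nonzero if the two 6-byte addresses differ */
static inline int ether_addr_differs(const uint8_t *a, const uint8_t *b)
{
	uint32_t lo_a, lo_b;
	uint16_t hi_a, hi_b;

	memcpy(&lo_a, a, 4);      /* first 4 bytes, one 32-bit load */
	memcpy(&lo_b, b, 4);
	memcpy(&hi_a, a + 4, 2);  /* last 2 bytes, one 16-bit load */
	memcpy(&hi_b, b + 4, 2);
	return ((lo_a ^ lo_b) | (uint32_t)(hi_a ^ hi_b)) != 0;
}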

2008-11-18 05:16:56

by David Miller

[permalink] [raw]
Subject: Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Ingo Molnar <[email protected]>
Date: Mon, 17 Nov 2008 22:26:57 +0100

> eth->h_proto access.

Yes, this is the first time a packet is touched on receive.

> Given that this workload does localhost networking, my guess would be
> that eth->h_proto is bouncing around between 16 CPUs? At minimum this
> read-mostly field should be separated from the bouncing bits.

It's the packet contents, there is no way to "separate it".

And it should be unlikely to bounce on your system under tbench,
the senders and receivers should hang out on the same CPU unless
something completely stupid is happening.

That's why I like running tbench with a num_threads command
line argument equal to the number of CPUs; every CPU gets
the two threads talking to each other over the TCP socket.

2008-11-18 05:24:04

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Eric Dumazet <[email protected]>
Date: Mon, 17 Nov 2008 23:15:50 +0100

> Yes, I mentioned it later. But apparently you don't read my mails, so
> I will just stop now.

Yeah I was going to mention this too :-/

2008-11-18 05:36:52

by Eric Dumazet

[permalink] [raw]
Subject: Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

David Miller wrote:
> From: Ingo Molnar <[email protected]>
> Date: Mon, 17 Nov 2008 22:26:57 +0100
>
>> eth->h_proto access.
>
> Yes, this is the first time a packet is touched on receive.

Well, not exactly, since we do a

if (is_multicast_ether_addr(eth->h_dest)) {
...}

and one of the
compare_ether_addr(eth->h_dest, {dev->dev_addr | dev->broadcast})

probably it's a profiling effect...


2008-11-18 07:01:09

by David Miller

[permalink] [raw]
Subject: Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Eric Dumazet <[email protected]>
Date: Tue, 18 Nov 2008 06:35:46 +0100

> David Miller wrote:
> > From: Ingo Molnar <[email protected]>
> > Date: Mon, 17 Nov 2008 22:26:57 +0100
> >
> >> eth->h_proto access.
> > Yes, this is the first time a packet is touched on receive.
>
> Well, not exactly, since we do a
>
> if (is_multicast_ether_addr(eth->h_dest)) {
> ...}
>
> and one of the
> compare_ether_addr(eth->h_dest, {dev->dev_addr | dev->broadcast})
>
> probably it's a profiling effect...

True.

2008-11-18 08:30:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Mon, 17 Nov 2008 22:26:57 +0100
>
> > eth->h_proto access.
>
> Yes, this is the first time a packet is touched on receive.
>
> > Given that this workload does localhost networking, my guess would be
> > that eth->h_proto is bouncing around between 16 CPUs? At minimum this
> > read-mostly field should be separated from the bouncing bits.
>
> It's the packet contents, there is no way to "separate it".
>
> And it should be unlikely to bounce on your system under tbench, the
> senders and receivers should hang out on the same CPU unless
> something completely stupid is happening.
>
> That's why I like running tbench with a num_threads command line
> argument equal to the number of CPUs; every CPU gets the two threads
> talking to each other over the TCP socket.

yeah - and i posted the numbers for that too - it's the same
throughput, within ~1% of noise.

Ingo

2008-11-18 08:35:51

by Eric Dumazet

[permalink] [raw]
Subject: Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 25d62e6..ee0df09 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -136,6 +136,47 @@ static inline unsigned compare_ether_addr(const u8 *addr1, const u8 *addr2)
BUILD_BUG_ON(ETH_ALEN != 6);
return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) != 0;
}
+
+static inline unsigned long zap_last_2bytes(unsigned long value)
+{
+#ifdef __BIG_ENDIAN
+ return value >> 16;
+#else
+ return value << 16;
+#endif
+}
+
+/**
+ * compare_ether_addr_64bits - Compare two Ethernet addresses
+ * @addr1: Pointer to an array of 8 bytes
+ * @addr2: Pointer to an other array of 8 bytes
+ *
+ * Compare two ethernet addresses, returns 0 if equal.
+ * Same result as "memcmp(addr1, addr2, ETH_ALEN)" but without conditional
+ * branches, and possibly long word memory accesses on CPUs allowing cheap
+ * unaligned memory reads.
+ * arrays = { byte1, byte2, byte3, byte4, byte5, byte6, pad1, pad2 }
+ *
+ * Please note that alignment of addr1 & addr2 is only guaranteed to be 16 bits.
+ */
+
+static inline unsigned compare_ether_addr_64bits(const u8 addr1[6+2],
+ const u8 addr2[6+2])
+{
+#if defined(CONFIG_X86)
+ unsigned long fold = *(const unsigned long *)addr1 ^
+ *(const unsigned long *)addr2;
+
+ if (sizeof(fold) == 8)
+ return zap_last_2bytes(fold) != 0;
+
+ fold |= zap_last_2bytes(*(const unsigned long *)(addr1 + 4) ^
+ *(const unsigned long *)(addr2 + 4));
+ return fold != 0;
+#else
+ return compare_ether_addr(addr1, addr2);
+#endif
+}
#endif /* __KERNEL__ */

#endif /* _LINUX_ETHERDEVICE_H */
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index b9d85af..dcfeb9b 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -166,7 +166,7 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
eth = eth_hdr(skb);

if (is_multicast_ether_addr(eth->h_dest)) {
- if (!compare_ether_addr(eth->h_dest, dev->broadcast))
+ if (!compare_ether_addr_64bits(eth->h_dest, dev->broadcast))
skb->pkt_type = PACKET_BROADCAST;
else
skb->pkt_type = PACKET_MULTICAST;
@@ -181,7 +181,7 @@ __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev)
*/

else if (1 /*dev->flags&IFF_PROMISC */ ) {
- if (unlikely(compare_ether_addr(eth->h_dest, dev->dev_addr)))
+ if (unlikely(compare_ether_addr_64bits(eth->h_dest, dev->dev_addr)))
skb->pkt_type = PACKET_OTHERHOST;
}


Attachments:
compare_ether_addr_64bits.patch (2.38 kB)

2008-11-18 08:46:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* David Miller <[email protected]> wrote:

> From: Eric Dumazet <[email protected]>
> Date: Mon, 17 Nov 2008 23:15:50 +0100
>
> > Yes, I mentioned it later. But apparently you don't read my mails,
> > so I will just stop now.
>
> Yeah I was going to mention this too :-/

I spent hours profiling the networking code, and no, i didn't read all
the incoming emails in parallel - i read them after that.

I have established beyond reasonable doubt that the scheduler is
doing the right thing with the config i've posted. Your "wakeup is two
orders of magnitude more expensive" claim, which got me to measure and
profile this stuff, is not reproducible here and this regression
should not be listed as a scheduler regression.

Ingo

2008-11-18 08:50:15

by Eric Dumazet

[permalink] [raw]
Subject: Re: eth_type_trans(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Ingo Molnar wrote:
> * David Miller <[email protected]> wrote:
>
>> From: Ingo Molnar <[email protected]>
>> Date: Mon, 17 Nov 2008 22:26:57 +0100
>>
>>> eth->h_proto access.
>> Yes, this is the first time a packet is touched on receive.
>>
>>> Given that this workload does localhost networking, my guess would be
>>> that eth->h_proto is bouncing around between 16 CPUs? At minimum this
>>> read-mostly field should be separated from the bouncing bits.
>> It's the packet contents, there is no way to "separate it".
>>
>> And it should be unlikely to bounce on your system under tbench, the
>> senders and receivers should hang out on the same CPU unless
>> something completely stupid is happening.
>>
>> That's why I like running tbench with a num_threads command line
>> argument equal to the number of CPUs; every CPU gets the two threads
>> talking to each other over the TCP socket.
>
> yeah - and i posted the numbers for that too - it's the same
> throughput, within ~1% of noise.

Thinking once again about the loopback driver, I recall a previous attempt
to call netif_receive_skb() instead of netif_rx() and pay the price
of cache line ping-pongs between CPUs.

http://kerneltrap.org/mailarchive/linux-netdev/2008/2/21/939644

Maybe we could do that, with a temporary percpu stack, like we do in softirq
when CONFIG_4KSTACKS=y

(arch/x86/kernel/irq_32.c: call_on_stack(func, stack))

And do this only if the current CPU doesn't already use its softirq_stack
(think about loopback re-entering loopback xmit because of a TCP ACK, for example).

Oh well... black magic, you are going to kill me :)

2008-11-18 09:12:31

by Nick Piggin

[permalink] [raw]
Subject: Re: ip_queue_xmit(): Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

On Tuesday 18 November 2008 07:32, Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
> > 100.000000 total
> > ................
> > 3.356152 ip_queue_xmit

> 30% of the overhead of this function comes from:
>
> ffffffff804b7203: 0 66 c7 43 06 00 00 movw $0x0,0x6(%rbx)
> ffffffff804b7209: 118 0f bf 85 40 02 00 00 movswl 0x240(%rbp),%eax
> ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx
> ffffffff804b7215: 340 85 c0 test %eax,%eax
> ffffffff804b7217: 0 79 06 jns ffffffff804b721f <ip_queue_xmit+0x1da>
> ffffffff804b7219: 107464 8b 82 9c 00 00 00 mov 0x9c(%rdx),%eax
> ffffffff804b721f: 4963 88 43 08 mov %al,0x8(%rbx)
>
> the 16-bit movw looks a bit weird. It comes from line 372:
>
> 0xffffffff804b7203 is in ip_queue_xmit (net/ipv4/ip_output.c:372).
> 367 iph = ip_hdr(skb);
> 368 *((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
> 369 if (ip_dont_fragment(sk, &rt->u.dst) && !ipfragok)
> 370 iph->frag_off = htons(IP_DF);
> 371 else
> 372 iph->frag_off = 0;
> 373 iph->ttl = ip_select_ttl(inet, &rt->u.dst);
> 374 iph->protocol = sk->sk_protocol;
> 375 iph->saddr = rt->rt_src;
> 376 iph->daddr = rt->rt_dst;
>
> the ip-header fragment flag setting to zero.
>
> 16-bit ops are an on-off love/hate affair on x86 CPUs. The trend is
> towards eliminating them as much as possible.
>
> _But_, the real overhead probably comes from:
>
> ffffffff804b7210: 10867 48 8b 54 24 58 mov 0x58(%rsp),%rdx
>
> which is the next line, the ttl field:
>
> 373 iph->ttl = ip_select_ttl(inet, &rt->u.dst);
>
> this shows that we are doing a hard cachemiss on the net-localhost
> route dst structure cacheline. We do a plain load instruction from it
> here and get a hefty cachemiss. (because 16 CPUs are banging on that
> single route)

Why would that show up right there, though? An instruction like this should
be non-blocking. Shouldn't the cost show up at some point where the
CPU executes an instruction depending on %rdx? (and good luck working out
when that happens!)

2008-11-18 09:44:31

by Nick Piggin

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

On Tuesday 18 November 2008 07:58, David Miller wrote:
> From: Linus Torvalds <[email protected]>
> Date: Mon, 17 Nov 2008 12:30:00 -0800 (PST)
>
> > On Mon, 17 Nov 2008, David Miller wrote:
> > > It's on my workstation which is a much simpler 2 processor
> > > UltraSPARC-IIIi (1.5Ghz) system.
> >
> > Ok. It could easily be something like a cache footprint issue. And while
> > I don't know my sparc cpu's very well, I think the Ultrasparc-IIIi is
> > super- scalar but does no out-of-order and speculation, no?
>
> It does only very simple speculation, but your description is accurate.

Surely it would do branch prediction, but maybe not indirect branch?
I did wonder why those indirect function calls were added everywhere
in the scheduler...

They didn't show up in the newest generation of x86 CPUs, but simpler
implementations won't handle them as well.

I wouldn't expect that to cause such a big regression on its own, but
it would still be interesting to test changing them to direct calls.

2008-11-18 12:29:23

by Mike Galbraith

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

On Mon, 2008-11-17 at 11:39 -0800, David Miller wrote:
> From: Ingo Molnar <[email protected]>
> Date: Mon, 17 Nov 2008 19:49:51 +0100
>
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > The place for the sock_rfree() hit looks a bit weird, and i'll
> > > investigate it now a bit more to place the real overhead point
> > > properly. (i already mapped the test-bit overhead: that comes from
> > > napi_disable_pending())
> >
> > ok, here's a new set of profiles. (again for tbench 64-thread on a
> > 16-way box, with v2.6.28-rc5-19-ge14c8bf and with the kernel config i
> > posted before.)
>
> Again, do a non-NMI profile and the top (at least for me)
> looks like this:
>
> samples % app name symbol name
> 473 6.3928 vmlinux finish_task_switch
> 349 4.7169 vmlinux tcp_v4_rcv
> 327 4.4195 vmlinux U3copy_from_user
> 322 4.3519 vmlinux tl0_linux32
> 178 2.4057 vmlinux tcp_ack
> 170 2.2976 vmlinux tcp_sendmsg
> 167 2.2571 vmlinux U3copy_to_user
>
> That tcp_v4_rcv() hit is %98 on the wake_up() call it does.

Easy enough, since I don't know how to do a spiffy NMI profile.. yet ;-)

I revived the 2.6.25 kernel where I tested back-ports of recent sched
fixes, and did a non-NMI profile of 2.6.22.19 and the back-port kernel.

The test kernel has all clock fixes 25->.git, the min_vruntime accuracy
fix, the native_read_tsc() fix, and back-looking buddy. No knobs turned, and
only testing one pair per CPU, so as not to take unfair advantage of back-looking
buddy. Netperf TCP_RR (hits sched harder) looks about the same.

Tbench 4 throughput was so close you would call these two twins.

2.6.22.19-smp
CPU: Core 2, speed 2400 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
vma samples % symbol name
ffffffff802e6670 575909 13.7425 copy_user_generic_string
ffffffff80422ad8 175649 4.1914 schedule
ffffffff803a522d 133152 3.1773 tcp_sendmsg
ffffffff803a9387 128911 3.0761 tcp_ack
ffffffff803b65f7 116562 2.7814 tcp_v4_rcv
ffffffff803aeac8 116541 2.7809 tcp_transmit_skb
ffffffff8039eb95 112133 2.6757 ip_queue_xmit
ffffffff80209e20 110945 2.6474 system_call
ffffffff8037b720 108277 2.5837 __kfree_skb
ffffffff803a65cd 105493 2.5173 tcp_recvmsg
ffffffff80210f87 97947 2.3372 read_tsc
ffffffff802085b6 95255 2.2730 __switch_to
ffffffff803803f1 82069 1.9584 netif_rx
ffffffff8039f645 80937 1.9313 ip_output
ffffffff8027617d 74585 1.7798 __slab_alloc
ffffffff803824a0 70928 1.6925 process_backlog
ffffffff803ad9a5 69574 1.6602 tcp_rcv_established
ffffffff80399d40 55453 1.3232 ip_rcv
ffffffff803b07d1 53256 1.2708 __tcp_push_pending_frames
ffffffff8037b49c 52565 1.2543 skb_clone
ffffffff80276e97 49690 1.1857 __kmalloc_track_caller
ffffffff80379d05 45450 1.0845 sock_wfree
ffffffff80223d82 44851 1.0702 effective_prio
ffffffff803826b6 42417 1.0122 net_rx_action
ffffffff8027684c 42341 1.0104 kfree

2.6.25.20-test-smp
CPU: Core 2, speed 2400 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
vma samples % symbol name
ffffffff80301450 576125 14.0874 copy_user_generic_string
ffffffff803cf8d9 127997 3.1298 tcp_transmit_skb
ffffffff803c9eac 125402 3.0663 tcp_ack
ffffffff80454da3 122337 2.9914 schedule
ffffffff803c673c 120401 2.9440 tcp_sendmsg
ffffffff8039aa9e 116554 2.8500 skb_release_all
ffffffff803c5abb 104840 2.5635 tcp_recvmsg
ffffffff8020a63d 92180 2.2540 __switch_to
ffffffff8020be20 79703 1.9489 system_call
ffffffff803bf460 79384 1.9411 ip_queue_xmit
ffffffff803a005c 78035 1.9081 netif_rx
ffffffff803ce56b 71223 1.7415 tcp_rcv_established
ffffffff8039ff70 66493 1.6259 process_backlog
ffffffff803d5a2d 61635 1.5071 tcp_v4_rcv
ffffffff803c1dae 60889 1.4889 __inet_lookup_established
ffffffff802126bc 54711 1.3378 native_read_tsc
ffffffff803d23bc 51843 1.2677 __tcp_push_pending_frames
ffffffff803bfb24 51821 1.2671 ip_finish_output
ffffffff8023700c 48248 1.1798 local_bh_enable
ffffffff803979bc 42221 1.0324 sock_wfree
ffffffff8039b12c 41279 1.0094 __alloc_skb

2008-11-18 16:00:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28



On Tue, 18 Nov 2008, Nick Piggin wrote:

> On Tuesday 18 November 2008 07:58, David Miller wrote:
> > From: Linus Torvalds <[email protected]>
> > >
> > > Ok. It could easily be something like a cache footprint issue. And while
> > > I don't know my sparc cpu's very well, I think the Ultrasparc-IIIi is
> > > super- scalar but does no out-of-order and speculation, no?
> >
> > It does only very simple speculation, but your description is accurate.
>
> Surely it would do branch prediction, but maybe not indirect branch?

That would be "branch target prediction" (and a BTB - "Branch Target
Buffer" to hold it), and no, I don't think Sparc does that. You can
certainly do it for in-order machines too, but I think it's fairly rare.

It's sufficiently different from the regular "pick up the address from the
static instruction stream, and also yank the kill-chain on mispredicted
direction" to be real work to do. Unlike a compare or test instruction,
it's not at all likely that you can resolve the final address in just a
single pipeline stage, and without that, it's usually too late to yank the
kill-chain.

(And perhaps equally importantly, indirect branches are relatively rare on
old-style Unix benchmarks - ie SpecInt/FP - or in databases. So it's not
something that Sparc would necessarily have spent the effort on.)

There is obviously one very special indirect jump: "ret". That's the one
that is common, and that tends to have a special branch target buffer that
is a pure stack. And for that, there is usually a special branch target
register that needs to be set up 'x' cycles before the ret in order to
avoid the stall (then the prediction is checking that register against the
branch target stack, which is somewhat akin to a regular conditional
branch comparison).

So I strongly suspect that an indirect (non-ret) branch flushes the
pipeline on sparc. It is possible that there is a "prepare to jump"
instruction that prepares the indirect branch stack (kind of a "push
prediction information"). I suspect Java sees a lot more indirect
branches than traditional Unix loads, so maybe Sun did do that.

Linus

2008-11-19 04:32:00

by Nick Piggin

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

On Wednesday 19 November 2008 02:58, Linus Torvalds wrote:
> On Tue, 18 Nov 2008, Nick Piggin wrote:
> > On Tuesday 18 November 2008 07:58, David Miller wrote:
> > > From: Linus Torvalds <[email protected]>
> > >
> > > > Ok. It could easily be something like a cache footprint issue. And
> > > > while I don't know my sparc cpu's very well, I think the
> > > > Ultrasparc-IIIi is super- scalar but does no out-of-order and
> > > > speculation, no?
> > >
> > > It does only very simple speculation, but your description is
> > > accurate.
> >
> > Surely it would do branch prediction, but maybe not indirect branch?
>
> That would be "branch target prediction" (and a BTB - "Branch Target
> Buffer" to hold it), and no, I don't think Sparc does that. You can
> certainly do it for in-order machines too, but I think it's fairly rare.
>
> It's sufficiently different from the regular "pick up the address from the
> static instruction stream, and also yank the kill-chain on mispredicted
> direction" to be real work to do. Unlike a compare or test instruction,
> it's not at all likely that you can resolve the final address in just a
> single pipeline stage, and without that, it's usually too late to yank the
> kill-chain.
>
> (And perhaps equally importantly, indirect branches are relatively rare on
> old-style Unix benchmarks - ie SpecInt/FP - or in databases. So it's not
> something that Sparc would necessarily have spent the effort on.)
>
> There is obviously one very special indirect jump: "ret". That's the one
> that is common, and that tends to have a special branch target buffer that
> is a pure stack. And for that, there is usually a special branch target
> register that needs to be set up 'x' cycles before the ret in order to
> avoid the stall (then the prediction is checking that register against the
> branch target stack, which is somewhat akin to a regular conditional
> branch comparison).
>
> So I strongly suspect that an indirect (non-ret) branch flushes the
> pipeline on sparc. It is possible that there is a "prepare to jump"
> instruction that prepares the indirect branch stack (kind of a "push
> prediction information"). I suspect Java sees a lot more indirect
> branches than traditional Unix loads, so maybe Sun did do that.

Probably true. OTOH, I've seen indirect branches get compiled to direct
branches, or the common case special-cased into a direct branch:

	if (object->fn == default_object_fn)
		default_object_fn();

That might be an easy way to test suspicions about CPU scheduler
slowdowns... (adding a likely() there, and using likely profiling, would
help ensure you got the default case right).
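
For illustration, a hedged sketch of that trick with the likely() variant
spelled out - all names here are invented, and likely() is expanded for a
userspace build:

#define likely(x) __builtin_expect(!!(x), 1)

struct object {
	void (*fn)(struct object *self);
};

static void default_object_fn(struct object *self)
{
	(void)self;  /* the common-case implementation would go here */
}

static void dispatch(struct object *obj)
{
	if (likely(obj->fn == default_object_fn))
		default_object_fn(obj);  /* direct call, easily predicted */
	else
		obj->fn(obj);            /* rare, truly indirect */
}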

2008-11-19 19:44:20

by Christoph Lameter

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

On Mon, 17 Nov 2008, Ingo Molnar wrote:

> Christoph, as per the recent analysis of Mike:
>
> http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
>
> all scheduler components of this regression have been eliminated.
>
> In fact his numbers show that scheduler speedups since 2.6.22 have
> offset and hidden most other sources of tbench regression. (i.e. the
> scheduler portion got 5% faster, hence it was able to offset a
> slowdown of 5% in other areas of the kernel that tbench triggers)

Ok, will rerun the tests tomorrow. Just got back from SC08 and need some
time to catch up.

Looks like a lot of work was done on this issue. Thanks!

2008-11-19 20:15:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Christoph Lameter <[email protected]> wrote:

> On Mon, 17 Nov 2008, Ingo Molnar wrote:
>
> > Christoph, as per the recent analysis of Mike:
> >
> > http://fixunix.com/kernel/556867-regression-benchmark-throughput-loss-a622cf6-f7160c7-pull.html
> >
> > all scheduler components of this regression have been eliminated.
> >
> > In fact his numbers show that scheduler speedups since 2.6.22 have
> > offset and hidden most other sources of tbench regression. (i.e. the
> > scheduler portion got 5% faster, hence it was able to offset a
> > slowdown of 5% in other areas of the kernel that tbench triggers)
>
> Ok, will rerun the tests tomorrow. Just got back from SC08 and need
> some time to catch up.
>
> Looks like a lot of work was done on this issue. Thanks!

You might also want to try net-next:

[remote "net-next"]
url = git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
fetch = +refs/heads/*:refs/remotes/net-next/*

Some good stuff is in there too, impacting this workload.

Ingo

2008-11-20 09:06:24

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Nick Piggin <[email protected]>
Date: Tue, 18 Nov 2008 20:44:10 +1100

> On Tuesday 18 November 2008 07:58, David Miller wrote:
> > From: Linus Torvalds <[email protected]>
> > Date: Mon, 17 Nov 2008 12:30:00 -0800 (PST)
> >
> > > On Mon, 17 Nov 2008, David Miller wrote:
> > > > It's on my workstation which is a much simpler 2 processor
> > > > UltraSPARC-IIIi (1.5Ghz) system.
> > >
> > > Ok. It could easily be something like a cache footprint issue. And while
> > > I don't know my sparc cpu's very well, I think the Ultrasparc-IIIi is
> > > super- scalar but does no out-of-order and speculation, no?
> >
> > It does only very simple speculation, but your description is accurate.
>
> Surely it would do branch prediction, but maybe not indirect branch?

Right.

2008-11-20 09:14:29

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Linus Torvalds <[email protected]>
Date: Tue, 18 Nov 2008 07:58:49 -0800 (PST)

> There is obviously one very special indirect jump: "ret". That's the one
> that is common, and that tends to have a special branch target buffer that
> is a pure stack. And for that, there is usually a special branch target
> register that needs to be set up 'x' cycles before the ret in order to
> avoid the stall (then the prediction is checking that register against the
> branch target stack, which is somewhat akin to a regular conditional
> branch comparison).

Yes, UltraSPARC has a RAS or Return Address Stack. I think it has
effectively zero latency (ie. you can call some function, immediately
"ret" and it hits the RAS). This is probably because, due to delay slots,
there is always going to be one instruction in between anyways. :)

> So I strongly suspect that an indirect (non-ret) branch flushes the
> pipeline on sparc. It is possible that there is a "prepare to jump"
> instruction that prepares the indirect branch stack (kind of a "push
> prediction information").

It doesn't flush the pipeline, it just stalls it waiting for the
address computation.

Branches are predicted and can execute in the same cycle as the
condition-code setting instruction they depend upon.

> I suspect Java sees a lot more indirect branches than traditional
> Unix loads, so maybe Sun did do that.

There really isn't anything special done here for indirect jumps,
other than pushing onto the RAS. Indirects just suck :)

2008-11-20 23:53:35

by Christoph Lameter

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

hmmm... Well we are almost there.

2.6.22:

Throughput 2526.15 MB/sec 8 procs

2.6.28-rc5:

Throughput 2486.2 MB/sec 8 procs

8p Dell 1950 and the number of processors specified on the tbench command
line.

2008-11-21 08:31:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Christoph Lameter <[email protected]> wrote:

> hmmm... Well we are almost there.
>
> 2.6.22:
>
> Throughput 2526.15 MB/sec 8 procs
>
> 2.6.28-rc5:
>
> Throughput 2486.2 MB/sec 8 procs
>
> 8p Dell 1950 and the number of processors specified on the tbench
> command line.

And with net-next we might even be able to get past that magic limit?
net-next is linus-latest plus the latest and greatest networking bits:

$ cat .git/config

[remote "net-next"]
url = git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
fetch = +refs/heads/*:refs/remotes/net-next/*

... so might be worth a test. Just to satisfy our curiosity and to
possibly close the entry :-)

Ingo

2008-11-21 08:53:57

by Eric Dumazet

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Ingo Molnar wrote:
> * Christoph Lameter <[email protected]> wrote:
>
>> hmmm... Well we are almost there.
>>
>> 2.6.22:
>>
>> Throughput 2526.15 MB/sec 8 procs
>>
>> 2.6.28-rc5:
>>
>> Throughput 2486.2 MB/sec 8 procs
>>
>> 8p Dell 1950 and the number of processors specified on the tbench
>> command line.
>
> And with net-next we might even be able to get past that magic limit?
> net-next is linus-latest plus the latest and greatest networking bits:
>
> $ cat .git/config
>
> [remote "net-next"]
> url = git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
> fetch = +refs/heads/*:refs/remotes/net-next/*
>
> ... so might be worth a test. Just to satisfy our curiosity and to
> possibly close the entry :-)
>

Well, bits in net-next are new stuff for 2.6.29, not really regression fixes,
but yes, they should give nice tbench speedups.


Now, I wish sockets and pipes did not go through the dcache; this is not
a tbench affair of course, but it matters for real workloads...

running 8 processes on an 8-way machine doing a

	for (;;)
		close(socket(AF_INET, SOCK_STREAM, 0));

is slow as hell; we hit so many contended cache lines ...

ticket spin locks are slower in this case (dcache_lock for example
is taken twice when we allocate a socket(), once in d_alloc() and again
in d_instantiate())

2008-11-21 09:03:45

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Ingo Molnar <[email protected]>
Date: Fri, 21 Nov 2008 09:30:44 +0100

>
> * Christoph Lameter <[email protected]> wrote:
>
> > hmmm... Well we are almost there.
> >
> > 2.6.22:
> >
> > Throughput 2526.15 MB/sec 8 procs
> >
> > 2.6.28-rc5:
> >
> > Throughput 2486.2 MB/sec 8 procs
> >
> > 8p Dell 1950 and the number of processors specified on the tbench
> > command line.
>
> And with net-next we might even be able to get past that magic limit?
> net-next is linus-latest plus the latest and greatest networking bits:

In any event I'm happy to toss this from the regression list.

My sparc still shows the issues and I'll profile that independently.
I'm pretty sure it's the indirect calls and the deeper stack frames
(which == 128 bytes of extra stores at each level to save the register
window), but I need to prove that first.

2008-11-21 09:05:37

by David Miller

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

From: Eric Dumazet <[email protected]>
Date: Fri, 21 Nov 2008 09:51:32 +0100

> Now, I wish sockets and pipes did not go through the dcache; this is not
> a tbench affair of course, but it matters for real workloads...
>
> running 8 processes on an 8-way machine doing a
>
> 	for (;;)
> 		close(socket(AF_INET, SOCK_STREAM, 0));
>
> is slow as hell; we hit so many contended cache lines ...
>
> ticket spin locks are slower in this case (dcache_lock for example
> is taken twice when we allocate a socket(), once in d_alloc() and again
> in d_instantiate())

As you of course know, this used to be a ton worse. At least now
these things are unhashed. :)

2008-11-21 09:19:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28


* Eric Dumazet <[email protected]> wrote:

> Ingo Molnar wrote:
>> * Christoph Lameter <[email protected]> wrote:
>>
>>> hmmm... Well we are almost there.
>>>
>>> 2.6.22:
>>>
>>> Throughput 2526.15 MB/sec 8 procs
>>>
>>> 2.6.28-rc5:
>>>
>>> Throughput 2486.2 MB/sec 8 procs
>>>
>>> 8p Dell 1950 and the number of processors specified on the tbench
>>> command line.
>>
>> And with net-next we might even be able to get past that magic limit?
>> net-next is linus-latest plus the latest and greatest networking bits:
>>
>> $ cat .git/config
>>
>> [remote "net-next"]
>> url = git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6.git
>> fetch = +refs/heads/*:refs/remotes/net-next/*
>>
>> ... so might be worth a test. Just to satisfy our curiosity and to
>> possibly close the entry :-)
>>
>
> Well, bits in net-next are new stuff for 2.6.29, not really
> regression fixes, but yes, they should give nice tbench speedups.

yeah, i know - technically these are lots-of-kernel-releases effects
so not bona fide latest-cycle regressions anyway. But it doesn't matter
what we call them, we want improvement in these metrics.

> Now, I wish sockets and pipes did not go through the dcache; this is
> not a tbench affair of course, but it matters for real workloads...
>
> running 8 processes on an 8-way machine doing a
>
> 	for (;;)
> 		close(socket(AF_INET, SOCK_STREAM, 0));
>
> is slow as hell; we hit so many contended cache lines ...
>
> ticket spin locks are slower in this case (dcache_lock for example
> is taken twice when we allocate a socket(), once in d_alloc() and
> again in d_instantiate())

hm, weird - since there's no real VFS namespace impact i fail to see
the fundamental need that causes us to hit the dcache_lock.
(perhaps there's none and this is fixable)

The general concept of mapping sockets to fds is a fundamental and
powerful abstraction. There are APIs that also connect them to the VFS
namespace (such as unix domain sockets) - but those should be special
cases, not impacting normal TCP sockets.

Ingo

2008-11-21 12:52:18

by Eric Dumazet

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

David Miller wrote:
> From: Eric Dumazet <[email protected]>
> Date: Fri, 21 Nov 2008 09:51:32 +0100
>
>> Now, I wish sockets and pipes did not go through the dcache; this is not
>> a tbench affair of course, but it matters for real workloads...
>>
>> running 8 processes on an 8-way machine doing a
>>
>> 	for (;;)
>> 		close(socket(AF_INET, SOCK_STREAM, 0));
>>
>> is slow as hell; we hit so many contended cache lines ...
>>
>> ticket spin locks are slower in this case (dcache_lock for example
>> is taken twice when we allocate a socket(), once in d_alloc() and again
>> in d_instantiate())
>
> As you of course know, this used to be a ton worse. At least now
> these things are unhashed. :)

Well, this is dust compared to what we currently have.

To allocate a socket we:
0) Do the usual file manipulation (pretty scalable these days)
(but the recent drop_file_write_access() and co slow things down a bit)
1) allocate an inode with new_inode()
This function:
- locks inode_lock,
- dirties the nr_inodes counter (see the per-CPU counter sketch below)
- dirties the inode_in_use list (for sockets, I doubt it is useful)
- dirties the superblock s_inodes list.
- dirties the last_ino counter
All these are in different cache lines of course.
2) allocate a dentry
d_alloc() takes dcache_lock,
inserts the dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
and dirties nr_dentry
3) d_instantiate() the dentry (dcache_lock taken again)
4) init_file() -> atomic_inc on sock_mnt->refcount (in case we want to umount this vfs ...)
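
For illustration, a hedged sketch of the per-CPU counter idea flagged in
step 1 - a plain C11 stand-in, not the kernel's percpu API, with NR_CPUS
and the fold-on-read policy as assumptions:

#include <stdatomic.h>

#define NR_CPUS 8

/* one counter per CPU, each padded out to its own cache line */
struct padded_counter {
	_Atomic long v;
} __attribute__((aligned(64)));

static struct padded_counter nr_inodes_pcpu[NR_CPUS];

static void nr_inodes_inc(int cpu)
{
	/* each CPU dirties only its own line - no shared-line bouncing */
	atomic_fetch_add_explicit(&nr_inodes_pcpu[cpu].v, 1,
				  memory_order_relaxed);
}

static long nr_inodes_read(void)
{
	long sum = 0;

	/* the rare reader pays the cost of folding the per-CPU deltas */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += atomic_load_explicit(&nr_inodes_pcpu[cpu].v,
					    memory_order_relaxed);
	return sum;
}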



At close() time, we must undo all of this. It is even more expensive because
of the _atomic_dec_and_lock() calls, which cause a lot of stress, and because
of the two cache lines that are touched when an element is deleted from a list.

for (i = 0; i < 1000*1000; i++)
	close(socket(AF_INET, SOCK_STREAM, 0));

Cost if run on one CPU:

real 0m1.561s
user 0m0.092s
sys 0m1.469s

If run on 8 CPUs:

real 0m27.496s
user 0m0.657s
sys 3m39.092s
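
For reference, a self-contained harness approximating that measurement - an
assumption of how the numbers were produced, not the actual test program.
Run it under time(1), e.g. "time ./sockbench 1" versus "time ./sockbench 8":

#include <stdlib.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int nproc = argc > 1 ? atoi(argv[1]) : 8;

	for (int p = 0; p < nproc; p++) {
		if (fork() == 0) {
			/* the hot loop: allocate and tear down a socket */
			for (int i = 0; i < 1000 * 1000; i++)
				close(socket(AF_INET, SOCK_STREAM, 0));
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;  /* reap every child before exiting */
	return 0;
}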


CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100
000
samples cum. samples % cum. % symbol name
164211 164211 10.9678 10.9678 init_file
155663 319874 10.3969 21.3647 d_alloc
147596 467470 9.8581 31.2228 _atomic_dec_and_lock
92993 560463 6.2111 37.4339 inet_create
73495 633958 4.9088 42.3427 kmem_cache_alloc
46353 680311 3.0960 45.4387 dentry_iput
46042 726353 3.0752 48.5139 tcp_close
42784 769137 2.8576 51.3715 kmem_cache_free
37074 806211 2.4762 53.8477 wake_up_inode
36375 842586 2.4295 56.2772 tcp_v4_init_sock
35212 877798 2.3518 58.6291 inotify_d_instantiate
33199 910997 2.2174 60.8465 sysenter_past_esp
31161 942158 2.0813 62.9277 d_instantiate
31000 973158 2.0705 64.9983 generic_forget_inode
28020 1001178 1.8715 66.8698 vfs_dq_drop
19007 1020185 1.2695 68.1393 __copy_from_user_ll
17513 1037698 1.1697 69.3090 new_inode
16957 1054655 1.1326 70.4415 __init_timer
16897 1071552 1.1286 71.5701 discard_slab
16115 1087667 1.0763 72.6464 d_kill
15542 1103209 1.0381 73.6845 __percpu_counter_add
13562 1116771 0.9058 74.5903 __slab_free
13276 1130047 0.8867 75.4771 __fput
12423 1142470 0.8297 76.3068 new_slab
11976 1154446 0.7999 77.1067 tcp_v4_destroy_sock
10889 1165335 0.7273 77.8340 inet_csk_destroy_sock
10516 1175851 0.7024 78.5364 alloc_inode
9979 1185830 0.6665 79.2029 sock_attach_fd
7980 1193810 0.5330 79.7359 drop_file_write_access
7609 1201419 0.5082 80.2441 alloc_fd
7584 1209003 0.5065 80.7506 sock_init_data
7164 1216167 0.4785 81.2291 add_partial
7107 1223274 0.4747 81.7038 sys_close
6997 1230271 0.4673 82.1711 mwait_idle

2008-11-21 15:14:25

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] fs: pipe/sockets/anon dentries should not have a parent

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..22cce87 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -92,7 +92,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
this.name = name;
this.len = strlen(name);
this.hash = 0;
- dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+ dentry = d_alloc(NULL, &this);
if (!dentry)
goto err_put_unused_fd;

diff --git a/fs/dnotify.c b/fs/dnotify.c
index 676073b..66066a3 100644
--- a/fs/dnotify.c
+++ b/fs/dnotify.c
@@ -173,7 +173,7 @@ void dnotify_parent(struct dentry *dentry, unsigned long event)

spin_lock(&dentry->d_lock);
parent = dentry->d_parent;
- if (parent->d_inode->i_dnotify_mask & event) {
+ if (parent && parent->d_inode->i_dnotify_mask & event) {
dget(parent);
spin_unlock(&dentry->d_lock);
__inode_dir_notify(parent->d_inode, event);
diff --git a/fs/inotify.c b/fs/inotify.c
index 7bbed1b..9f051bb 100644
--- a/fs/inotify.c
+++ b/fs/inotify.c
@@ -270,7 +270,7 @@ void inotify_d_instantiate(struct dentry *entry, struct inode *inode)

spin_lock(&entry->d_lock);
parent = entry->d_parent;
- if (parent->d_inode && inotify_inode_watched(parent->d_inode))
+ if (parent && parent->d_inode && inotify_inode_watched(parent->d_inode))
entry->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
spin_unlock(&entry->d_lock);
}
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4b961bc 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -926,7 +926,7 @@ struct file *create_write_pipe(int flags)
goto err;

err = -ENOMEM;
- dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+ dentry = d_alloc(NULL, &name);
if (!dentry)
goto err_inode;

diff --git a/net/socket.c b/net/socket.c
index e9d65ea..b84de7d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -373,7 +373,7 @@ static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
struct dentry *dentry;
struct qstr name = { .name = "" };

- dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+ dentry = d_alloc(NULL, &name);
if (unlikely(!dentry))
return -ENOMEM;


Attachments:
null_parent.patch (2.03 kB)

2008-11-21 15:22:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent


* Eric Dumazet <[email protected]> wrote:

> Before patch, time to run 8 million close(socket()) calls on 8
> CPUs was :
>
> real 0m27.496s
> user 0m0.657s
> sys 3m39.092s
>
> After patch :
>
> real 0m23.997s
> user 0m0.682s
> sys 3m11.193s

cool :-)

What would it take to get it down to:

>> Cost if run on one cpu :
>>
>> real 0m1.561s
>> user 0m0.092s
>> sys 0m1.469s

i guess asking for a wall-clock cost of 1.561/8 would be too much? :)

Ingo

2008-11-21 15:29:29

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent

Ingo Molnar wrote:
> * Eric Dumazet <[email protected]> wrote:
>
>> Before patch, time to run 8 million close(socket()) calls on 8
>> CPUs was :
>>
>> real 0m27.496s
>> user 0m0.657s
>> sys 3m39.092s
>>
>> After patch :
>>
>> real 0m23.997s
>> user 0m0.682s
>> sys 3m11.193s
>
> cool :-)
>
> What would it take to get it down to:
>
>>> Cost if run on one cpu :
>>>
>>> real 0m1.561s
>>> user 0m0.092s
>>> sys 0m1.469s
>
> i guess asking for a wall-clock cost of 1.561/8 would be too much? :)
>

It might be possible, depending on the level of hackery I am allowed to inject
in fs/dcache.c and fs/inode.c :)

wall cost of 1.56s (each cpu runs one loop of one million iterations)

2008-11-21 15:36:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent


* Eric Dumazet <[email protected]> wrote:

> Ingo Molnar wrote:
>> * Eric Dumazet <[email protected]> wrote:
>>
>>> Before patch, time to run 8 million close(socket()) calls on 8
>>> CPUs was :
>>>
>>> real 0m27.496s
>>> user 0m0.657s
>>> sys 3m39.092s
>>>
>>> After patch :
>>>
>>> real 0m23.997s
>>> user 0m0.682s
>>> sys 3m11.193s
>>
>> cool :-)
>>
>> What would it take to get it down to:
>>
>>>> Cost if run on one cpu :
>>>>
>>>> real 0m1.561s
>>>> user 0m0.092s
>>>> sys 0m1.469s
>>
>> i guess asking for a wall-clock cost of 1.561/8 would be too much? :)
>>
>
> It might be possible, depending on the level of hackery I am allowed
> to inject in fs/dcache.c and fs/inode.c :)

I think being able to open+close sockets in a scalable way is an
undisputed prime-time workload on Linux. The numbers you showed look
horrible.

Once you can show how much faster it could go via hacks, it should
only be a matter of time to achieve that safely and cleanly.

> wall cost of 1.56 (each cpu runs one loop of one million iterations)

(indeed.)

Ingo

2008-11-21 15:37:25

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent

On Fri, Nov 21, 2008 at 04:13:38PM +0100, Eric Dumazet wrote:
> [PATCH] fs: pipe/sockets/anon dentries should not have a parent
>
> Linking pipe/sockets/anon dentries to one root 'parent' has no functional
> impact at all, but it has a scalability impact.
>
> We can avoid touching a cache line at allocation stage (inside d_alloc(), no need
> to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count)
> We avoid an expensive atomic_dec_and_lock() call on the root dentry.
>
> If we correct dnotify_parent() and inotify_d_instantiate() to take into account
> a NULL d_parent, we can call d_alloc() with a NULL parent instead of root dentry.

Sorry folks, but a NULL d_parent is a no-go from the VFS perspective,
but you can set d_parent to the dentry itself which is the magic used
for root of tree dentries. They should also be marked
DCACHE_DISCONNECTED to make sure this is not unexpected.

And this kind of stuff really needs to go through -fsdevel.

2008-11-21 16:12:51

by Christoph Lameter

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

On Fri, 21 Nov 2008, Ingo Molnar wrote:

> > 2.6.22:
> > Throughput 2526.15 MB/sec 8 procs
> > 2.6.28-rc5:
> > Throughput 2486.2 MB/sec 8 procs
> >
> > 8p Dell 1950 and the number of processors specified on the tbench
> > command line.
>
> ... so might be worth a test. Just to satisfy our curiosity and to
> possibly close the entry :-)

Ahh.. Wow.... net-next gets us:

Throughput 2685.17 MB/sec 8 procs

2008-11-21 17:59:46

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..9fd0515 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -71,7 +71,6 @@ static struct dentry_operations anon_inodefs_dentry_operations = {
int anon_inode_getfd(const char *name, const struct file_operations *fops,
void *priv, int flags)
{
- struct qstr this;
struct dentry *dentry;
struct file *file;
int error, fd;
@@ -89,10 +88,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
* using the inode sequence number.
*/
error = -ENOMEM;
- this.name = name;
- this.len = strlen(name);
- this.hash = 0;
- dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+ dentry = d_alloc_unhashed(name, anon_inode_inode);
if (!dentry)
goto err_put_unused_fd;

@@ -104,9 +100,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
atomic_inc(&anon_inode_inode->i_count);

dentry->d_op = &anon_inodefs_dentry_operations;
- /* Do not publish this dentry inside the global dentry hash table */
- dentry->d_flags &= ~DCACHE_UNHASHED;
- d_instantiate(dentry, anon_inode_inode);

error = -ENFILE;
file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index a1d86c7..a5477fd 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1111,6 +1111,37 @@ struct dentry * d_alloc_root(struct inode * root_inode)
return res;
}

+/**
+ * d_alloc_unhashed - allocate unhashed dentry
+ * @inode: inode to allocate the dentry for
+ * @name: dentry name
+ *
+ * Allocate an unhashed dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory. Unhashed dentries have themselves as a parent.
+ */
+
+struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
+{
+ struct qstr q = { .name = name, .len = strlen(name) };
+ struct dentry *res;
+
+ res = d_alloc(NULL, &q);
+ if (res) {
+ res->d_sb = inode->i_sb;
+ res->d_parent = res;
+ /*
+ * We dont want to push this dentry into global dentry hash table.
+ * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
+ * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon
+ */
+ res->d_flags &= ~DCACHE_UNHASHED;
+ res->d_flags |= DCACHE_DISCONNECTED;
+ d_instantiate(res, inode);
+ }
+ return res;
+}
+
static inline struct hlist_head *d_hash(struct dentry *parent,
unsigned long hash)
{
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..29fcac2 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -918,7 +918,6 @@ struct file *create_write_pipe(int flags)
struct inode *inode;
struct file *f;
struct dentry *dentry;
- struct qstr name = { .name = "" };

err = -ENFILE;
inode = get_pipe_inode();
@@ -926,18 +925,11 @@ struct file *create_write_pipe(int flags)
goto err;

err = -ENOMEM;
- dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+ dentry = d_alloc_unhashed("", inode);
if (!dentry)
goto err_inode;

dentry->d_op = &pipefs_dentry_operations;
- /*
- * We dont want to publish this dentry into global dentry hash table.
- * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
- * This permits a working /proc/$pid/fd/XXX on pipes
- */
- dentry->d_flags &= ~DCACHE_UNHASHED;
- d_instantiate(dentry, inode);

err = -ENFILE;
f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..12438d6 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -238,6 +238,7 @@ extern int d_invalidate(struct dentry *);

/* only used at mount-time */
extern struct dentry * d_alloc_root(struct inode *);
+extern struct dentry * d_alloc_unhashed(const char *, struct inode *);

/* <clickety>-<click> the ramfs-type tree */
extern void d_genocide(struct dentry *);
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..b659b5d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -371,20 +371,12 @@ static int sock_alloc_fd(struct file **filep, int flags)
static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
{
struct dentry *dentry;
- struct qstr name = { .name = "" };

- dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+ dentry = d_alloc_unhashed("", SOCK_INODE(sock));
if (unlikely(!dentry))
return -ENOMEM;

dentry->d_op = &sockfs_dentry_operations;
- /*
- * We dont want to push this dentry into global dentry hash table.
- * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
- * This permits a working /proc/$pid/fd/XXX on sockets
- */
- dentry->d_flags &= ~DCACHE_UNHASHED;
- d_instantiate(dentry, SOCK_INODE(sock));

sock->file = file;
init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,


Attachments:
d_alloc_unhashed.patch (4.62 kB)

2008-11-21 18:07:24

by Christoph Lameter

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

AIM9 results:
                   TCP          UDP
2.6.22        104868.00    489970.03
2.6.28-rc5    110007.00    518640.00
net-next      108207.00    514790.00

net-next loses here for some reason against 2.6.28-rc5. But the numbers
are better than 2.6.22 in any case.



2008-11-21 18:17:39

by Eric Dumazet

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Christoph Lameter wrote:
> AIM9 results:
> TCP UDP
> 2.6.22 104868.00 489970.03
> 2.6.28-rc5 110007.00 518640.00
> net-next 108207.00 514790.00
>
> net-next loses here for some reason against 2.6.28-rc5. But the numbers
> are better than 2.6.22 in any case.
>

I found that on current net-next, running oprofile in the background can give
better bench results. That's really curious, no?


So the single loop on close(socket()), run on all my 8 cpus, is almost 10%
faster if oprofile is running... (20 secs instead of 23 secs)

2008-11-21 18:20:11

by Eric Dumazet

[permalink] [raw]
Subject: Re: [Bug #11308] tbench regression on each kernel release from 2.6.22 -> 2.6.28

Eric Dumazet wrote:
> Christoph Lameter wrote:
>> AIM9 results:
>> TCP UDP
>> 2.6.22 104868.00 489970.03
>> 2.6.28-rc5 110007.00 518640.00
>> net-next 108207.00 514790.00
>>
>> net-next loses here for some reason against 2.6.28-rc5. But the numbers
>> are better than 2.6.22 in any case.
>>
>
> I found that on current net-next, running oprofile in the background can
> give better bench results. That's really curious, no?
>
>
> So the single loop on close(socket()), run on all my 8 cpus, is almost 10%
> faster if oprofile is running... (20 secs instead of 23 secs)
>

Oh well, that's normal: when a cpu is interrupted by an NMI and
distracted by oprofile code, it doesn't fight with other cpus over dcache_lock
and other contended cache lines...

2008-11-21 18:44:20

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent

On Fri, Nov 21, 2008 at 06:58:29PM +0100, Eric Dumazet wrote:
> +/**
> + * d_alloc_unhashed - allocate unhashed dentry
> + * @inode: inode to allocate the dentry for
> + * @name: dentry name

It's normal to list the parameters in the order they're passed to the
function. Not sure if we have a tool that checks for this or not --
Randy?

> + *
> + * Allocate an unhashed dentry for the inode given. The inode is
> + * instantiated and returned. %NULL is returned if there is insufficient
> + * memory. Unhashed dentries have themselves as a parent.
> + */
> +
> +struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
> +{
> + struct qstr q = { .name = name, .len = strlen(name) };
> + struct dentry *res;
> +
> + res = d_alloc(NULL, &q);
> + if (res) {
> + res->d_sb = inode->i_sb;
> + res->d_parent = res;
> + /*
> + * We dont want to push this dentry into global dentry hash table.
> + * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
> + * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon
> + */

Line length ... as checkpatch would have warned you ;-)

And there are several other grammatical nitpicks with this comment. Try
this:

/*
* We don't want to put this dentry in the global dentry
* hash table, so we pretend the dentry is already hashed
* by unsetting DCACHE_UNHASHED. This permits
* /proc/$pid/fd/XXX to work for sockets, pipes and
* anonymous files (signalfd, timerfd, etc).
*/

> + res->d_flags &= ~DCACHE_UNHASHED;
> + res->d_flags |= DCACHE_DISCONNECTED;

Is this really better than:

res->d_flags = (res->d_flags & ~DCACHE_UNHASHED) | DCACHE_DISCONNECTED;

Anyway, nice cleanup.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-11-23 03:54:12

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..9fd0515 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -71,7 +71,6 @@ static struct dentry_operations anon_inodefs_dentry_operations = {
int anon_inode_getfd(const char *name, const struct file_operations *fops,
void *priv, int flags)
{
- struct qstr this;
struct dentry *dentry;
struct file *file;
int error, fd;
@@ -89,10 +88,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
* using the inode sequence number.
*/
error = -ENOMEM;
- this.name = name;
- this.len = strlen(name);
- this.hash = 0;
- dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+ dentry = d_alloc_unhashed(name, anon_inode_inode);
if (!dentry)
goto err_put_unused_fd;

@@ -104,9 +100,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
atomic_inc(&anon_inode_inode->i_count);

dentry->d_op = &anon_inodefs_dentry_operations;
- /* Do not publish this dentry inside the global dentry hash table */
- dentry->d_flags &= ~DCACHE_UNHASHED;
- d_instantiate(dentry, anon_inode_inode);

error = -ENFILE;
file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index a1d86c7..43ef88d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1111,6 +1111,39 @@ struct dentry * d_alloc_root(struct inode * root_inode)
return res;
}

+/**
+ * d_alloc_unhashed - allocate unhashed dentry
+ * @name: dentry name
+ * @inode: inode to allocate the dentry for
+ *
+ * Allocate an unhashed dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory. Unhashed dentries have themselves as a parent.
+ */
+
+struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
+{
+ struct qstr q = { .name = name, .len = strlen(name) };
+ struct dentry *res;
+
+ res = d_alloc(NULL, &q);
+ if (res) {
+ res->d_sb = inode->i_sb;
+ res->d_parent = res;
+ /*
+ * We don't want to push this dentry into the global dentry
+ * hash table, so we pretend the dentry is already hashed
+ * by unsetting DCACHE_UNHASHED. This permits
+ * /proc/$pid/fd/XXX to work for sockets, pipes, and
+ * anonymous files (signalfd, timerfd, ...)
+ */
+ res->d_flags &= ~DCACHE_UNHASHED;
+ res->d_flags |= DCACHE_DISCONNECTED;
+ d_instantiate(res, inode);
+ }
+ return res;
+}
+
static inline struct hlist_head *d_hash(struct dentry *parent,
unsigned long hash)
{
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..29fcac2 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -918,7 +918,6 @@ struct file *create_write_pipe(int flags)
struct inode *inode;
struct file *f;
struct dentry *dentry;
- struct qstr name = { .name = "" };

err = -ENFILE;
inode = get_pipe_inode();
@@ -926,18 +925,11 @@ struct file *create_write_pipe(int flags)
goto err;

err = -ENOMEM;
- dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+ dentry = d_alloc_unhashed("", inode);
if (!dentry)
goto err_inode;

dentry->d_op = &pipefs_dentry_operations;
- /*
- * We dont want to publish this dentry into global dentry hash table.
- * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
- * This permits a working /proc/$pid/fd/XXX on pipes
- */
- dentry->d_flags &= ~DCACHE_UNHASHED;
- d_instantiate(dentry, inode);

err = -ENFILE;
f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..12438d6 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -238,6 +238,7 @@ extern int d_invalidate(struct dentry *);

/* only used at mount-time */
extern struct dentry * d_alloc_root(struct inode *);
+extern struct dentry * d_alloc_unhashed(const char *, struct inode *);

/* <clickety>-<click> the ramfs-type tree */
extern void d_genocide(struct dentry *);
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..b659b5d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -371,20 +371,12 @@ static int sock_alloc_fd(struct file **filep, int flags)
static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
{
struct dentry *dentry;
- struct qstr name = { .name = "" };

- dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+ dentry = d_alloc_unhashed("", SOCK_INODE(sock));
if (unlikely(!dentry))
return -ENOMEM;

dentry->d_op = &sockfs_dentry_operations;
- /*
- * We dont want to push this dentry into global dentry hash table.
- * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
- * This permits a working /proc/$pid/fd/XXX on sockets
- */
- dentry->d_flags &= ~DCACHE_UNHASHED;
- d_instantiate(dentry, SOCK_INODE(sock));

sock->file = file;
init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,


Attachments:
d_alloc_unhashed2.patch (4.68 kB)

2008-11-26 23:29:32

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP

Hi all

Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(from 27.5 seconds to 1.6 seconds)

Long version :

To allocate a socket or a pipe, we:

0) Do the usual file table manipulation (pretty scalable these days,
but would be faster if 'struct files' used SLAB_DESTROY_BY_RCU
and avoided the call_rcu() cache killer)

1) allocate an inode with new_inode()
This function :
- locks inode_lock,
- dirties nr_inodes counter
- dirties inode_in_use list (for sockets/pipes, this is useless)
- dirties superblock s_inodes.
- dirties last_ino counter
All of these live in different cache lines, unfortunately.

2) allocate a dentry
d_alloc() takes dcache_lock,
inserts the dentry on its parent's list (dirtying sock_mnt->mnt_sb->s_root)
dirties nr_dentry

3) d_instantiate() dentry (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount


At close() time, we must undo all of this. It is even more expensive, because
of the heavily stressed _atomic_dec_and_lock(), and because deleting an
element from a list touches two extra cache lines (the previous and next
items).

This is really bad, since sockets/pipes don't need to be visible in the
dcache or on a per-superblock inode list.

This patch series gets rid of all contended cache lines for sockets, pipes
and anonymous fds (signalfd, timerfd, ...)

Sample program :

for (i = 0; i < 1000000; i++)
close(socket(AF_INET, SOCK_STREAM, 0));
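
(For reference: the socket8 harness itself was not posted; a self-contained
reconstruction might look like the sketch below, where the hardcoded process
count of 8 is an assumption taken from the benchmark name.)

/*
 * socket8.c -- hypothetical reconstruction of the benchmark:
 * 8 processes, each doing 1 million socket()+close() pairs.
 * Build: gcc -O2 -o socket8 socket8.c
 */
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int i, p;

	for (p = 0; p < 8; p++) {
		if (fork() == 0) {
			for (i = 0; i < 1000000; i++)
				close(socket(AF_INET, SOCK_STREAM, 0));
			_exit(0);
		}
	}
	while (wait(NULL) > 0)	/* reap all children */
		;
	return 0;
}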

Cost if one cpu runs the program :

real 1.561s
user 0.092s
sys 1.469s

Cost if 8 processes are launched on an 8 CPU machine
(benchmark named socket8) :

real 27.496s <<<< !!!! >>>>
user 0.657s
sys 3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
3347352 3347352 28.0232 28.0232 _atomic_dec_and_lock
3301428 6648780 27.6388 55.6620 d_instantiate
2971130 9619910 24.8736 80.5355 d_alloc
241318 9861228 2.0203 82.5558 init_file
146190 10007418 1.2239 83.7797 __slab_free
144149 10151567 1.2068 84.9864 inotify_d_instantiate
143971 10295538 1.2053 86.1917 inet_create
137168 10432706 1.1483 87.3401 new_inode
117549 10550255 0.9841 88.3242 add_partial
110795 10661050 0.9275 89.2517 generic_drop_inode
107137 10768187 0.8969 90.1486 kmem_cache_alloc
94029 10862216 0.7872 90.9358 tcp_close
82837 10945053 0.6935 91.6293 dput
67486 11012539 0.5650 92.1943 dentry_iput
57751 11070290 0.4835 92.6778 iput
54327 11124617 0.4548 93.1326 tcp_v4_init_sock
49921 11174538 0.4179 93.5505 sysenter_past_esp
47616 11222154 0.3986 93.9491 kmem_cache_free
30792 11252946 0.2578 94.2069 clear_inode
27540 11280486 0.2306 94.4375 copy_from_user
26509 11306995 0.2219 94.6594 init_timer
26363 11333358 0.2207 94.8801 discard_slab
25284 11358642 0.2117 95.0918 __fput
22482 11381124 0.1882 95.2800 __percpu_counter_add
20369 11401493 0.1705 95.4505 sock_alloc
18501 11419994 0.1549 95.6054 inet_csk_destroy_sock
17923 11437917 0.1500 95.7555 sys_close


This patch series avoids all contended cache lines and makes this "bench"
pretty fast.


New cost if run on one cpu :

real 1.325s (instead of 1.561s)
user 0.091s
sys 1.234s


If run on 8 CPUS :

real 2.229s <<<< instead of 27.496s >>>
user 0.695s
sys 16.903s

Oprofile results (for the 8 process run, 3 times):
CPU: Core 2, speed 2999.74 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
143791 143791 11.7849 11.7849 __slab_free
128404 272195 10.5238 22.3087 add_partial
99150 371345 8.1262 30.4349 kmem_cache_alloc
52031 423376 4.2644 34.6993 sysenter_past_esp
47752 471128 3.9137 38.6130 kmem_cache_free
47429 518557 3.8872 42.5002 tcp_close
34376 552933 2.8174 45.3176 __percpu_counter_add
29046 581979 2.3806 47.6982 copy_from_user
28249 610228 2.3152 50.0134 init_timer
26220 636448 2.1490 52.1624 __slab_alloc
23402 659850 1.9180 54.0803 discard_slab
20560 680410 1.6851 55.7654 __call_rcu
18288 698698 1.4989 57.2643 d_alloc
16425 715123 1.3462 58.6104 get_empty_filp
16237 731360 1.3308 59.9412 __fput
15729 747089 1.2891 61.2303 alloc_fd
15021 762110 1.2311 62.4614 alloc_inode
14690 776800 1.2040 63.6654 sys_close
14666 791466 1.2020 64.8674 inet_create
13638 805104 1.1178 65.9852 dput
12503 817607 1.0247 67.0099 iput_special
12231 829838 1.0024 68.0123 lock_sock_nested
12210 842048 1.0007 69.0130 fd_install
12137 854185 0.9947 70.0078 d_alloc_special
12058 866243 0.9883 70.9960 sock_init_data
11200 877443 0.9179 71.9140 release_sock
11114 888557 0.9109 72.8248 inotify_d_instantiate

The last point is about SLUB being hit hard, unless we
use slub_min_order=3 at boot, or we use Christoph Lameter's
patch (struct file RCU optimizations)
http://thread.gmane.org/gmane.linux.kernel/418615

If we boot machine with slub_min_order=3, SLUB overhead disappears.

New cost if run on one cpu :

real 1.307s
user 0.094s
sys 1.214s

If run on 8 CPUS :

real 1.625s <<<< instead of 27.496s or 2.229s >>>
user 0.771s
sys 12.061s

Oprofile results (for the 8 process run, 3 times):
CPU: Core 2, speed 3000.05 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
108005 108005 11.0758 11.0758 kmem_cache_alloc
52023 160028 5.3349 16.4107 sysenter_past_esp
47363 207391 4.8570 21.2678 tcp_close
45430 252821 4.6588 25.9266 kmem_cache_free
36566 289387 3.7498 29.6764 __percpu_counter_add
36085 325472 3.7005 33.3769 __slab_free
29185 354657 2.9929 36.3698 copy_from_user
28210 382867 2.8929 39.2627 init_timer
25663 408530 2.6317 41.8944 d_alloc_special
22360 430890 2.2930 44.1874 cap_file_alloc_security
19237 450127 1.9727 46.1601 __call_rcu
19097 469224 1.9584 48.1185 d_alloc
16962 486186 1.7394 49.8580 alloc_fd
16315 502501 1.6731 51.5311 __fput
16102 518603 1.6512 53.1823 get_empty_filp
14954 533557 1.5335 54.7158 inet_create
14468 548025 1.4837 56.1995 alloc_inode
14198 562223 1.4560 57.6555 sys_close
13905 576128 1.4259 59.0814 dput
12262 588390 1.2575 60.3389 lock_sock_nested
12203 600593 1.2514 61.5903 sock_attach_fd
12147 612740 1.2457 62.8360 iput_special
12049 624789 1.2356 64.0716 fd_install
12033 636822 1.2340 65.3056 sock_init_data
11999 648821 1.2305 66.5361 release_sock
11231 660052 1.1517 67.6878 inotify_d_instantiate
11068 671120 1.1350 68.8228 inet_csk_destroy_sock


This patch series contains 6 patches, against the net-next-2.6 tree
(because this tree already contains network improvements on this
subject), but it should apply to other trees.

[PATCH 1/6] fs: Introduce a per_cpu nr_dentry

Adding a per_cpu nr_dentry avoids cache line ping pongs between
cpus to maintain this metric.

We centralize decrements of nr_dentry in d_free(),
and increments in d_alloc().

d_alloc() can avoid taking dcache_lock if parent is NULL


[PATCH 2/6] fs: Introduce special dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into the dcache,
since there is no way someone can look up such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries is expensive,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SPECIAL flag, to mark a dentry as
a special one (for sockets, pipes, anonymous fd), and a new
d_alloc_special(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_special() for
special dentries.

Differences between a special dentry and a normal one are:

1) Special dentry has the DCACHE_SPECIAL flag
2) A special dentry is its own parent.
 This avoids taking a reference on the 'root' dentry, which is
 shared by too many dentries.
3) They are not hashed into the global hash table
4) Their d_alias list is empty

Internally, dput() can avoid an expensive atomic_dec_and_lock()
for special dentries.


(socket8 bench result : from 27.5s to 25.5s)
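
(The diff for patch 2/6 is not reproduced in this excerpt; extrapolating from
the d_alloc_unhashed() helper posted earlier in the thread, the new helper
presumably looks roughly like this sketch. The DCACHE_SPECIAL flag and the
exact flag handling are assumptions based on the changelog above.)

/*
 * Sketch of d_alloc_special(), extrapolated from d_alloc_unhashed():
 * parentless, self-parented, unhashed, and tagged DCACHE_SPECIAL so
 * dput() can take the dput_special() fast path.
 */
struct dentry *d_alloc_special(const struct qstr *name, struct inode *inode)
{
	struct dentry *res = d_alloc(NULL, name);

	if (res) {
		res->d_sb = inode->i_sb;
		res->d_parent = res;	/* own parent: no ref on s_root */
		/* pretend hashed so /proc/$pid/fd/XXX keeps working */
		res->d_flags &= ~DCACHE_UNHASHED;
		res->d_flags |= DCACHE_SPECIAL | DCACHE_DISCONNECTED;
		d_instantiate(res, inode);
	}
	return res;
}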

[PATCH 3/6] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get inode numbers.

Solve this problem by providing each cpu with a per_cpu variable,
fed from the shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino.

Note : last_ino_get() method must be called with preemption disabled.

(socket8 bench result : 25.5s to 25s almost no differences, but
this is because inode_lock cost is too heavy for the moment)

[PATCH 4/6] fs: Introduce a per_cpu nr_inodes

Avoids cache line ping-pongs between cpus and prepares the next patch,
because updates of nr_inodes no longer need inode_lock.

(socket8 bench result : 25s to 20.5s)

[PATCH 5/6] fs: Introduce special inodes

The goal of this patch is to avoid touching inode_lock for socket/pipe/anonfd
inode allocation/freeing.

In new_inode(), we test if the super block has the MS_SPECIAL flag set.
If so, we don't put the inode on the "inode_in_use" list nor the "sb->s_inodes"
list. As inode_lock was taken only to protect these lists, we avoid it as well.

Using iput_special() from dput_special() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying
three contended cache lines in new_inode(), and five cache lines
in iput().

Note: Not sure if we can use MS_SPECIAL=MS_NOUSER, or if we
really need a different flag.

(socket8 bench result : from 20.5s to 2.94s)

[PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
refcounting on permanent system vfs.
Use this function for sockets, pipes, anonymous fds.

(socket8 bench result : from 2.94s to 2.23s)

Signed-off-by: Eric Dumazet <[email protected]>
---
Overall diffstat :

fs/anon_inodes.c | 19 +-----
fs/dcache.c | 106 ++++++++++++++++++++++++++++++++-------
fs/fs-writeback.c | 2
fs/inode.c | 101 +++++++++++++++++++++++++++++++------
fs/pipe.c | 28 +---------
fs/super.c | 9 +++
include/linux/dcache.h | 2
include/linux/fs.h | 8 ++
include/linux/mount.h | 5 +
kernel/sysctl.c | 6 +-
mm/page-writeback.c | 2
net/socket.c | 27 +--------
12 files changed, 212 insertions(+), 103 deletions(-)

2008-11-26 23:31:47

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH 1/6] fs: Introduce a per_cpu nr_dentry

diff --git a/fs/dcache.c b/fs/dcache.c
index a1d86c7..42ed9fc 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -61,12 +61,38 @@ static struct kmem_cache *dentry_cache __read_mostly;
static unsigned int d_hash_mask __read_mostly;
static unsigned int d_hash_shift __read_mostly;
static struct hlist_head *dentry_hashtable __read_mostly;
+static DEFINE_PER_CPU(int, nr_dentry);

/* Statistics gathering. */
struct dentry_stat_t dentry_stat = {
.age_limit = 45,
};

+/*
+ * Handle nr_dentry sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int cpu;
+ int counter = 0;
+
+ for_each_possible_cpu(cpu)
+ counter += per_cpu(nr_dentry, cpu);
+ if (counter < 0)
+ counter = 0;
+ dentry_stat.nr_dentry = counter;
+ return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return -ENOSYS;
+}
+#endif
+
static void __d_free(struct dentry *dentry)
{
WARN_ON(!list_empty(&dentry->d_alias));
@@ -82,8 +108,7 @@ static void d_callback(struct rcu_head *head)
}

/*
- * no dcache_lock, please. The caller must decrement dentry_stat.nr_dentry
- * inside dcache_lock.
+ * no dcache_lock, please.
*/
static void d_free(struct dentry *dentry)
{
@@ -94,6 +119,8 @@ static void d_free(struct dentry *dentry)
__d_free(dentry);
else
call_rcu(&dentry->d_u.d_rcu, d_callback);
+ get_cpu_var(nr_dentry)--;
+ put_cpu_var(nr_dentry);
}

/*
@@ -172,7 +199,6 @@ static struct dentry *d_kill(struct dentry *dentry)
struct dentry *parent;

list_del(&dentry->d_u.d_child);
- dentry_stat.nr_dentry--; /* For d_free, below */
/*drops the locks, at that point nobody can reach this dentry */
dentry_iput(dentry);
if (IS_ROOT(dentry))
@@ -619,7 +645,6 @@ void shrink_dcache_sb(struct super_block * sb)
static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
{
struct dentry *parent;
- unsigned detached = 0;

BUG_ON(!IS_ROOT(dentry));

@@ -678,7 +703,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
}

list_del(&dentry->d_u.d_child);
- detached++;

inode = dentry->d_inode;
if (inode) {
@@ -696,7 +720,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
* otherwise we ascend to the parent and move to the
* next sibling if there is one */
if (!parent)
- goto out;
+ return;

dentry = parent;

@@ -705,11 +729,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
dentry = list_entry(dentry->d_subdirs.next,
struct dentry, d_u.d_child);
}
-out:
- /* several dentries were freed, need to correct nr_dentry */
- spin_lock(&dcache_lock);
- dentry_stat.nr_dentry -= detached;
- spin_unlock(&dcache_lock);
}

/*
@@ -943,8 +962,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
dentry->d_flags = DCACHE_UNHASHED;
spin_lock_init(&dentry->d_lock);
dentry->d_inode = NULL;
- dentry->d_parent = NULL;
- dentry->d_sb = NULL;
dentry->d_op = NULL;
dentry->d_fsdata = NULL;
dentry->d_mounted = 0;
@@ -959,15 +976,17 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
if (parent) {
dentry->d_parent = dget(parent);
dentry->d_sb = parent->d_sb;
+ spin_lock(&dcache_lock);
+ list_add(&dentry->d_u.d_child, &parent->d_subdirs);
+ spin_unlock(&dcache_lock);
} else {
+ dentry->d_parent = NULL;
+ dentry->d_sb = NULL;
INIT_LIST_HEAD(&dentry->d_u.d_child);
}

- spin_lock(&dcache_lock);
- if (parent)
- list_add(&dentry->d_u.d_child, &parent->d_subdirs);
- dentry_stat.nr_dentry++;
- spin_unlock(&dcache_lock);
+ get_cpu_var(nr_dentry)++;
+ put_cpu_var(nr_dentry);

return dentry;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0dcdd94..c5e7aa5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2216,6 +2216,8 @@ static inline void free_secdata(void *secdata)
struct ctl_table;
int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos);

int get_filesystem_list(char * buf);

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9d048fa..eebddef 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1243,7 +1243,7 @@ static struct ctl_table fs_table[] = {
.data = &dentry_stat,
.maxlen = 6*sizeof(int),
.mode = 0444,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_nr_dentry,
},
{
.ctl_name = FS_OVERFLOWUID,


Attachments:
per_cpu_nr_dentry.patch (4.67 kB)

2008-11-26 23:33:31

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator

diff --git a/fs/inode.c b/fs/inode.c
index 0487ddb..d850050 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -534,6 +534,30 @@ repeat:
return node ? inode : NULL;
}

+#ifdef CONFIG_SMP
+/*
+ * each cpu owns a block of 1024 numbers.
+ * The global 'last_ino' is dirtied once every 1024 allocations
+ */
+static DEFINE_PER_CPU(int, cpu_ino_alloc) = {0};
+static int last_ino_get(void)
+{
+ static atomic_t last_ino;
+ int *ptr = &__raw_get_cpu_var(cpu_ino_alloc);
+
+ if (unlikely((*ptr & 1023) == 0))
+ *ptr = atomic_add_return(1024, &last_ino);
+ return --(*ptr);
+}
+#else
+static int last_ino_get(void)
+{
+ static int last_ino;
+
+ return ++last_ino;
+}
+#endif
+
/**
* new_inode - obtain an inode
* @sb: superblock
@@ -553,7 +577,6 @@ struct inode *new_inode(struct super_block *sb)
* error if st_ino won't fit in target struct field. Use 32bit counter
* here to attempt to avoid that.
*/
- static unsigned int last_ino;
struct inode * inode;

spin_lock_prefetch(&inode_lock);
@@ -564,7 +587,7 @@ struct inode *new_inode(struct super_block *sb)
inodes_stat.nr_inodes++;
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
- inode->i_ino = ++last_ino;
+ inode->i_ino = last_ino_get();
inode->i_state = 0;
spin_unlock(&inode_lock);
}


Attachments:
last_ino.patch (1.28 kB)

2008-11-26 23:33:52

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d0ff0b8..b591cdd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait)
unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);

wbc.nr_to_write = nr_dirty + nr_unstable +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused) +
+ (get_nr_inodes() - inodes_stat.nr_unused) +
nr_dirty + nr_unstable;
wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */
sync_sb_inodes(sb, &wbc);
diff --git a/fs/inode.c b/fs/inode.c
index d850050..8d8d40e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex);
* Statistics gathering..
*/
struct inodes_stat_t inodes_stat;
+static DEFINE_PER_CPU(int, nr_inodes);

static struct kmem_cache * inode_cachep __read_mostly;

+int get_nr_inodes(void)
+{
+ int cpu;
+ int counter = 0;
+
+ for_each_possible_cpu(cpu)
+ counter += per_cpu(nr_inodes, cpu);
+ if (counter < 0)
+ counter = 0;
+ return counter;
+}
+
+/*
+ * Handle nr_inodes sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ inodes_stat.nr_inodes = get_nr_inodes();
+ return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return -ENOSYS;
+}
+#endif
+
static void wake_up_inode(struct inode *inode)
{
/*
@@ -306,9 +337,8 @@ static void dispose_list(struct list_head *head)
destroy_inode(inode);
nr_disposed++;
}
- spin_lock(&inode_lock);
- inodes_stat.nr_inodes -= nr_disposed;
- spin_unlock(&inode_lock);
+ get_cpu_var(nr_inodes) -= nr_disposed;
+ put_cpu_var(nr_inodes);
}

/*
@@ -584,10 +614,11 @@ struct inode *new_inode(struct super_block *sb)
inode = alloc_inode(sb);
if (inode) {
spin_lock(&inode_lock);
- inodes_stat.nr_inodes++;
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
+ get_cpu_var(nr_inodes)++;
inode->i_ino = last_ino_get();
+ put_cpu_var(nr_inodes);
inode->i_state = 0;
spin_unlock(&inode_lock);
}
@@ -645,7 +676,8 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h
if (set(inode, data))
goto set_failed;

- inodes_stat.nr_inodes++;
+ get_cpu_var(nr_inodes)++;
+ put_cpu_var(nr_inodes);
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
hlist_add_head(&inode->i_hash, head);
@@ -694,7 +726,8 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he
old = find_inode_fast(sb, head, ino);
if (!old) {
inode->i_ino = ino;
- inodes_stat.nr_inodes++;
+ get_cpu_var(nr_inodes)++;
+ put_cpu_var(nr_inodes);
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
hlist_add_head(&inode->i_hash, head);
@@ -1065,8 +1098,9 @@ void generic_delete_inode(struct inode *inode)
list_del_init(&inode->i_list);
list_del_init(&inode->i_sb_list);
inode->i_state |= I_FREEING;
- inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
+ get_cpu_var(nr_inodes)--;
+ put_cpu_var(nr_inodes);

security_inode_delete(inode);

@@ -1116,8 +1150,9 @@ static void generic_forget_inode(struct inode *inode)
list_del_init(&inode->i_list);
list_del_init(&inode->i_sb_list);
inode->i_state |= I_FREEING;
- inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
+ get_cpu_var(nr_inodes)--;
+ put_cpu_var(nr_inodes);
if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c5e7aa5..2482977 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -47,6 +47,7 @@ struct inodes_stat_t {
int dummy[5]; /* padding for sysctl ABI compatibility */
};
extern struct inodes_stat_t inodes_stat;
+extern int get_nr_inodes(void);

extern int leases_enable, lease_break_time;

@@ -2218,6 +2219,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos);
int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos);

int get_filesystem_list(char * buf);

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index eebddef..eebed01 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1202,7 +1202,7 @@ static struct ctl_table fs_table[] = {
.data = &inodes_stat,
.maxlen = 2*sizeof(int),
.mode = 0444,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_nr_inodes,
},
{
.ctl_name = FS_STATINODE,
@@ -1210,7 +1210,7 @@ static struct ctl_table fs_table[] = {
.data = &inodes_stat,
.maxlen = 7*sizeof(int),
.mode = 0444,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_nr_inodes,
},
{
.procname = "file-nr",
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2970e35..a71a922 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg)
next_jif = start_jif + dirty_writeback_interval;
nr_to_write = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused);
+ (get_nr_inodes() - inodes_stat.nr_unused);
while (nr_to_write > 0) {
wbc.more_io = 0;
wbc.encountered_congestion = 0;


Attachments:
per_cpu_nr_inodes.patch (5.57 kB)

2008-11-26 23:34:18

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH 5/6] fs: Introduce special inodes

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 4f20d48..a0212b3 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -158,6 +158,7 @@ static int __init anon_inode_init(void)
error = PTR_ERR(anon_inode_mnt);
goto err_unregister_filesystem;
}
+ anon_inode_mnt->mnt_sb->s_flags |= MS_SPECIAL;
anon_inode_inode = anon_inode_mkinode();
if (IS_ERR(anon_inode_inode)) {
error = PTR_ERR(anon_inode_inode);
diff --git a/fs/dcache.c b/fs/dcache.c
index d73763b..bade7d7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -239,7 +239,7 @@ static void dput_special(struct dentry *dentry)
return;
inode = dentry->d_inode;
if (inode)
- iput(inode);
+ iput_special(inode);
d_free(dentry);
}

diff --git a/fs/inode.c b/fs/inode.c
index 8d8d40e..1bb6553 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -228,6 +228,14 @@ void destroy_inode(struct inode *inode)
kmem_cache_free(inode_cachep, (inode));
}

+void iput_special(struct inode *inode)
+{
+ if (atomic_dec_and_test(&inode->i_count)) {
+ destroy_inode(inode);
+ get_cpu_var(nr_inodes)--;
+ put_cpu_var(nr_inodes);
+ }
+}

/*
* These are initializations that only need to be done
@@ -609,18 +617,21 @@ struct inode *new_inode(struct super_block *sb)
*/
struct inode * inode;

- spin_lock_prefetch(&inode_lock);
-
inode = alloc_inode(sb);
if (inode) {
- spin_lock(&inode_lock);
- list_add(&inode->i_list, &inode_in_use);
- list_add(&inode->i_sb_list, &sb->s_inodes);
+ inode->i_state = 0;
+ if (sb->s_flags & MS_SPECIAL) {
+ INIT_LIST_HEAD(&inode->i_list);
+ INIT_LIST_HEAD(&inode->i_sb_list);
+ } else {
+ spin_lock(&inode_lock);
+ list_add(&inode->i_list, &inode_in_use);
+ list_add(&inode->i_sb_list, &sb->s_inodes);
+ spin_unlock(&inode_lock);
+ }
get_cpu_var(nr_inodes)++;
inode->i_ino = last_ino_get();
put_cpu_var(nr_inodes);
- inode->i_state = 0;
- spin_unlock(&inode_lock);
}
return inode;
}
diff --git a/fs/pipe.c b/fs/pipe.c
index 5cc132a..6fca681 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1078,7 +1078,8 @@ static int __init init_pipe_fs(void)
if (IS_ERR(pipe_mnt)) {
err = PTR_ERR(pipe_mnt);
unregister_filesystem(&pipe_fs_type);
- }
+ } else
+ pipe_mnt->mnt_sb->s_flags |= MS_SPECIAL;
}
return err;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2482977..dd0e8a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -136,6 +136,7 @@ extern int dir_notify_enable;
#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
+#define MS_SPECIAL (1<<24) /* special fs (inodes not in sb->s_inodes) */
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)

@@ -1898,6 +1899,7 @@ extern void __iget(struct inode * inode);
extern void iget_failed(struct inode *);
extern void clear_inode(struct inode *);
extern void destroy_inode(struct inode *);
+extern void iput_special(struct inode *inode);
extern struct inode *new_inode(struct super_block *);
extern int should_remove_suid(struct dentry *);
extern int file_remove_suid(struct file *);
diff --git a/net/socket.c b/net/socket.c
index f41b6c6..4177456 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2205,6 +2205,7 @@ static int __init sock_init(void)
init_inodecache();
register_filesystem(&sock_fs_type);
sock_mnt = kern_mount(&sock_fs_type);
+ sock_mnt->mnt_sb->s_flags |= MS_SPECIAL;

/* The real protocol initialization is performed in later initcalls.
*/


Attachments:
special_inodes.patch (3.47 kB)

2008-11-26 23:34:36

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index a0212b3..42dfe28 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -153,7 +153,7 @@ static int __init anon_inode_init(void)
error = register_filesystem(&anon_inode_fs_type);
if (error)
goto err_exit;
- anon_inode_mnt = kern_mount(&anon_inode_fs_type);
+ anon_inode_mnt = kern_mount_special(&anon_inode_fs_type);
if (IS_ERR(anon_inode_mnt)) {
error = PTR_ERR(anon_inode_mnt);
goto err_unregister_filesystem;
diff --git a/fs/pipe.c b/fs/pipe.c
index 6fca681..391d4fe 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1074,7 +1074,7 @@ static int __init init_pipe_fs(void)
int err = register_filesystem(&pipe_fs_type);

if (!err) {
- pipe_mnt = kern_mount(&pipe_fs_type);
+ pipe_mnt = kern_mount_special(&pipe_fs_type);
if (IS_ERR(pipe_mnt)) {
err = PTR_ERR(pipe_mnt);
unregister_filesystem(&pipe_fs_type);
diff --git a/fs/super.c b/fs/super.c
index 400a760..a8e14f7 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -982,3 +982,12 @@ struct vfsmount *kern_mount_data(struct file_system_type *type, void *data)
}

EXPORT_SYMBOL_GPL(kern_mount_data);
+
+struct vfsmount *kern_mount_special(struct file_system_type *type)
+{
+ struct vfsmount *res = kern_mount_data(type, NULL);
+
+ if (!IS_ERR(res))
+ res->mnt_flags |= MNT_SPECIAL;
+ return res;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd0e8a5..a92544a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1591,6 +1591,7 @@ extern int register_filesystem(struct file_system_type *);
extern int unregister_filesystem(struct file_system_type *);
extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data);
#define kern_mount(type) kern_mount_data(type, NULL)
+extern struct vfsmount *kern_mount_special(struct file_system_type *);
extern int may_umount_tree(struct vfsmount *);
extern int may_umount(struct vfsmount *);
extern long do_mount(char *, char *, char *, unsigned long, void *);
diff --git a/include/linux/mount.h b/include/linux/mount.h
index cab2a85..cb4fa90 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -30,6 +30,7 @@ struct mnt_namespace;

#define MNT_SHRINKABLE 0x100
#define MNT_IMBALANCED_WRITE_COUNT 0x200 /* just for debugging */
+#define MNT_SPECIAL 0x400 /* special mount (pipes,sockets,...) */

#define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */
#define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */
@@ -73,7 +74,7 @@ struct vfsmount {

static inline struct vfsmount *mntget(struct vfsmount *mnt)
{
- if (mnt)
+ if (mnt && !(mnt->mnt_flags & MNT_SPECIAL))
atomic_inc(&mnt->mnt_count);
return mnt;
}
@@ -87,7 +88,7 @@ extern int __mnt_is_readonly(struct vfsmount *mnt);

static inline void mntput(struct vfsmount *mnt)
{
- if (mnt) {
+ if (mnt && !(mnt->mnt_flags & MNT_SPECIAL)) {
mnt->mnt_expiry_mark = 0;
mntput_no_expire(mnt);
}
diff --git a/net/socket.c b/net/socket.c
index 4177456..2857d70 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2204,7 +2204,7 @@ static int __init sock_init(void)

init_inodecache();
register_filesystem(&sock_fs_type);
- sock_mnt = kern_mount(&sock_fs_type);
+ sock_mnt = kern_mount_special(&sock_fs_type);
sock_mnt->mnt_sb->s_flags |= MS_SPECIAL;

/* The real protocol initialization is performed in later initcalls.


Attachments:
mnt_special.patch (3.27 kB)

2008-11-27 01:38:27

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP

On Thu, 27 Nov 2008, Eric Dumazet wrote:

> The last point is about SLUB being hit hard, unless we
> use slub_min_order=3 at boot, or we use Christoph Lameter
> patch (struct file RCU optimizations)
> http://thread.gmane.org/gmane.linux.kernel/418615
>
> If we boot machine with slub_min_order=3, SLUB overhead disappears.


I'd rather not be that drastic. Did you try increasing slub_min_objects
instead? Try 40-100. If we find the right number then we should update
the tuning to make sure that it picks the right slab page sizes.

2008-11-27 06:28:47

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP

Christoph Lameter wrote:
> On Thu, 27 Nov 2008, Eric Dumazet wrote:
>
>> The last point is about SLUB being hit hard, unless we
>> use slub_min_order=3 at boot, or we use Christoph Lameter
>> patch (struct file RCU optimizations)
>> http://thread.gmane.org/gmane.linux.kernel/418615
>>
>> If we boot machine with slub_min_order=3, SLUB overhead disappears.
>
>
> I'd rather not be that drastic. Did you try increasing slub_min_objects
> instead? Try 40-100. If we find the right number then we should update
> the tuning to make sure that it pickes the right slab page sizes.
>
>

4096/192 = 21

with slub_min_objects=22 :

# cat /sys/kernel/slab/filp/order
1
# time ./socket8
real 0m1.725s
user 0m0.685s
sys 0m12.955s

with slub_min_objects=45 :

# cat /sys/kernel/slab/filp/order
2
# time ./socket8
real 0m1.652s
user 0m0.694s
sys 0m12.367s

with slub_min_objects=80 :

# cat /sys/kernel/slab/filp/order
3
# time ./socket8
real 0m1.642s
user 0m0.719s
sys 0m12.315s

I would say slub_min_objects=45 is the optimal value on 32bit arches to
get acceptable performance on this workload (order=2 for filp kmem_cache)

Note : SLAB here is disastrous, but you already knew that :)

real 0m8.128s
user 0m0.748s
sys 1m3.467s

2008-11-27 08:21:00

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 5/6] fs: Introduce special inodes

From: Eric Dumazet <[email protected]>
Date: Thu, 27 Nov 2008 00:32:41 +0100

> Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
> inodes allocation/freeing.
>
> In new_inode(), we test if super block has MS_SPECIAL flag set.
> If yes, we dont put inode in "inode_in_use" list nor "sb->s_inodes" list
> As inode_lock was taken only to protect these lists, we avoid it as well
>
> Using iput_special() from dput_special() avoids taking inode_lock
> at freeing time.
>
> This patch has a very noticeable effect, because we avoid dirtying of three contended cache lines in new_inode(), and five cache lines
> in iput()
>
> Note: Not sure if we can use MS_SPECIAL=MS_NOUSER, or if we
> really need a different flag.
>
> (socket8 bench result : from 20.5s to 2.94s)
>
> Signed-off-by: Eric Dumazet <[email protected]>

No problem with networking part:

Acked-by: David S. Miller <[email protected]>

2008-11-27 08:21:25

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

From: Eric Dumazet <[email protected]>
Date: Thu, 27 Nov 2008 00:32:59 +0100

> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
> refcounting on permanent system vfs.
> Use this function for sockets, pipes, anonymous fds.
>
> (socket8 bench result : from 2.94s to 2.23s)
>
> Signed-off-by: Eric Dumazet <[email protected]>

For networking bits:

Acked-by: David S. Miller <[email protected]>

2008-11-27 09:33:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes

On Thu, 2008-11-27 at 00:32 +0100, Eric Dumazet wrote:
> Avoids cache line ping pongs between cpus and prepare next patch,
> because updates of nr_inodes metric dont need inode_lock anymore.
>
> (socket8 bench result : 25s to 20.5s)
>
> Signed-off-by: Eric Dumazet <[email protected]>
> ---

> @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex);
> * Statistics gathering..
> */
> struct inodes_stat_t inodes_stat;
> +static DEFINE_PER_CPU(int, nr_inodes);
>
> static struct kmem_cache * inode_cachep __read_mostly;
>
> +int get_nr_inodes(void)
> +{
> + int cpu;
> + int counter = 0;
> +
> + for_each_possible_cpu(cpu)
> + counter += per_cpu(nr_inodes, cpu);
> + if (counter < 0)
> + counter = 0;
> + return counter;
> +}

It would be good to get a cpu hotplug handler here and move to
for_each_online_cpu(). People are wanting distros to be built with
NR_CPUS=4096.

2008-11-27 09:40:11

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP


As I told you before, you absolutely must include the fsdevel list and
the VFS maintainer for a patchset like this.

2008-11-27 09:40:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes

On Thu, 2008-11-27 at 10:33 +0100, Peter Zijlstra wrote:
> On Thu, 2008-11-27 at 00:32 +0100, Eric Dumazet wrote:
> > Avoids cache line ping pongs between cpus and prepare next patch,
> > because updates of nr_inodes metric dont need inode_lock anymore.
> >
> > (socket8 bench result : 25s to 20.5s)
> >
> > Signed-off-by: Eric Dumazet <[email protected]>
> > ---
>
> > @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex);
> > * Statistics gathering..
> > */
> > struct inodes_stat_t inodes_stat;
> > +static DEFINE_PER_CPU(int, nr_inodes);
> >
> > static struct kmem_cache * inode_cachep __read_mostly;
> >
> > +int get_nr_inodes(void)
> > +{
> > + int cpu;
> > + int counter = 0;
> > +
> > + for_each_possible_cpu(cpu)
> > + counter += per_cpu(nr_inodes, cpu);
> > + if (counter < 0)
> > + counter = 0;
> > + return counter;
> > +}
>
> It would be good to get a cpu hotplug handler here and move to
> for_each_online_cpu(). People are wanting distros to be built with
> NR_CPUS=4096.

Also, this trade-off between global vs per_cpu only works if
get_nr_inodes() is called significantly less often than nr_inodes is changed.

With it being called from writeback that might not be true for all
workloads. One thing you can do about it is use the regular per-cpu
counter stuff, which allows you to do an approximation of the global
number (it also does all the hotplug stuff for you already).

2008-11-27 09:42:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Introduce a per_cpu nr_dentry

Looks good modulo the exact version of the for_each_cpu loops that the
experts in that area can help with. Same for the per_cpu nr_inodes
patch.

2008-11-27 09:46:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator

On Thu, Nov 27, 2008 at 12:32:24AM +0100, Eric Dumazet wrote:
> new_inode() dirties a contended cache line to get inode numbers.
>
> Solve this problem by providing to each cpu a per_cpu variable,
> feeded by the shared last_ino, but once every 1024 allocations.
>
> This reduce contention on the shared last_ino.
>
> Note : last_ino_get() method must be called with preemption
> disabled on SMP.

Looks a little clumsy. One idea might be to have a special slab for
synthetic inodes using new_inode, and only assign the inode number on the
first allocation, re-using it after that.
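
(One way to read that suggestion, as a sketch: give these inodes their own
kmem_cache and assign i_ino once, in the slab constructor, so a recycled
object keeps its number. The cache name and the hookup into alloc_inode()
are assumptions; note the ctor runs when a slab page is populated, so
last_ino_get()'s preemption requirements would need care here.)

/*
 * Sketch of a dedicated slab for synthetic inodes whose inode number
 * is assigned once per slab object and survives reuse.
 */
static struct kmem_cache *synth_inode_cachep;

static void synth_inode_ctor(void *obj)
{
	struct inode *inode = obj;

	inode_init_once(inode);
	inode->i_ino = last_ino_get();	/* assigned once, then re-used */
}

static int __init synth_inode_cache_init(void)
{
	synth_inode_cachep = kmem_cache_create("synth_inode",
					sizeof(struct inode), 0,
					SLAB_HWCACHE_ALIGN | SLAB_PANIC,
					synth_inode_ctor);
	return 0;
}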

2008-11-27 09:49:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes

On Thu, Nov 27, 2008 at 10:39:31AM +0100, Peter Zijlstra wrote:
> With it being called from writeback that might not be true for all
> workloads. One thing you can do about it is use the regular per-cpu
> counter stuff, which allows you to do an approximation of the global
> number (it also does all the hotplug stuff for you already).

The way it's used in writeback is utterly stupid and should be fixed :)

But otherwise agreed.

2008-11-27 09:53:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
> refcounting on permanent system vfs.
> Use this function for sockets, pipes, anonymous fds.

special is not a useful name for a flag, by definition everything that
needs a flag is special compared to the version that doesn't need a
flag.

The general idea of skipping the writer counts makes sense, but please
give it a descriptive name that explains the 'not unmountable' semantics.
And please kill your kern_mount wrapper and just set the flag manually.

Also I think it should be a superblock flag, not a mount flag as you
don't want these to differ for multiple mounts of the same filesystem.

2008-11-27 10:01:58

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes

Peter Zijlstra wrote:
> On Thu, 2008-11-27 at 00:32 +0100, Eric Dumazet wrote:
>> Avoids cache line ping pongs between cpus and prepare next patch,
>> because updates of nr_inodes metric dont need inode_lock anymore.
>>
>> (socket8 bench result : 25s to 20.5s)
>>
>> Signed-off-by: Eric Dumazet <[email protected]>
>> ---
>
>> @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex);
>> * Statistics gathering..
>> */
>> struct inodes_stat_t inodes_stat;
>> +static DEFINE_PER_CPU(int, nr_inodes);
>>
>> static struct kmem_cache * inode_cachep __read_mostly;
>>
>> +int get_nr_inodes(void)
>> +{
>> + int cpu;
>> + int counter = 0;
>> +
>> + for_each_possible_cpu(cpu)
>> + counter += per_cpu(nr_inodes, cpu);
>> + if (counter < 0)
>> + counter = 0;
>> + return counter;
>> +}
>
> It would be good to get a cpu hotplug handler here and move to
> for_each_online_cpu(). People are wanting distro's to be build with
> NR_CPUS=4096.

Hum, I guess we can use regular percpu_counter for this...

2008-11-27 10:05:28

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

Christoph Hellwig wrote:
> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
>> refcounting on permanent system vfs.
>> Use this function for sockets, pipes, anonymous fds.
>
> special is not a useful name for a flag; by definition everything that
> needs a flag is special compared to the version that doesn't need a
> flag.
>
> The general idea of skipping the writer counts makes sense, but please
> give it a descriptive name that explains the not-unmountable thing.
> And please kill your kern_mount wrapper and just set the flag manually.
>
> Also I think it should be a superblock flag, not a mount flag, as you
> don't want these to differ for multiple mounts of the same filesystem.
>
>

Hum... we have a superblock flag already, but testing it in mntput()/mntget()
is going to be a little bit expensive if we add a dereference?

	if (mnt && mnt->mnt_sb->s_flags & MS_SPECIAL) {
		...
	}

2008-11-27 10:07:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes

Peter Zijlstra <[email protected]> writes:
>>
>> +int get_nr_inodes(void)
>> +{
>> + int cpu;
>> + int counter = 0;
>> +
>> + for_each_possible_cpu(cpu)
>> + counter += per_cpu(nr_inodes, cpu);
>> + if (counter < 0)
>> + counter = 0;
>> + return counter;
>> +}
>
> It would be good to get a cpu hotplug handler here and move to
> for_each_online_cpu(). People are wanting distro's to be build with
> NR_CPUS=4096.

Doesn't matter, possible cpus is always only set to what the
machine supports.

-Andi
--
[email protected]

2008-11-27 10:10:44

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

On Thu, Nov 27, 2008 at 11:04:38AM +0100, Eric Dumazet wrote:
> Hum... we have a superblock flag already, but testing it in mntput()/mntget()
> is going to be a little bit expensive if we add a dereference?
>
> 	if (mnt && mnt->mnt_sb->s_flags & MS_SPECIAL) {
> 		...
> 	}

Well, run a benchmark to see if it makes any difference. And when it
does please always set the mount flag from the common mount code when
it's set on the superblock, and document that this is the only valid way
to set it.
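
[NOTE: i.e. something along these lines in the common mount path
(MS_SPECIAL/MNT_SPECIAL being the names used in this thread, not existing
flags):

	/* in the generic mount code, once the superblock is known */
	if (mnt->mnt_sb->s_flags & MS_SPECIAL)
		mnt->mnt_flags |= MNT_SPECIAL;

mntget()/mntput() then only ever test the mount flag, avoiding the extra
dereference Eric is worried about.]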

2008-11-27 14:45:57

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP

On Thu, 27 Nov 2008, Eric Dumazet wrote:

> with slub_min_objects=45 :
>
> # cat /sys/kernel/slab/filp/order
> 2
> # time ./socket8
> real 0m1.652s
> user 0m0.694s
> sys 0m12.367s

That may be a good value. How many processors do you have? Look at
calculate_order() in mm/slub.c:

	if (!min_objects)
		min_objects = 4 * (fls(nr_cpu_ids) + 1);

We could increase the scaling factor there or start
with a minimum of 20 objects?


Try

	min_objects = 20 + 4 * (fls(nr_cpu_ids) + 1);
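
[NOTE: worked out for the 8-cpu machine used in this thread (assuming
nr_cpu_ids == 8): fls(8) = 4, so the current formula gives
min_objects = 4 * (4 + 1) = 20, and the proposed one
20 + 4 * (4 + 1) = 40, still below the slub_min_objects=45 found
empirically above.]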

> I would say slub_min_objects=45 is the optimal value on 32bit arches to
> get acceptable performance on this workload (order=2 for filp kmem_cache)
>
> Note : SLAB here is disastrous, but you already knew that :)

It's good though to have examples where the queue management gets in the
way of performance.

2008-11-27 14:47:38

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes

On Thu, 27 Nov 2008, Peter Zijlstra wrote:

> It would be good to get a cpu hotplug handler here and move to
> for_each_online_cpu(). People are wanting distro's to be build with
> NR_CPUS=4096.

NR_CPUS=4096 does not necessarily increase the number of possible cpus.

2008-11-28 09:26:39

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
> refcounting on permanent system vfs.
> Use this function for sockets, pipes, anonymous fds.

IMO that's pushing it past the point of usefulness; unless you can show
that this really gives a considerable win on pipes et al. *AND* that it
doesn't hurt other loads...

dput() part: again, I want to see what happens on other loads; it's probably
fine (and the win is certainly bigger than from the mntput() change), but... The
thing is, atomic_dec_and_lock() in there is often done on dentries with
d_count > 1 and that's fairly cheap (and doesn't involve contention on
dcache_lock on sane targets).

FWIW, unless there's a really good reason to do alpha atomic_dec_and_lock()
in a special way, I'd try to compare with

	if (atomic_add_unless(&dentry->d_count, -1, 1))
		return;
	if (your flag)
		sod off to special
	spin_lock(&dcache_lock);
	if (atomic_dec_and_test(&dentry->d_count)) {
		spin_unlock(&dcache_lock);
		return;
	}
	the rest as usual

As for the alpha... unless I'm misreading the assembler in
arch/alpha/lib/dec_and_lock.c, it looks like we have essentially an
implementation of atomic_add_unless() in there and one that just
might be better than what we've got in arch/alpha/include/asm/atomic.h.
How about

	1:	ldl_l	x, addr
		cmpne	x, u, y		/* y = x != u */
		beq	y, 3f		/* if !y -> bugger off, return 0 */
		addl	x, a, y
		stl_c	y, addr		/* y <- *addr has not changed since ldl_l */
		beq	y, 2f
	3:				/* return value is in y */
	.subsection 2			/* out of the way */
	2:	br	1b
	.previous

for atomic_add_unless() guts? With that we are rid of HAVE_DEC_LOCK and
get a uniform implementation of atomic_dec_and_lock() for all targets...

AFAICS, that would be

static __inline__ int atomic_add_unless(atomic_t *v, int a, int u)
{
	unsigned long temp, res;
	__asm__ __volatile__(
	"1:	ldl_l %0,%1\n"
	"	cmpne %0,%4,%2\n"
	"	beq %4,3f\n"
	"	addl %0,%3,%4\n"
	"	stl_c %2,%1\n"
	"	beq %2,2f\n"
	"3:\n"
	".subsection 2\n"
	"2:	br 1b\n"
	".previous"
	:"=&r" (temp), "=m" (v->counter), "=&r" (res)
	:"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
	smp_mb();
	return res;
}

static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u)
{
	unsigned long temp, res;
	__asm__ __volatile__(
	"1:	ldq_l %0,%1\n"
	"	cmpne %0,%4,%2\n"
	"	beq %4,3f\n"
	"	addq %0,%3,%4\n"
	"	stq_c %2,%1\n"
	"	beq %2,2f\n"
	"3:\n"
	".subsection 2\n"
	"2:	br 1b\n"
	".previous"
	:"=&r" (temp), "=m" (v->counter), "=&r" (res)
	:"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
	smp_mb();
	return res;
}

Comments?
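
[NOTE: for context, the uniform atomic_dec_and_lock() this would enable is
essentially the generic implementation in lib/dec_and_lock.c:

	int _atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
	{
		/* fast path: while count > 1 the decrement needs no lock */
		if (atomic_add_unless(atomic, -1, 1))
			return 0;

		/* slow path: count may reach 0; decide under the lock */
		spin_lock(lock);
		if (atomic_dec_and_test(atomic))
			return 1;	/* hit 0: return with lock held */
		spin_unlock(lock);
		return 0;
	}

The alpha-specific HAVE_DEC_LOCK version is the only thing standing in the
way of using it everywhere.]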

2008-11-28 09:34:40

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

On Fri, Nov 28, 2008 at 09:26:04AM +0000, Al Viro wrote:

gyah... That would be

> static __inline__ int atomic_add_unless(atomic_t *v, int a, int u)
> {
> unsigned long temp, res;
> __asm__ __volatile__(
> "1: ldl_l %0,%1\n"
> " cmpne %0,%4,%2\n"
" beq %2,3f\n"
" addl %0,%3,%2\n"
> " stl_c %2,%1\n"
> " beq %2,2f\n"
> "3:\n"
> ".subsection 2\n"
> "2: br 1b\n"
> ".previous"
> :"=&r" (temp), "=m" (v->counter), "=&r" (res)
> :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
> smp_mb();
> return res;
> }
>
> static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u)
> {
> unsigned long temp, res;
> __asm__ __volatile__(
> "1: ldq_l %0,%1\n"
> " cmpne %0,%4,%2\n"
" beq %2,3f\n"
" addq %0,%3,%2\n"
> " stq_c %2,%1\n"
> " beq %2,2f\n"
> "3:\n"
> ".subsection 2\n"
> "2: br 1b\n"
> ".previous"
> :"=&r" (temp), "=m" (v->counter), "=&r" (res)
> :"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
> smp_mb();
> return res;
> }
>
> Comments?

2008-11-28 18:03:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs


* Al Viro <[email protected]> wrote:

> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
> > This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
> > refcounting on permanent system vfs.
> > Use this function for sockets, pipes, anonymous fds.
>
> IMO that's pushing it past the point of usefulness; unless you can show
> that this really gives considerable win on pipes et.al. *AND* that it
> doesn't hurt other loads...

The numbers look pretty convincing:

> > (socket8 bench result : from 2.94s to 2.23s)

And I wouldn't expect it to hurt real-filesystem workloads.

Here's a contemporary trace of a typical ext3 sys_open():

0) | sys_open() {
0) | do_sys_open() {
0) | getname() {
0) 0.367 us | kmem_cache_alloc();
0) | strncpy_from_user() {
0) | _cond_resched() {
0) | need_resched() {
0) 0.363 us | constant_test_bit();
0) 1. 47 us | }
0) 1.815 us | }
0) 2.587 us | }
0) 4. 22 us | }
0) | alloc_fd() {
0) 0.480 us | _spin_lock();
0) 0.487 us | expand_files();
0) 2.356 us | }
0) | do_filp_open() {
0) | path_lookup_open() {
0) | get_empty_filp() {
0) 0.439 us | kmem_cache_alloc();
0) | security_file_alloc() {
0) 0.316 us | cap_file_alloc_security();
0) 1. 87 us | }
0) 3.189 us | }
0) | do_path_lookup() {
0) 0.366 us | _read_lock();
0) | path_walk() {
0) | __link_path_walk() {
0) | inode_permission() {
0) | ext3_permission() {
0) 0.441 us | generic_permission();
0) 1.247 us | }
0) | security_inode_permission() {
0) 0.411 us | cap_inode_permission();
0) 1.186 us | }
0) 3.555 us | }
0) | do_lookup() {
0) | __d_lookup() {
0) 0.486 us | _spin_lock();
0) 1.369 us | }
0) 0.442 us | __follow_mount();
0) 3. 14 us | }
0) | path_to_nameidata() {
0) 0.476 us | dput();
0) 1.235 us | }
0) | inode_permission() {
0) | ext3_permission() {
0) | generic_permission() {
0) | in_group_p() {
0) 0.410 us | groups_search();
0) 1.172 us | }
0) 1.994 us | }
0) 2.789 us | }
0) | security_inode_permission() {
0) 0.454 us | cap_inode_permission();
0) 1.238 us | }
0) 5.262 us | }
0) | do_lookup() {
0) | __d_lookup() {
0) 0.480 us | _spin_lock();
0) 1.621 us | }
0) 0.456 us | __follow_mount();
0) 3.215 us | }
0) | path_to_nameidata() {
0) 0.420 us | dput();
0) 1.193 us | }
0) + 23.551 us | }
0) | path_put() {
0) 0.420 us | dput();
0) | mntput() {
0) 0.359 us | mntput_no_expire();
0) 1. 50 us | }
0) 2.544 us | }
0) + 27.253 us | }
0) + 28.850 us | }
0) + 33.217 us | }
0) | may_open() {
0) | inode_permission() {
0) | ext3_permission() {
0) 0.480 us | generic_permission();
0) 1.229 us | }
0) | security_inode_permission() {
0) 0.405 us | cap_inode_permission();
0) 1.196 us | }
0) 3.589 us | }
0) 4.600 us | }
0) | nameidata_to_filp() {
0) | __dentry_open() {
0) | file_move() {
0) 0.470 us | _spin_lock();
0) 1.243 us | }
0) | security_dentry_open() {
0) 0.344 us | cap_dentry_open();
0) 1.139 us | }
0) 0.412 us | generic_file_open();
0) 0.561 us | file_ra_state_init();
0) 5.714 us | }
0) 6.483 us | }
0) + 46.494 us | }
0) 0.453 us | inotify_dentry_parent_queue_event();
0) 0.403 us | inotify_inode_queue_event();
0) | fd_install() {
0) 0.440 us | _spin_lock();
0) 1.247 us | }
0) | putname() {
0) | kmem_cache_free() {
0) | virt_to_head_page() {
0) 0.369 us | constant_test_bit();
0) 1. 23 us | }
0) 1.738 us | }
0) 2.422 us | }
0) + 60.560 us | }
0) + 61.368 us | }

and here's a sys_close():

0) | sys_close() {
0) 0.540 us | _spin_lock();
0) | filp_close() {
0) 0.437 us | dnotify_flush();
0) 0.401 us | locks_remove_posix();
0) 0.349 us | fput();
0) 2.679 us | }
0) 4.452 us | }

I'd be surprised to see a flag show up in that codepath. Eric, does
your testing confirm that?

Ingo

2008-11-28 18:03:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP


* Eric Dumazet <[email protected]> wrote:

> Hi all
>
> Short summary : Nice speedups for allocation/deallocation of sockets/pipes
> (From 27.5 seconds to 1.6 second)

Wow, that's incredibly impressive! :-)

Ingo

2008-11-28 18:48:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP

On Fri, 2008-11-28 at 19:03 +0100, Ingo Molnar wrote:
> * Eric Dumazet <[email protected]> wrote:
>
> > Hi all
> >
> > Short summary : Nice speedups for allocation/deallocation of sockets/pipes
> > (From 27.5 seconds to 1.6 second)
>
> Wow, that's incredibly impressive! :-)

Yeah, we got a similar speedup on -rt by pushing those super-block files
lists into per-cpu lists and doing crazy locking on them.

Of course avoiding them altogether, as done here, is a nicer option,
but that is sadly not a possibility for regular files (until hch gets
around to removing the need for the list).


2008-11-28 18:59:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs


* Ingo Molnar <[email protected]> wrote:

> And I wouldn't expect it to hurt real-filesystem workloads.
>
> Here's a contemporary trace of a typical ext3 sys_open():

here's a sys_open() that has to touch atime:

0) | sys_open() {
0) | do_sys_open() {
0) | getname() {
0) 0.377 us | kmem_cache_alloc();
0) | strncpy_from_user() {
0) | _cond_resched() {
0) | need_resched() {
0) 0.353 us | constant_test_bit();
0) 1. 45 us | }
0) 1.739 us | }
0) 2.492 us | }
0) 3.934 us | }
0) | alloc_fd() {
0) 0.374 us | _spin_lock();
0) 0.447 us | expand_files();
0) 2.124 us | }
0) | do_filp_open() {
0) | path_lookup_open() {
0) | get_empty_filp() {
0) 0.689 us | kmem_cache_alloc();
0) | security_file_alloc() {
0) 0.327 us | cap_file_alloc_security();
0) 1. 71 us | }
0) 2.869 us | }
0) | do_path_lookup() {
0) 0.460 us | _read_lock();
0) | path_walk() {
0) | __link_path_walk() {
0) | inode_permission() {
0) | ext3_permission() {
0) 0.434 us | generic_permission();
0) 1.191 us | }
0) | security_inode_permission() {
0) 0.400 us | cap_inode_permission();
0) 1.130 us | }
0) 3.453 us | }
0) | do_lookup() {
0) | __d_lookup() {
0) 0.489 us | _spin_lock();
0) 1.525 us | }
0) 0.449 us | __follow_mount();
0) 3.115 us | }
0) | path_to_nameidata() {
0) 0.422 us | dput();
0) 1.204 us | }
0) | inode_permission() {
0) | ext3_permission() {
0) 0.391 us | generic_permission();
0) 1.223 us | }
0) | security_inode_permission() {
0) 0.406 us | cap_inode_permission();
0) 1.189 us | }
0) 3.565 us | }
0) | do_lookup() {
0) | __d_lookup() {
0) 0.527 us | _spin_lock();
0) 1.633 us | }
0) 0.440 us | __follow_mount();
0) 3.223 us | }
0) | do_follow_link() {
0) | _cond_resched() {
0) | need_resched() {
0) 0.361 us | constant_test_bit();
0) 1. 64 us | }
0) 1.749 us | }
0) | security_inode_follow_link() {
0) 0.390 us | cap_inode_follow_link();
0) 1.260 us | }
0) | touch_atime() {
0) | mnt_want_write() {
0) 0.360 us | _spin_lock();
0) 1.137 us | }
0) | mnt_drop_write() {
0) 0.348 us | _spin_lock();
0) 1.102 us | }
0) 3.402 us | }
0) 0.446 us | ext3_follow_link();
0) | __link_path_walk() {
0) | inode_permission() {
0) | ext3_permission() {
0) | generic_permission() {
0) 4.481 us | }
0) | security_inode_permission() {
0) 0.402 us | cap_inode_permission();
0) 1.127 us | }
0) 6.747 us | }
0) | do_lookup() {
0) | __d_lookup() {
0) 0.547 us | _spin_lock();
0) 1.758 us | }
0) 0.465 us | __follow_mount();
0) 3.368 us | }
0) | path_to_nameidata() {
0) 0.419 us | dput();
0) 1.203 us | }
0) + 13. 40 us | }
0) | path_put() {
0) 0.429 us | dput();
0) | mntput() {
0) 0.367 us | mntput_no_expire();
0) 1.130 us | }
0) 2.660 us | }
0) | path_put() {
0) | dput() {
0) | _cond_resched() {
0) | need_resched() {
0) 0.382 us | constant_test_bit();
0) 1. 67 us | }
0) 1.808 us | }
0) 0.399 us | _spin_lock();
0) 0.452 us | _spin_lock();
0) 4.270 us | }
0) | mntput() {
0) 0.375 us | mntput_no_expire();
0) 1. 62 us | }
0) 6.547 us | }
0) + 32.702 us | }
0) + 50.413 us | }
0) | path_put() {
0) 0.421 us | dput();
0) | mntput() {
0) 0.364 us | mntput_no_expire();
0) 1. 64 us | }
0) 2.545 us | }
0) + 54.147 us | }
0) + 55.780 us | }
0) + 59.714 us | }
0) | may_open() {
0) | inode_permission() {
0) | ext3_permission() {
0) 0.406 us | generic_permission();
0) 1.189 us | }
0) | security_inode_permission() {
0) 0.388 us | cap_inode_permission();
0) 1.175 us | }
0) 3.498 us | }
0) 4.328 us | }
0) | nameidata_to_filp() {
0) | __dentry_open() {
0) | file_move() {
0) 0.361 us | _spin_lock();
0) 1.102 us | }
0) | security_dentry_open() {
0) 0.356 us | cap_dentry_open();
0) 1.121 us | }
0) 0.400 us | generic_file_open();
0) 0.544 us | file_ra_state_init();
0) 5. 11 us | }
0) 5.709 us | }
0) + 71.181 us | }
0) 0.453 us | inotify_dentry_parent_queue_event();
0) 0.403 us | inotify_inode_queue_event();
0) | fd_install() {
0) 0.411 us | _spin_lock();
0) 1.217 us | }
0) | putname() {
0) | kmem_cache_free() {
0) | virt_to_head_page() {
0) 0.371 us | constant_test_bit();
0) 1. 47 us | }
0) 1.752 us | }
0) 2.446 us | }
0) + 84.676 us | }
0) + 85.365 us | }

Ingo

2008-11-28 22:21:52

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

Ingo Molnar wrote:
> * Al Viro <[email protected]> wrote:
>
>> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
>>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
>>> refcounting on permanent system vfs.
>>> Use this function for sockets, pipes, anonymous fds.
>> IMO that's pushing it past the point of usefulness; unless you can show
>> that this really gives considerable win on pipes et.al. *AND* that it
>> doesn't hurt other loads...
>
> The numbers look pretty convincing:
>
>>> (socket8 bench result : from 2.94s to 2.23s)
>
> And I wouldn't expect it to hurt real-filesystem workloads.
>
> Here's a contemporary trace of a typical ext3 sys_open():
>
> [full ftrace output of sys_open() and sys_close() snipped; quoted
> verbatim from the message above]
>
> I'd be surprised to see a flag show up in that codepath. Eric, does
> your testing confirm that?

On a socket/pipe, definitely no, because inode->i_sb->s_flags is not contended.

But on a shared inode, it might hurt :

offsetof(struct inode, i_count)=0x24
offsetof(struct inode, i_lock)=0x70
offsetof(struct inode, i_sb)=0x9c
offsetof(struct inode, i_writecount)=0x144

So i_sb sits in a probably contended cache line.

I wonder why i_writecount sits so far from i_count; that doesn't make sense.

2008-11-28 22:38:50

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

Al Viro wrote:
> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
>> refcounting on permanent system vfs.
>> Use this function for sockets, pipes, anonymous fds.
>
> IMO that's pushing it past the point of usefulness; unless you can show
> that this really gives considerable win on pipes et.al. *AND* that it
> doesn't hurt other loads...

Well, if this is the last cache line that might be shared, then yes, numbers can talk.
But going from 10 dirtied cache lines down to 1, instead of 0, is OK I guess.

>
> dput() part: again, I want to see what happens on other loads; it's probably
> fine (and win is certainly more than from mntput() change), but... The
> thing is, atomic_dec_and_lock() in there is often done on dentries with
> d_count > 1 and that's fairly cheap (and doesn't involve contention on
> dcache_lock on sane targets).
>
> FWIW, unless there's a really good reason to do alpha atomic_dec_and_lock()
> in a special way, I'd try to compare with

> 	if (atomic_add_unless(&dentry->d_count, -1, 1))
> 		return;

I don't know, but *reading* d_count before trying to write it is expensive
on modern cpus. Oprofile clearly shows that on Intel Core2.

Then, *testing* the flag before doing the atomic_something() has the same
problem. Or we should put the flag in a different cache line.

I am lazy (time for sleep here), but maybe we are smart here and use a
trick like this already?

	atomic_t atomic_read_with_write_intent(atomic_t *v)
	{
		int val = 0;
		/*
		 * No LOCK prefix here, we only give a write intent hint to cpu
		 */
		asm volatile("xaddl %0, %1"
			     : "+r" (val), "+m" (v->counter)
			     : : "memory");
		return val;
	}



> 	if (your flag)
> 		sod off to special
> 	spin_lock(&dcache_lock);
> 	if (atomic_dec_and_test(&dentry->d_count)) {
> 		spin_unlock(&dcache_lock);
> 		return;
> 	}
> 	the rest as usual
>

2008-11-28 22:44:24

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

Eric Dumazet wrote:
> Al Viro wrote:
>> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
>>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
>>> refcounting on permanent system vfs.
>>> Use this function for sockets, pipes, anonymous fds.
>>
>> IMO that's pushing it past the point of usefulness; unless you can show
>> that this really gives considerable win on pipes et.al. *AND* that it
>> doesn't hurt other loads...
>
> Well, if this is the last cache line that might be shared, then yes,
> numbers can talk.
> But going from 10 dirtied cache lines down to 1, instead of 0, is OK I guess.
>
>>
>> dput() part: again, I want to see what happens on other loads; it's
>> probably
>> fine (and win is certainly more than from mntput() change), but... The
>> thing is, atomic_dec_and_lock() in there is often done on dentries with
>> d_count > 1 and that's fairly cheap (and doesn't involve contention on
>> dcache_lock on sane targets).
>>
>> FWIW, unless there's a really good reason to do alpha
>> atomic_dec_and_lock()
>> in a special way, I'd try to compare with
>
>> 	if (atomic_add_unless(&dentry->d_count, -1, 1))
>> 		return;
>
> I don't know, but *reading* d_count before trying to write it is expensive
> on modern cpus. Oprofile clearly shows that on Intel Core2.
>
> Then, *testing* the flag before doing the atomic_something() has the same
> problem. Or we should put the flag in a different cache line.
>
> I am lazy (time for sleep here), but maybe we are smart here and use a
> trick like this already?
>
> 	atomic_t atomic_read_with_write_intent(atomic_t *v)
> 	{
> 		int val = 0;
> 		/*
> 		 * No LOCK prefix here, we only give a write intent hint to cpu
> 		 */
> 		asm volatile("xaddl %0, %1"
> 			     : "+r" (val), "+m" (v->counter)
> 			     : : "memory");
> 		return val;
> 	}

Forget it, it's wrong... I really need to sleep :)

2008-11-29 06:38:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP

On Fri, Nov 28, 2008 at 07:47:56PM +0100, Peter Zijlstra wrote:
> > Wow, that's incredibly impressive! :-)
>
> Yeah, we got a similar speedup on -rt by pushing those super-block files
> list into per-cpu lists and doing crazy locking on them.
>
> Of course avoiding them all together, like done here is a nicer option
> but is sadly not a possibility for regular files (until hch gets around
> to removing the need for the list).

We should have finished this long ago, thanks for the reminder.

2008-11-29 08:08:07

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP

Christoph Hellwig wrote:
> On Fri, Nov 28, 2008 at 07:47:56PM +0100, Peter Zijlstra wrote:
>>> Wow, that's incredibly impressive! :-)
>> Yeah, we got a similar speedup on -rt by pushing those super-block files
>> list into per-cpu lists and doing crazy locking on them.
>>
>> Of course avoiding them all together, like done here is a nicer option
>> but is sadly not a possibility for regular files (until hch gets around
>> to removing the need for the list).
>
> We should have finished this long ago, thanks for the reminder.
>
>

inode_in_use could be percpu, at least.

Or just zap it, since we never have to scan it.

2008-11-29 08:44:15

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP

Hi all

Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(From 27.5 seconds to 2.9 seconds (2.3 seconds with SLUB tweaks))

Long version :

For this second version, I removed the mntput()/mntget() optimization
since most reviewers are not convinced it is useful.
This is a four-line patch that can be reconsidered later.

I chose the name SINGLE instead of SPECIAL for isolated dentries
(for sockets, pipes, anonymous fds) that have no parent and no
relationship in the vfs.

Thanks all

To allocate a socket or a pipe we :

0) Do the usual file table manipulation (pretty scalable these days,
   but it would be faster if 'struct files' used SLAB_DESTROY_BY_RCU
   and avoided the call_rcu() cache killer)

1) allocate an inode with new_inode()
   This function :
   - locks inode_lock,
   - dirties the nr_inodes counter
   - dirties the inode_in_use list (for sockets/pipes, this is useless)
   - dirties the superblock s_inodes list
   - dirties the last_ino counter
   All these are in different cache lines, unfortunately.

2) allocate a dentry
   d_alloc() takes dcache_lock,
   inserts the dentry on its parent's list (dirtying sock_mnt->mnt_sb->s_root),
   and dirties nr_dentry

3) d_instantiate() the dentry (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount


At close() time, we must undo all of this. It is even more expensive because
of the _atomic_dec_and_lock() that stresses a lot, and because of the two
cache lines that are touched when an element is deleted from a list
(previous and next items).

This is really bad, since sockets/pipes don't need to be visible in the
dcache or in a per-superblock inode list.

This patch series gets rid of all but one contended cache line for
sockets, pipes and anonymous fds (signalfd, timerfd, ...)

Sample program :

	for (i = 0; i < 1000000; i++)
		close(socket(AF_INET, SOCK_STREAM, 0));
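
[NOTE: a self-contained version of that sample, for anyone wanting to
reproduce the socket8-style numbers (run 8 copies in parallel under
time(1) for the 8-process figures):

	#include <sys/socket.h>
	#include <unistd.h>

	int main(void)
	{
		int i;

		/* allocate and free 1 million sockets; this hammers
		 * new_inode()/d_alloc()/d_instantiate() and dput()/iput() */
		for (i = 0; i < 1000000; i++)
			close(socket(AF_INET, SOCK_STREAM, 0));
		return 0;
	}
]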

Cost if one cpu runs the program :

real 1.561s
user 0.092s
sys 1.469s

Cost if 8 processes are launched on an 8 CPU machine
(benchmark named socket8) :

real 27.496s <<<< !!!! >>>>
user 0.657s
sys 3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
3347352 3347352 28.0232 28.0232 _atomic_dec_and_lock
3301428 6648780 27.6388 55.6620 d_instantiate
2971130 9619910 24.8736 80.5355 d_alloc
241318 9861228 2.0203 82.5558 init_file
146190 10007418 1.2239 83.7797 __slab_free
144149 10151567 1.2068 84.9864 inotify_d_instantiate
143971 10295538 1.2053 86.1917 inet_create
137168 10432706 1.1483 87.3401 new_inode
117549 10550255 0.9841 88.3242 add_partial
110795 10661050 0.9275 89.2517 generic_drop_inode
107137 10768187 0.8969 90.1486 kmem_cache_alloc
94029 10862216 0.7872 90.9358 tcp_close
82837 10945053 0.6935 91.6293 dput
67486 11012539 0.5650 92.1943 dentry_iput
57751 11070290 0.4835 92.6778 iput
54327 11124617 0.4548 93.1326 tcp_v4_init_sock
49921 11174538 0.4179 93.5505 sysenter_past_esp
47616 11222154 0.3986 93.9491 kmem_cache_free
30792 11252946 0.2578 94.2069 clear_inode
27540 11280486 0.2306 94.4375 copy_from_user
26509 11306995 0.2219 94.6594 init_timer
26363 11333358 0.2207 94.8801 discard_slab
25284 11358642 0.2117 95.0918 __fput
22482 11381124 0.1882 95.2800 __percpu_counter_add
20369 11401493 0.1705 95.4505 sock_alloc
18501 11419994 0.1549 95.6054 inet_csk_destroy_sock
17923 11437917 0.1500 95.7555 sys_close


This patch series avoids all contended cache lines and makes this "bench"
pretty fast.


New cost if run on one cpu :

real 1.325s (instead of 1.561s)
user 0.091s
sys 1.234s


If run on 8 CPUS :

real 0m2.971s
user 0m0.726s
sys 0m21.310s

CPU: Core 2, speed 3000.04 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100
000
samples cum. samples % cum. % symbol name
189772 189772 12.7205 12.7205 _atomic_dec_and_lock
140467 330239 9.4155 22.1360 __slab_free
128210 458449 8.5940 30.7300 add_partial
121578 580027 8.1494 38.8794 kmem_cache_alloc
72626 652653 4.8681 43.7475 init_file
62720 715373 4.2041 47.9517 __percpu_counter_add
51632 767005 3.4609 51.4126 sysenter_past_esp
49196 816201 3.2976 54.7102 tcp_close
47933 864134 3.2130 57.9231 kmem_cache_free
29628 893762 1.9860 59.9091 copy_from_user
28443 922205 1.9065 61.8157 init_timer
25602 947807 1.7161 63.5318 __slab_alloc
22139 969946 1.4840 65.0158 discard_slab
20428 990374 1.3693 66.3851 __call_rcu
18174 1008548 1.2182 67.6033 alloc_fd
17643 1026191 1.1826 68.7859 __fput
17374 1043565 1.1646 69.9505 d_alloc
17196 1060761 1.1527 71.1031 sys_close
17024 1077785 1.1411 72.2442 inet_create
15208 1092993 1.0194 73.2636 alloc_inode
12201 1105194 0.8178 74.0815 fd_install
12167 1117361 0.8156 74.8970 lock_sock_nested
12123 1129484 0.8126 75.7096 get_empty_filp
11648 1141132 0.7808 76.4904 release_sock
11509 1152641 0.7715 77.2619 dput
11335 1163976 0.7598 78.0216 sock_init_data
11038 1175014 0.7399 78.7615 inet_csk_destroy_sock
10880 1185894 0.7293 79.4908 drop_file_write_access
10083 1195977 0.6759 80.1667 inotify_d_instantiate
9216 1205193 0.6178 80.7844 local_bh_enable_ip
8881 1214074 0.5953 81.3797 sysenter_do_call
8759 1222833 0.5871 81.9668 setup_object
8489 1231322 0.5690 82.5359 iput_single

So we now hit mntput()/mntget() and SLUB.

The last point is about SLUB being hit hard, unless we
use slub_min_order=3 (or slub_min_objects=45) at boot,
or we use Christoph Lameter's patch (struct file RCU optimizations)
http://thread.gmane.org/gmane.linux.kernel/418615

If we boot machine with slub_min_order=3, SLUB overhead disappears.

If run on 8 CPUS :

real 0m2.315s
user 0m0.752s
sys 0m17.324s

CPU: Core 2, speed 3000.15 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
199409 199409 15.6440 15.6440 _atomic_dec_and_lock (mntput())
141606 341015 11.1092 26.7532 kmem_cache_alloc
76071 417086 5.9679 32.7211 init_file
70595 487681 5.5383 38.2595 __percpu_counter_add
51595 539276 4.0477 42.3072 sysenter_past_esp
49313 588589 3.8687 46.1759 tcp_close
45503 634092 3.5698 49.7457 kmem_cache_free
41413 675505 3.2489 52.9946 __slab_free
29911 705416 2.3466 55.3412 copy_from_user
28979 734395 2.2735 57.6146 init_timer
22251 756646 1.7456 59.3602 get_empty_filp
19942 776588 1.5645 60.9247 __call_rcu
18348 794936 1.4394 62.3642 __fput
18328 813264 1.4379 63.8020 alloc_fd
17395 830659 1.3647 65.1667 sys_close
17301 847960 1.3573 66.5240 d_alloc
16570 864530 1.2999 67.8239 inet_create
15522 880052 1.2177 69.0417 alloc_inode
13185 893237 1.0344 70.0761 setup_object
12359 905596 0.9696 71.0456 fd_install
12275 917871 0.9630 72.0086 lock_sock_nested
11924 929795 0.9355 72.9441 release_sock
11790 941585 0.9249 73.8690 sock_init_data
11310 952895 0.8873 74.7563 dput
10924 963819 0.8570 75.6133 drop_file_write_access
10903 974722 0.8554 76.4687 inet_csk_destroy_sock
10184 984906 0.7990 77.2676 inotify_d_instantiate
9372 994278 0.7353 78.0029 local_bh_enable_ip
8901 1003179 0.6983 78.7012 sysenter_do_call
8569 1011748 0.6723 79.3735 iput_single
8194 1019942 0.6428 80.0163 inet_release


This patch series contains 5 patches, against the net-next-2.6 tree
(this tree already contains network improvements on this subject,
but the series should apply to other trees)

[PATCH 1/5] fs: Use a percpu_counter to track nr_dentry

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no longer needed to protect dentry_stat.nr_dentry

We centralize nr_dentry updates at the right places :
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if the parent is NULL

(socket8 bench result : 27.5s to 25s)

[PATCH 2/5] fs: Use a percpu_counter to track nr_inodes

Avoids cache line ping pongs between cpus and prepares the next patch,
because updates of nr_inodes don't need inode_lock anymore.

(socket8 bench result : no difference at this point)

[PATCH 3/5] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by providing each cpu a per_cpu variable,
fed by the shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino, and gives the same
spread of ino numbers as before.
(same wraparound after 2^32 allocations)

(socket8 bench result : no difference)
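
[NOTE: concretely, each cpu hands out 1024 consecutive inode numbers from
its private range, then grabs the next block of 1024 from the shared
atomic counter, so the shared cache line is dirtied only once per 1024
allocations per cpu.]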


[PATCH 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into the dcache,
since there is no way someone can look up such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries is expensive, because we
currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.


Differences between a SINGLE dentry and a normal one are :

1) A SINGLE dentry has the DCACHE_SINGLE flag
2) A SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
   This avoids taking a reference on the sb 'root' dentry, shared
   by too many dentries.
3) They are not hashed into the global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

(socket8 bench result : from 25s to 19.9s)

[PATCH 5/5] fs: new_inode_single() and iput_single()

The goal of this patch is to not touch inode_lock for socket/pipe/anonfd
inode allocation/freeing.

SINGLE dentries are attached to inodes that don't need to be linked
in a list of inodes, be it "inode_in_use" or "sb->s_inodes".
As inode_lock was taken only to protect these lists, we can avoid
taking it as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying
three contended cache lines in new_inode(), and five cache lines
in iput().

(socket8 bench result : from 19.9s to 2.3s)


Signed-off-by: Eric Dumazet <[email protected]>
---
Overall diffstat :

fs/anon_inodes.c | 18 ------
fs/dcache.c | 100 ++++++++++++++++++++++++++++++--------
fs/fs-writeback.c | 2
fs/inode.c | 101 +++++++++++++++++++++++++++++++--------
fs/pipe.c | 25 +--------
include/linux/dcache.h | 9 +++
include/linux/fs.h | 17 ++++++
kernel/sysctl.c | 6 +-
mm/page-writeback.c | 2
net/socket.c | 26 +---------
10 files changed, 200 insertions(+), 106 deletions(-)

2008-11-29 08:44:46

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry

diff --git a/fs/dcache.c b/fs/dcache.c
index a1d86c7..46d5d1e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly;
static unsigned int d_hash_mask __read_mostly;
static unsigned int d_hash_shift __read_mostly;
static struct hlist_head *dentry_hashtable __read_mostly;
+static struct percpu_counter nr_dentry;

/* Statistics gathering. */
struct dentry_stat_t dentry_stat = {
.age_limit = 45,
};

+/*
+ * Handle nr_dentry sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry);
+ return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return -ENOSYS;
+}
+#endif
+
static void __d_free(struct dentry *dentry)
{
WARN_ON(!list_empty(&dentry->d_alias));
@@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head)
}

/*
- * no dcache_lock, please. The caller must decrement dentry_stat.nr_dentry
- * inside dcache_lock.
+ * no dcache_lock, please.
*/
static void d_free(struct dentry *dentry)
{
@@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry)
__d_free(dentry);
else
call_rcu(&dentry->d_u.d_rcu, d_callback);
+ percpu_counter_dec(&nr_dentry);
}

/*
@@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry)
struct dentry *parent;

list_del(&dentry->d_u.d_child);
- dentry_stat.nr_dentry--; /* For d_free, below */
/*drops the locks, at that point nobody can reach this dentry */
dentry_iput(dentry);
if (IS_ROOT(dentry))
@@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb)
static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
{
struct dentry *parent;
- unsigned detached = 0;

BUG_ON(!IS_ROOT(dentry));

@@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
}

list_del(&dentry->d_u.d_child);
- detached++;

inode = dentry->d_inode;
if (inode) {
@@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
* otherwise we ascend to the parent and move to the
* next sibling if there is one */
if (!parent)
- goto out;
+ return;

dentry = parent;

@@ -705,11 +721,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
dentry = list_entry(dentry->d_subdirs.next,
struct dentry, d_u.d_child);
}
-out:
- /* several dentries were freed, need to correct nr_dentry */
- spin_lock(&dcache_lock);
- dentry_stat.nr_dentry -= detached;
- spin_unlock(&dcache_lock);
}

/*
@@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
dentry->d_flags = DCACHE_UNHASHED;
spin_lock_init(&dentry->d_lock);
dentry->d_inode = NULL;
- dentry->d_parent = NULL;
- dentry->d_sb = NULL;
dentry->d_op = NULL;
dentry->d_fsdata = NULL;
dentry->d_mounted = 0;
@@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
if (parent) {
dentry->d_parent = dget(parent);
dentry->d_sb = parent->d_sb;
+ spin_lock(&dcache_lock);
+ list_add(&dentry->d_u.d_child, &parent->d_subdirs);
+ spin_unlock(&dcache_lock);
} else {
+ dentry->d_parent = NULL;
+ dentry->d_sb = NULL;
INIT_LIST_HEAD(&dentry->d_u.d_child);
}
-
- spin_lock(&dcache_lock);
- if (parent)
- list_add(&dentry->d_u.d_child, &parent->d_subdirs);
- dentry_stat.nr_dentry++;
- spin_unlock(&dcache_lock);
-
+ percpu_counter_inc(&nr_dentry);
return dentry;
}

@@ -2282,6 +2290,7 @@ static void __init dcache_init(void)
{
int loop;

+ percpu_counter_init(&nr_dentry, 0);
/*
* A constructor could be added for stable state like the lists,
* but it is probably not worth it because of the cache nature
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0dcdd94..c5e7aa5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2216,6 +2216,8 @@ static inline void free_secdata(void *secdata)
struct ctl_table;
int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos);

int get_filesystem_list(char * buf);

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 9d048fa..eebddef 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1243,7 +1243,7 @@ static struct ctl_table fs_table[] = {
.data = &dentry_stat,
.maxlen = 6*sizeof(int),
.mode = 0444,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_nr_dentry,
},
{
.ctl_name = FS_OVERFLOWUID,


Attachments:
nr_dentry.patch (4.78 kB)

2008-11-29 08:45:20

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d0ff0b8..b591cdd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait)
unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);

wbc.nr_to_write = nr_dirty + nr_unstable +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused) +
+ (get_nr_inodes() - inodes_stat.nr_unused) +
nr_dirty + nr_unstable;
wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */
sync_sb_inodes(sb, &wbc);
diff --git a/fs/inode.c b/fs/inode.c
index 0487ddb..f94f889 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -96,9 +96,33 @@ static DEFINE_MUTEX(iprune_mutex);
* Statistics gathering..
*/
struct inodes_stat_t inodes_stat;
+static struct percpu_counter nr_inodes;

static struct kmem_cache * inode_cachep __read_mostly;

+int get_nr_inodes(void)
+{
+ return percpu_counter_sum_positive(&nr_inodes);
+}
+
+/*
+ * Handle nr_inodes sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ inodes_stat.nr_inodes = get_nr_inodes();
+ return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return -ENOSYS;
+}
+#endif
+
static void wake_up_inode(struct inode *inode)
{
/*
@@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head)
destroy_inode(inode);
nr_disposed++;
}
- spin_lock(&inode_lock);
- inodes_stat.nr_inodes -= nr_disposed;
- spin_unlock(&inode_lock);
+ percpu_counter_sub(&nr_inodes, nr_disposed);
}

/*
@@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb)

inode = alloc_inode(sb);
if (inode) {
+ percpu_counter_inc(&nr_inodes);
spin_lock(&inode_lock);
- inodes_stat.nr_inodes++;
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
inode->i_ino = ++last_ino;
@@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h
if (set(inode, data))
goto set_failed;

- inodes_stat.nr_inodes++;
+ percpu_counter_inc(&nr_inodes);
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
hlist_add_head(&inode->i_hash, head);
@@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he
old = find_inode_fast(sb, head, ino);
if (!old) {
inode->i_ino = ino;
- inodes_stat.nr_inodes++;
+ percpu_counter_inc(&nr_inodes);
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
hlist_add_head(&inode->i_hash, head);
@@ -1042,8 +1064,8 @@ void generic_delete_inode(struct inode *inode)
list_del_init(&inode->i_list);
list_del_init(&inode->i_sb_list);
inode->i_state |= I_FREEING;
- inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
+ percpu_counter_dec(&nr_inodes);

security_inode_delete(inode);

@@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode)
list_del_init(&inode->i_list);
list_del_init(&inode->i_sb_list);
inode->i_state |= I_FREEING;
- inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
+ percpu_counter_dec(&nr_inodes);
if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
@@ -1394,6 +1416,7 @@ void __init inode_init(void)
{
int loop;

+ percpu_counter_init(&nr_inodes, 0);
/* inode slab cache */
inode_cachep = kmem_cache_create("inode_cache",
sizeof(struct inode),
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c5e7aa5..2482977 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -47,6 +47,7 @@ struct inodes_stat_t {
int dummy[5]; /* padding for sysctl ABI compatibility */
};
extern struct inodes_stat_t inodes_stat;
+extern int get_nr_inodes(void);

extern int leases_enable, lease_break_time;

@@ -2218,6 +2219,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos);
int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos);

int get_filesystem_list(char * buf);

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index eebddef..eebed01 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1202,7 +1202,7 @@ static struct ctl_table fs_table[] = {
.data = &inodes_stat,
.maxlen = 2*sizeof(int),
.mode = 0444,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_nr_inodes,
},
{
.ctl_name = FS_STATINODE,
@@ -1210,7 +1210,7 @@ static struct ctl_table fs_table[] = {
.data = &inodes_stat,
.maxlen = 7*sizeof(int),
.mode = 0444,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_nr_inodes,
},
{
.procname = "file-nr",
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2970e35..a71a922 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg)
next_jif = start_jif + dirty_writeback_interval;
nr_to_write = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused);
+ (get_nr_inodes() - inodes_stat.nr_unused);
while (nr_to_write > 0) {
wbc.more_io = 0;
wbc.encountered_congestion = 0;


Attachments:
nr_inodes.patch (5.49 kB)

2008-11-29 08:46:07

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..8bf83cb 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
mnt);
}

-static int anon_inodefs_delete_dentry(struct dentry *dentry)
-{
- /*
- * We faked vfs to believe the dentry was hashed when we created it.
- * Now we restore the flag so that dput() will work correctly.
- */
- dentry->d_flags |= DCACHE_UNHASHED;
- return 1;
-}
-
static struct file_system_type anon_inode_fs_type = {
.name = "anon_inodefs",
.get_sb = anon_inodefs_get_sb,
.kill_sb = kill_anon_super,
};
static struct dentry_operations anon_inodefs_dentry_operations = {
- .d_delete = anon_inodefs_delete_dentry,
};

/**
@@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
this.name = name;
this.len = strlen(name);
this.hash = 0;
- dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+ dentry = d_alloc_single(&this, anon_inode_inode);
if (!dentry)
goto err_put_unused_fd;

@@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
atomic_inc(&anon_inode_inode->i_count);

dentry->d_op = &anon_inodefs_dentry_operations;
- /* Do not publish this dentry inside the global dentry hash table */
- dentry->d_flags &= ~DCACHE_UNHASHED;
- d_instantiate(dentry, anon_inode_inode);

error = -ENFILE;
file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index 46d5d1e..35d4a25 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry)
*/

/*
+ * special version of dput() for pipes/sockets/anon.
+ * These dentries are not present in hash table, we can avoid
+ * taking/dirtying dcache_lock
+ */
+static void dput_single(struct dentry *dentry)
+{
+ struct inode *inode;
+
+ if (!atomic_dec_and_test(&dentry->d_count))
+ return;
+ inode = dentry->d_inode;
+ if (inode)
+ iput(inode);
+ d_free(dentry);
+}
+
+/*
* dput - release a dentry
* @dentry: dentry to release
*
@@ -234,6 +251,11 @@ void dput(struct dentry *dentry)
{
if (!dentry)
return;
+ /*
+ * single dentries (sockets/pipes/anon) fast path
+ */
+ if (dentry->d_flags & DCACHE_SINGLE)
+ return dput_single(dentry);

repeat:
if (atomic_read(&dentry->d_count) == 1)
@@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode)
return res;
}

+/**
+ * d_alloc_single - allocate SINGLE dentry
+ * @name: dentry name, given in a qstr structure
+ * @inode: inode to allocate the dentry for
+ *
+ * Allocate a SINGLE dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory.
+ * - SINGLE dentries have themselves as a parent.
+ * - SINGLE dentries are not hashed into global hash table
+ * - their d_alias list is empty
+ */
+struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
+{
+ struct dentry *entry;
+
+ entry = d_alloc(NULL, name);
+ if (entry) {
+ entry->d_sb = inode->i_sb;
+ entry->d_parent = entry;
+ entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
+ entry->d_inode = inode;
+ fsnotify_d_instantiate(entry, inode);
+ security_d_instantiate(entry, inode);
+ }
+ return entry;
+}
+
+
static inline struct hlist_head *d_hash(struct dentry *parent,
unsigned long hash)
{
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4de6dd5 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode)
}

static struct vfsmount *pipe_mnt __read_mostly;
-static int pipefs_delete_dentry(struct dentry *dentry)
-{
- /*
- * At creation time, we pretended this dentry was hashed
- * (by clearing DCACHE_UNHASHED bit in d_flags)
- * At delete time, we restore the truth : not hashed.
- * (so that dput() can proceed correctly)
- */
- dentry->d_flags |= DCACHE_UNHASHED;
- return 0;
-}

/*
* pipefs_dname() is called from d_path().
@@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
}

static struct dentry_operations pipefs_dentry_operations = {
- .d_delete = pipefs_delete_dentry,
.d_dname = pipefs_dname,
};

@@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
struct inode *inode;
struct file *f;
struct dentry *dentry;
- struct qstr name = { .name = "" };
+ static const struct qstr name = { .name = "" };

err = -ENFILE;
inode = get_pipe_inode();
@@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags)
goto err;

err = -ENOMEM;
- dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+ dentry = d_alloc_single(&name, inode);
if (!dentry)
goto err_inode;

dentry->d_op = &pipefs_dentry_operations;
- /*
- * We dont want to publish this dentry into global dentry hash table.
- * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
- * This permits a working /proc/$pid/fd/XXX on pipes
- */
- dentry->d_flags &= ~DCACHE_UNHASHED;
- d_instantiate(dentry, inode);

err = -ENFILE;
f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..ca8d269 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -176,6 +176,14 @@ d_iput: no no no yes
#define DCACHE_UNHASHED 0x0010

#define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched */
+#define DCACHE_SINGLE 0x0040
+ /*
+ * socket, pipe or anonymous fd dentry
+ * - SINGLE dentries have themselves as a parent.
+ * - SINGLE dentries are not hashed into global hash table
+ * - Their d_alias list is empty
+ * - They don't need dcache_lock synchronization
+ */

extern spinlock_t dcache_lock;
extern seqlock_t rename_lock;
@@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *);
extern void shrink_dcache_parent(struct dentry *);
extern void shrink_dcache_for_umount(struct super_block *);
extern int d_invalidate(struct dentry *);
+extern struct dentry *d_alloc_single(const struct qstr *, struct inode *);

/* only used at mount-time */
extern struct dentry * d_alloc_root(struct inode *);
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..231cd66 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -307,18 +307,6 @@ static struct file_system_type sock_fs_type = {
.kill_sb = kill_anon_super,
};

-static int sockfs_delete_dentry(struct dentry *dentry)
-{
- /*
- * At creation time, we pretended this dentry was hashed
- * (by clearing DCACHE_UNHASHED bit in d_flags)
- * At delete time, we restore the truth : not hashed.
- * (so that dput() can proceed correctly)
- */
- dentry->d_flags |= DCACHE_UNHASHED;
- return 0;
-}
-
/*
* sockfs_dname() is called from d_path().
*/
@@ -329,7 +317,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
}

static struct dentry_operations sockfs_dentry_operations = {
- .d_delete = sockfs_delete_dentry,
.d_dname = sockfs_dname,
};

@@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
{
struct dentry *dentry;
- struct qstr name = { .name = "" };
+ static const struct qstr name = { .name = "" };

- dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+ dentry = d_alloc_single(&name, SOCK_INODE(sock));
if (unlikely(!dentry))
return -ENOMEM;

dentry->d_op = &sockfs_dentry_operations;
- /*
- * We dont want to push this dentry into global dentry hash table.
- * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
- * This permits a working /proc/$pid/fd/XXX on sockets
- */
- dentry->d_flags &= ~DCACHE_UNHASHED;
- d_instantiate(dentry, SOCK_INODE(sock));

sock->file = file;
init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,


Attachments:
dcache_single.patch (7.70 kB)

2008-11-29 08:45:49

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v2 3/5] fs: Introduce a per_cpu last_ino allocator

diff --git a/fs/inode.c b/fs/inode.c
index f94f889..dc8e72a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -556,6 +556,36 @@ repeat:
return node ? inode : NULL;
}

+#ifdef CONFIG_SMP
+/*
+ * Each cpu owns a range of 1024 numbers.
+ * 'shared_last_ino' is dirtied only once out of 1024 allocations,
+ * to renew the exhausted range.
+ */
+static DEFINE_PER_CPU(int, last_ino);
+
+static int last_ino_get(void)
+{
+ static atomic_t shared_last_ino;
+ int *p = &get_cpu_var(last_ino);
+ int res = *p;
+
+ if (unlikely((res & 1023) == 0))
+ res = atomic_add_return(1024, &shared_last_ino) - 1024;
+
+ *p = ++res;
+ put_cpu_var(last_ino);
+ return res;
+}
+#else
+static int last_ino_get(void)
+{
+ static int last_ino;
+
+ return ++last_ino;
+}
+#endif
+
/**
* new_inode - obtain an inode
* @sb: superblock
@@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb)
* error if st_ino won't fit in target struct field. Use 32bit counter
* here to attempt to avoid that.
*/
- static unsigned int last_ino;
struct inode * inode;

spin_lock_prefetch(&inode_lock);
@@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb)
inode = alloc_inode(sb);
if (inode) {
percpu_counter_inc(&nr_inodes);
+ inode->i_state = 0;
+ inode->i_ino = last_ino_get();
spin_lock(&inode_lock);
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
- inode->i_ino = ++last_ino;
- inode->i_state = 0;
spin_unlock(&inode_lock);
}
return inode;


Attachments:
last_ino.patch (1.48 kB)

2008-11-29 08:46:29

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v2 5/5] fs: new_inode_single() and iput_single()

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 8bf83cb..89fd36d 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd);
*/
static struct inode *anon_inode_mkinode(void)
{
- struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
+ struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb);

if (!inode)
return ERR_PTR(-ENOMEM);
diff --git a/fs/dcache.c b/fs/dcache.c
index 35d4a25..3aa9ed5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry)
return;
inode = dentry->d_inode;
if (inode)
- iput(inode);
+ iput_single(inode);
d_free(dentry);
}

diff --git a/fs/inode.c b/fs/inode.c
index dc8e72a..0fdfe1b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode)
kmem_cache_free(inode_cachep, (inode));
}

+void iput_single(struct inode *inode)
+{
+ if (atomic_dec_and_test(&inode->i_count)) {
+ destroy_inode(inode);
+ percpu_counter_dec(&nr_inodes);
+ }
+}

/*
* These are initializations that only need to be done
@@ -587,8 +594,9 @@ static int last_ino_get(void)
#endif

/**
- * new_inode - obtain an inode
+ * __new_inode - obtain an inode
* @sb: superblock
+ * @single: if true, dont link new inode in a list
*
* Allocates a new inode for given superblock. The default gfp_mask
* for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
@@ -598,7 +606,7 @@ static int last_ino_get(void)
* newly created inode's mapping
*
*/
-struct inode *new_inode(struct super_block *sb)
+struct inode *__new_inode(struct super_block *sb, int single)
{
/*
* On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
@@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb)
*/
struct inode * inode;

- spin_lock_prefetch(&inode_lock);
-
inode = alloc_inode(sb);
if (inode) {
percpu_counter_inc(&nr_inodes);
inode->i_state = 0;
inode->i_ino = last_ino_get();
- spin_lock(&inode_lock);
- list_add(&inode->i_list, &inode_in_use);
- list_add(&inode->i_sb_list, &sb->s_inodes);
- spin_unlock(&inode_lock);
+ if (single) {
+ INIT_LIST_HEAD(&inode->i_list);
+ INIT_LIST_HEAD(&inode->i_sb_list);
+ } else {
+ spin_lock(&inode_lock);
+ list_add(&inode->i_list, &inode_in_use);
+ list_add(&inode->i_sb_list, &sb->s_inodes);
+ spin_unlock(&inode_lock);
+ }
}
return inode;
}

-EXPORT_SYMBOL(new_inode);
+EXPORT_SYMBOL(__new_inode);

void unlock_new_inode(struct inode *inode)
{
diff --git a/fs/pipe.c b/fs/pipe.c
index 4de6dd5..8c51a0d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = {

static struct inode * get_pipe_inode(void)
{
- struct inode *inode = new_inode(pipe_mnt->mnt_sb);
+ struct inode *inode = new_inode_single(pipe_mnt->mnt_sb);
struct pipe_inode_info *pipe;

if (!inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2482977..b3daffc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1898,7 +1898,17 @@ extern void __iget(struct inode * inode);
extern void iget_failed(struct inode *);
extern void clear_inode(struct inode *);
extern void destroy_inode(struct inode *);
-extern struct inode *new_inode(struct super_block *);
+extern struct inode *__new_inode(struct super_block *, int);
+static inline struct inode *new_inode(struct super_block *sb)
+{
+ return __new_inode(sb, 0);
+}
+static inline struct inode *new_inode_single(struct super_block *sb)
+{
+ return __new_inode(sb, 1);
+}
+extern void iput_single(struct inode *);
+
extern int should_remove_suid(struct dentry *);
extern int file_remove_suid(struct file *);

diff --git a/net/socket.c b/net/socket.c
index 231cd66..f1e656c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -463,7 +463,7 @@ static struct socket *sock_alloc(void)
struct inode *inode;
struct socket *sock;

- inode = new_inode(sock_mnt->mnt_sb);
+ inode = new_inode_single(sock_mnt->mnt_sb);
if (!inode)
return NULL;


Attachments:
new_inode_single.patch (3.98 kB)

2008-11-29 10:39:29

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd

On Sat, 29 November 2008 09:44:23 +0100, Eric Dumazet wrote:
>
> +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
> +{
> + struct dentry *entry;
> +
> + entry = d_alloc(NULL, name);
> + if (entry) {
> + entry->d_sb = inode->i_sb;
> + entry->d_parent = entry;
> + entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
> + entry->d_inode = inode;
> + fsnotify_d_instantiate(entry, inode);
> + security_d_instantiate(entry, inode);
> + }
> + return entry;

Calling the struct dentry "entry" had me confused a bit. I believe
everyone else (including the code you removed) uses dentry.

> @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
> struct inode *inode;
> struct file *f;
> struct dentry *dentry;
> - struct qstr name = { .name = "" };
> + static const struct qstr name = { .name = "" };
>
> err = -ENFILE;
> inode = get_pipe_inode();
...
> @@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
> static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
> {
> struct dentry *dentry;
> - struct qstr name = { .name = "" };
> + static const struct qstr name = { .name = "" };

These two could even be combined.
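
Something like this, say (untested; the name and its placement are my
invention):

	/* fs/dcache.c */
	const struct qstr anon_name = { .name = "" };
	EXPORT_SYMBOL(anon_name);

	/* include/linux/dcache.h */
	extern const struct qstr anon_name;

Then create_write_pipe() and sock_attach_fd() can both pass &anon_name
to d_alloc_single() and drop their local copies.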

And of course I realize that I comment on absolute trivialities. On the
whole, I couldn't spot a real problem in your patches.

Jörn

--
Public Domain - Free as in Beer
General Public - Free as in Speech
BSD License - Free as in Enterprise
Shared Source - Free as in "Work will make you..."

2008-11-29 11:15:13

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd

Jörn Engel a écrit :
> On Sat, 29 November 2008 09:44:23 +0100, Eric Dumazet wrote:
>> +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
>> +{
>> + struct dentry *entry;
>> +
>> + entry = d_alloc(NULL, name);
>> + if (entry) {
>> + entry->d_sb = inode->i_sb;
>> + entry->d_parent = entry;
>> + entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
>> + entry->d_inode = inode;
>> + fsnotify_d_instantiate(entry, inode);
>> + security_d_instantiate(entry, inode);
>> + }
>> + return entry;
>
> Calling the struct dentry "entry" had me confused a bit. I believe
> everyone else (including the code you removed) uses dentry.

Ah yes, it seems I took it from d_instantiate(); I guess a cleanup
patch would be nice.

>
>> @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
>> struct inode *inode;
>> struct file *f;
>> struct dentry *dentry;
>> - struct qstr name = { .name = "" };
>> + static const struct qstr name = { .name = "" };
>>
>> err = -ENFILE;
>> inode = get_pipe_inode();
> ...
>> @@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
>> static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
>> {
>> struct dentry *dentry;
>> - struct qstr name = { .name = "" };
>> + static const struct qstr name = { .name = "" };
>
> These two could even be combined.
>
> And of course I realize that I comment on absolute trivialities. On the
> whole, I couldn't spot a real problem in your patches.

Well, at least you reviewed it, that's the important point!

Thanks Jörn

2008-11-29 11:15:40

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH v2 5/5] fs: new_inode_single() and iput_single()

On Sat, 29 November 2008 09:45:09 +0100, Eric Dumazet wrote:
>
> +void iput_single(struct inode *inode)
> +{
> + if (atomic_dec_and_test(&inode->i_count)) {
> + destroy_inode(inode);
> + percpu_counter_dec(&nr_inodes);
> + }
> +}

I wonder if it is possible to avoid the atomic_dec_and_test() here, at
least in the common case, and combine it with the atomic_dec_and_test()
of the dentry. A quick look at fs/inode.c indicates that inode->i_count
may never get changed for a SINGLE inode, except during creation or
deletion.

It might be worth it to
- remove the conditional from iput_single() and measure whether it makes a
difference,
- poison SINGLE inodes with some value and
- put a BUG_ON() in __iget() that checks for the poison value.

I _think_ the BUG_ON() is unnecessary, but at least my brain is not
sufficient to convince me. Can inotify somehow get a hold of a socket?
Or dquot (how insane would that be?)
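
Roughly like this, say (untested sketch; the poison value and its
placement are my invention):

	#define SINGLE_INODE_POISON	0x40000000

	/* at creation time, in new_inode_single() */
	atomic_set(&inode->i_count, SINGLE_INODE_POISON);

	/* in __iget() */
	BUG_ON(atomic_read(&inode->i_count) & SINGLE_INODE_POISON);

	/* iput_single() could then free unconditionally, since the
	 * dentry holds the only reference to a SINGLE inode */
	void iput_single(struct inode *inode)
	{
		destroy_inode(inode);
		percpu_counter_dec(&nr_inodes);
	}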

Jörn

--
Mac is for working,
Linux is for Networking,
Windows is for Solitaire!
-- stolen from dc

2008-12-12 01:51:39

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

On Friday 12 December 2008 09:40, Eric Dumazet wrote:
> From: Christoph Lameter <[email protected]>
>
> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU
>
> Currently we schedule RCU frees for each file we free separately. That has
> several drawbacks against the earlier file handling (in 2.6.5 f.e.), which
> did not require RCU callbacks:
>
> 1. Excessive number of RCU callbacks can be generated causing long RCU
> queues that in turn cause long latencies. We hit SLUB page allocation
> more often than necessary.
>
> 2. The cache hot object is not preserved between free and realloc. A close
> followed by another open is very fast with the RCUless approach because
> the last freed object is returned by the slab allocator that is
> still cache hot. RCU free means that the object is not immediately
> available again. The new object is cache cold and therefore open/close
> performance tests show a significant degradation with the RCU
> implementation.
>
> One solution to this problem is to move the RCU freeing into the Slab
> allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
> time. The slab allocator will do RCU frees only when it is necessary
> to dispose of slabs of objects (rare). So with that approach we can cut
> out the RCU overhead significantly.
>
> However, the slab allocator may return the object for another use even
> before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
> there is the (unlikely) possibility that the object is going to be
> switched under us in sections protected by rcu_read_lock() and
> rcu_read_unlock(). So we need to verify that we have acquired the correct
> object after establishing a stable object reference (incrementing the
> refcounter does that).
>
>
> Signed-off-by: Christoph Lameter <[email protected]>
> Signed-off-by: Eric Dumazet <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> Documentation/filesystems/files.txt | 21 ++++++++++++++--
> fs/file_table.c | 33 ++++++++++++++++++--------
> include/linux/fs.h | 5 ---
> 3 files changed, 42 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/filesystems/files.txt
> b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644
> --- a/Documentation/filesystems/files.txt
> +++ b/Documentation/filesystems/files.txt
> @@ -78,13 +78,28 @@ the fdtable structure -
> that look-up may race with the last put() operation on the
> file structure. This is avoided using atomic_long_inc_not_zero()
> on ->f_count :
> + As file structures are allocated with SLAB_DESTROY_BY_RCU,
> + they can also be freed before a RCU grace period, and reused,
> + but still as a struct file.
> + It is necessary to check again after getting
> + a stable reference (ie after atomic_long_inc_not_zero()),
> + that fcheck_files(files, fd) points to the same file.
>
> rcu_read_lock();
> file = fcheck_files(files, fd);
> if (file) {
> - if (atomic_long_inc_not_zero(&file->f_count))
> + if (atomic_long_inc_not_zero(&file->f_count)) {
> *fput_needed = 1;
> - else
> + /*
> + * Now we have a stable reference to an object.
> + * Check if other threads freed file and reallocated it.
> + */
> + if (file != fcheck_files(files, fd)) {
> + *fput_needed = 0;
> + put_filp(file);
> + file = NULL;
> + }
> + } else
> /* Didn't get the reference, someone's freed */
> file = NULL;
> }
> @@ -95,6 +110,8 @@ the fdtable structure -
> atomic_long_inc_not_zero() detects if refcounts is already zero or
> goes to zero during increment. If it does, we fail
> fget()/fget_light().
> + The second call to fcheck_files(files, fd) checks that this filp
> + was not freed, then reused by another thread.
>
> 6. Since both fdtable and file structures can be looked up
> lock-free, they must be installed using rcu_assign_pointer()
> diff --git a/fs/file_table.c b/fs/file_table.c
> index a46e880..3e9259d 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
>
> static struct percpu_counter nr_files __cacheline_aligned_in_smp;
>
> -static inline void file_free_rcu(struct rcu_head *head)
> -{
> - struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
> - kmem_cache_free(filp_cachep, f);
> -}
> -
> static inline void file_free(struct file *f)
> {
> percpu_counter_dec(&nr_files);
> file_check_state(f);
> - call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> + kmem_cache_free(filp_cachep, f);
> }
>
> /*
> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
> rcu_read_unlock();
> return NULL;
> }
> + /*
> + * Now we have a stable reference to an object.
> + * Check if other threads freed file and re-allocated it.
> + */
> + if (unlikely(file != fcheck_files(files, fd))) {
> + put_filp(file);
> + file = NULL;
> + }

This is a non-trivial change, because that put_filp may drop the last
reference to the file. So now we have the case where we free the file
from a context in which it had never been allocated.

From a quick glance through the call chains, I can't see an obvious
problem. But it needs documentation in put_filp, or at least
a mention in the changelog, and it should also be cc'ed to the security lists.

Also, it adds code and cost to the get/put path in return for an
improvement in the free path. get/put is the more common path, but
it is a small loss for a big improvement, so it might be worth it. It
is not justified by your microbenchmark alone, though. Do we have a more
useful case that it helps?
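
To spell out the new case (my reading of the code, not something the
patch states):

	CPU0					CPU1
	rcu_read_lock()
	file = fcheck_files(files, fd)
						final fput() frees file
						slab recycles the object as
						a new struct file elsewhere
	atomic_long_inc_not_zero(&file->f_count)
		/* succeeds, but on the recycled file */
	file != fcheck_files(files, fd)
		/* revalidation detects the swap */
	put_filp(file)
		/* may now drop the last reference to a file this
		   context never really owned */

That put_filp() call is the part that needs auditing.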

2008-12-12 02:02:39

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry

On Friday 12 December 2008 09:38, Eric Dumazet wrote:
> Adding a percpu_counter nr_dentry avoids cache line ping pongs
> between cpus to maintain this metric, and dcache_lock is
> no longer needed to protect dentry_stat.nr_dentry
>
> We centralize nr_dentry updates at the right place:
> - increments in d_alloc()
> - decrements in d_free()
>
> d_alloc() can avoid taking dcache_lock if parent is NULL
>
> ("socketallocbench -n8" result : 27.5s to 25s)

Seems like a good idea.


> @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct
> dentry *dentry) * otherwise we ascend to the parent and move to the
> * next sibling if there is one */
> if (!parent)
> - goto out;
> + return;
>
> dentry = parent;
>

Andrew doesn't like returns from the middle of a function.

2008-12-12 02:08:49

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes

On Friday 12 December 2008 09:39, Eric Dumazet wrote:
> Avoids cache line ping pongs between cpus and prepares the next patch,
> because updates of nr_inodes don't need inode_lock anymore.
>
> (socket8 bench result: no difference at this point)

Looks good.

But... if we never actually need fast access to the approximate
total (which seems to apply to this and the previous patch), we
could use something much simpler that does not have the spinlock
or all the batching machinery that percpu counters have. I'd prefer
that because it will be faster in a straight line...

(BTW. percpu counters can't be used in interrupt context? That's
nice.)
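
For the dcache case the simple scheme could be just (untested sketch;
it would replace the percpu_counter, and the read side sums all cpus
when the sysctl is read):

	static DEFINE_PER_CPU(long, nr_dentry);

	static inline void nr_dentry_inc(void)
	{
		get_cpu_var(nr_dentry)++;
		put_cpu_var(nr_dentry);
	}

	static long nr_dentry_sum(void)
	{
		long sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			sum += per_cpu(nr_dentry, cpu);
		return sum;
	}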

2008-12-12 02:12:32

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator

On Friday 12 December 2008 09:39, Eric Dumazet wrote:
> new_inode() dirties a contended cache line to get increasing
> inode numbers.
>
> Solve this problem by providing each cpu with a per_cpu variable,
> fed by the shared last_ino, but refilled only once every 1024 allocations.
>
> This reduces contention on the shared last_ino, and gives the same
> spread of inode numbers as before.
> (same wraparound after 2^32 allocations)

I don't suppose this would cause any filesystems to do silly
things?

Seems like a good idea, if you could just add a #define instead
of 1024.
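
I.e. something on the order of (untested; LAST_INO_BATCH is the name I
would pick):

	#define LAST_INO_BATCH 1024

	static DEFINE_PER_CPU(unsigned int, last_ino);

	static unsigned int last_ino_get(void)
	{
		static atomic_t shared_last_ino;
		unsigned int *p = &get_cpu_var(last_ino);
		unsigned int res = *p;

		if (unlikely((res % LAST_INO_BATCH) == 0))
			res = atomic_add_return(LAST_INO_BATCH,
					&shared_last_ino) - LAST_INO_BATCH;

		*p = ++res;
		put_cpu_var(last_ino);
		return res;
	}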

>
> Signed-off-by: Eric Dumazet <[email protected]>
> ---
> fs/inode.c | 35 ++++++++++++++++++++++++++++++++---
> 1 files changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index f94f889..dc8e72a 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -556,6 +556,36 @@ repeat:
> return node ? inode : NULL;
> }
>
> +#ifdef CONFIG_SMP
> +/*
> + * Each cpu owns a range of 1024 numbers.
> + * 'shared_last_ino' is dirtied only once out of 1024 allocations,
> + * to renew the exhausted range.
> + */
> +static DEFINE_PER_CPU(int, last_ino);
> +
> +static int last_ino_get(void)
> +{
> + static atomic_t shared_last_ino;
> + int *p = &get_cpu_var(last_ino);
> + int res = *p;
> +
> + if (unlikely((res & 1023) == 0))
> + res = atomic_add_return(1024, &shared_last_ino) - 1024;
> +
> + *p = ++res;
> + put_cpu_var(last_ino);
> + return res;
> +}
> +#else
> +static int last_ino_get(void)
> +{
> + static int last_ino;
> +
> + return ++last_ino;
> +}
> +#endif
> +
> /**
> * new_inode - obtain an inode
> * @sb: superblock
> @@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb)
> * error if st_ino won't fit in target struct field. Use 32bit counter
> * here to attempt to avoid that.
> */
> - static unsigned int last_ino;
> struct inode * inode;
>
> spin_lock_prefetch(&inode_lock);
> @@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb)
> inode = alloc_inode(sb);
> if (inode) {
> percpu_counter_inc(&nr_inodes);
> + inode->i_state = 0;
> + inode->i_ino = last_ino_get();
> spin_lock(&inode_lock);
> list_add(&inode->i_list, &inode_in_use);
> list_add(&inode->i_sb_list, &sb->s_inodes);
> - inode->i_ino = ++last_ino;
> - inode->i_state = 0;
> spin_unlock(&inode_lock);
> }
> return inode;

2008-12-11 22:39:56

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no longer needed to protect dentry_stat.nr_dentry

We centralize nr_dentry updates at the right place:
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if parent is NULL

("socketallocbench -n8" result : 27.5s to 25s)

Signed-off-by: Eric Dumazet <[email protected]>
---
fs/dcache.c | 49 +++++++++++++++++++++++++------------------
include/linux/fs.h | 2 +
kernel/sysctl.c | 2 -
3 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fa1ba03..f463a81 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly;
static unsigned int d_hash_mask __read_mostly;
static unsigned int d_hash_shift __read_mostly;
static struct hlist_head *dentry_hashtable __read_mostly;
+static struct percpu_counter nr_dentry;

/* Statistics gathering. */
struct dentry_stat_t dentry_stat = {
.age_limit = 45,
};

+/*
+ * Handle nr_dentry sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry);
+ return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return -ENOSYS;
+}
+#endif
+
static void __d_free(struct dentry *dentry)
{
WARN_ON(!list_empty(&dentry->d_alias));
@@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head)
}

/*
- * no dcache_lock, please. The caller must decrement dentry_stat.nr_dentry
- * inside dcache_lock.
+ * no dcache_lock, please.
*/
static void d_free(struct dentry *dentry)
{
@@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry)
__d_free(dentry);
else
call_rcu(&dentry->d_u.d_rcu, d_callback);
+ percpu_counter_dec(&nr_dentry);
}

/*
@@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry)
struct dentry *parent;

list_del(&dentry->d_u.d_child);
- dentry_stat.nr_dentry--; /* For d_free, below */
/*drops the locks, at that point nobody can reach this dentry */
dentry_iput(dentry);
if (IS_ROOT(dentry))
@@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb)
static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
{
struct dentry *parent;
- unsigned detached = 0;

BUG_ON(!IS_ROOT(dentry));

@@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
}

list_del(&dentry->d_u.d_child);
- detached++;

inode = dentry->d_inode;
if (inode) {
@@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
* otherwise we ascend to the parent and move to the
* next sibling if there is one */
if (!parent)
- goto out;
+ return;

dentry = parent;

@@ -705,11 +721,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
dentry = list_entry(dentry->d_subdirs.next,
struct dentry, d_u.d_child);
}
-out:
- /* several dentries were freed, need to correct nr_dentry */
- spin_lock(&dcache_lock);
- dentry_stat.nr_dentry -= detached;
- spin_unlock(&dcache_lock);
}

/*
@@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
dentry->d_flags = DCACHE_UNHASHED;
spin_lock_init(&dentry->d_lock);
dentry->d_inode = NULL;
- dentry->d_parent = NULL;
- dentry->d_sb = NULL;
dentry->d_op = NULL;
dentry->d_fsdata = NULL;
dentry->d_mounted = 0;
@@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
if (parent) {
dentry->d_parent = dget(parent);
dentry->d_sb = parent->d_sb;
+ spin_lock(&dcache_lock);
+ list_add(&dentry->d_u.d_child, &parent->d_subdirs);
+ spin_unlock(&dcache_lock);
} else {
+ dentry->d_parent = NULL;
+ dentry->d_sb = NULL;
INIT_LIST_HEAD(&dentry->d_u.d_child);
}
-
- spin_lock(&dcache_lock);
- if (parent)
- list_add(&dentry->d_u.d_child, &parent->d_subdirs);
- dentry_stat.nr_dentry++;
- spin_unlock(&dcache_lock);
-
+ percpu_counter_inc(&nr_dentry);
return dentry;
}

@@ -2282,6 +2290,7 @@ static void __init dcache_init(void)
{
int loop;

+ percpu_counter_init(&nr_dentry, 0);
/*
* A constructor could be added for stable state like the lists,
* but it is probably not worth it because of the cache nature
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4a853ef..114cb65 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2217,6 +2217,8 @@ static inline void free_secdata(void *secdata)
struct ctl_table;
int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos);

int get_filesystem_list(char * buf);

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3d56fe7..777bee7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1246,7 +1246,7 @@ static struct ctl_table fs_table[] = {
.data = &dentry_stat,
.maxlen = 6*sizeof(int),
.mode = 0444,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_nr_dentry,
},
{
.ctl_name = FS_OVERFLOWUID,

2008-12-11 22:40:39

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes

Avoids cache line ping pongs between cpus and prepares the next patch,
because updates of nr_inodes don't need inode_lock anymore.

(socket8 bench result: no difference at this point)

Signed-off-by: Eric Dumazet <[email protected]>
---
fs/fs-writeback.c | 2 +-
fs/inode.c | 39 +++++++++++++++++++++++++++++++--------
include/linux/fs.h | 3 +++
kernel/sysctl.c | 4 ++--
mm/page-writeback.c | 2 +-
5 files changed, 38 insertions(+), 12 deletions(-)


diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d0ff0b8..b591cdd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait)
unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);

wbc.nr_to_write = nr_dirty + nr_unstable +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused) +
+ (get_nr_inodes() - inodes_stat.nr_unused) +
nr_dirty + nr_unstable;
wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */
sync_sb_inodes(sb, &wbc);
diff --git a/fs/inode.c b/fs/inode.c
index 0487ddb..f94f889 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -96,9 +96,33 @@ static DEFINE_MUTEX(iprune_mutex);
* Statistics gathering..
*/
struct inodes_stat_t inodes_stat;
+static struct percpu_counter nr_inodes;

static struct kmem_cache * inode_cachep __read_mostly;

+int get_nr_inodes(void)
+{
+ return percpu_counter_sum_positive(&nr_inodes);
+}
+
+/*
+ * Handle nr_inodes sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ inodes_stat.nr_inodes = get_nr_inodes();
+ return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ return -ENOSYS;
+}
+#endif
+
static void wake_up_inode(struct inode *inode)
{
/*
@@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head)
destroy_inode(inode);
nr_disposed++;
}
- spin_lock(&inode_lock);
- inodes_stat.nr_inodes -= nr_disposed;
- spin_unlock(&inode_lock);
+ percpu_counter_sub(&nr_inodes, nr_disposed);
}

/*
@@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb)

inode = alloc_inode(sb);
if (inode) {
+ percpu_counter_inc(&nr_inodes);
spin_lock(&inode_lock);
- inodes_stat.nr_inodes++;
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
inode->i_ino = ++last_ino;
@@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h
if (set(inode, data))
goto set_failed;

- inodes_stat.nr_inodes++;
+ percpu_counter_inc(&nr_inodes);
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
hlist_add_head(&inode->i_hash, head);
@@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he
old = find_inode_fast(sb, head, ino);
if (!old) {
inode->i_ino = ino;
- inodes_stat.nr_inodes++;
+ percpu_counter_inc(&nr_inodes);
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
hlist_add_head(&inode->i_hash, head);
@@ -1042,8 +1064,8 @@ void generic_delete_inode(struct inode *inode)
list_del_init(&inode->i_list);
list_del_init(&inode->i_sb_list);
inode->i_state |= I_FREEING;
- inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
+ percpu_counter_dec(&nr_inodes);

security_inode_delete(inode);

@@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode)
list_del_init(&inode->i_list);
list_del_init(&inode->i_sb_list);
inode->i_state |= I_FREEING;
- inodes_stat.nr_inodes--;
spin_unlock(&inode_lock);
+ percpu_counter_dec(&nr_inodes);
if (inode->i_data.nrpages)
truncate_inode_pages(&inode->i_data, 0);
clear_inode(inode);
@@ -1394,6 +1416,7 @@ void __init inode_init(void)
{
int loop;

+ percpu_counter_init(&nr_inodes, 0);
/* inode slab cache */
inode_cachep = kmem_cache_create("inode_cache",
sizeof(struct inode),
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 114cb65..a789346 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -47,6 +47,7 @@ struct inodes_stat_t {
int dummy[5]; /* padding for sysctl ABI compatibility */
};
extern struct inodes_stat_t inodes_stat;
+extern int get_nr_inodes(void);

extern int leases_enable, lease_break_time;

@@ -2219,6 +2220,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos);
int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
void __user *buffer, size_t *lenp, loff_t *ppos);
+int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos);

int get_filesystem_list(char * buf);

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 777bee7..b705f3a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1205,7 +1205,7 @@ static struct ctl_table fs_table[] = {
.data = &inodes_stat,
.maxlen = 2*sizeof(int),
.mode = 0444,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_nr_inodes,
},
{
.ctl_name = FS_STATINODE,
@@ -1213,7 +1213,7 @@ static struct ctl_table fs_table[] = {
.data = &inodes_stat,
.maxlen = 7*sizeof(int),
.mode = 0444,
- .proc_handler = &proc_dointvec,
+ .proc_handler = &proc_nr_inodes,
},
{
.procname = "file-nr",
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2970e35..a71a922 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg)
next_jif = start_jif + dirty_writeback_interval;
nr_to_write = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) +
- (inodes_stat.nr_inodes - inodes_stat.nr_unused);
+ (get_nr_inodes() - inodes_stat.nr_unused);
while (nr_to_write > 0) {
wbc.more_io = 0;
wbc.encountered_congestion = 0;

2008-12-11 22:41:48

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into the dcache,
since there is no way someone can look up such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries is expensive,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.


Differences between a SINGLE dentry and a normal one are:

1) SINGLE dentry has the DCACHE_SINGLE flag
2) A SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
This is to avoid taking a reference on the sb 'root' dentry, which is
shared by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

("socketallocbench -n 8" bench result : from 25s to 19.9s)

Signed-off-by: Eric Dumazet <[email protected]>
---
fs/anon_inodes.c | 16 ------------
fs/dcache.c | 51 +++++++++++++++++++++++++++++++++++++++
fs/pipe.c | 23 +----------------
include/linux/dcache.h | 9 ++++++
net/socket.c | 24 +-----------------
5 files changed, 65 insertions(+), 58 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..8bf83cb 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
mnt);
}

-static int anon_inodefs_delete_dentry(struct dentry *dentry)
-{
- /*
- * We faked vfs to believe the dentry was hashed when we created it.
- * Now we restore the flag so that dput() will work correctly.
- */
- dentry->d_flags |= DCACHE_UNHASHED;
- return 1;
-}
-
static struct file_system_type anon_inode_fs_type = {
.name = "anon_inodefs",
.get_sb = anon_inodefs_get_sb,
.kill_sb = kill_anon_super,
};
static struct dentry_operations anon_inodefs_dentry_operations = {
- .d_delete = anon_inodefs_delete_dentry,
};

/**
@@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
this.name = name;
this.len = strlen(name);
this.hash = 0;
- dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+ dentry = d_alloc_single(&this, anon_inode_inode);
if (!dentry)
goto err_put_unused_fd;

@@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
atomic_inc(&anon_inode_inode->i_count);

dentry->d_op = &anon_inodefs_dentry_operations;
- /* Do not publish this dentry inside the global dentry hash table */
- dentry->d_flags &= ~DCACHE_UNHASHED;
- d_instantiate(dentry, anon_inode_inode);

error = -ENFILE;
file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index f463a81..af3bfb3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry)
*/

/*
+ * special version of dput() for pipes/sockets/anon.
+ * These dentries are not present in hash table, we can avoid
+ * taking/dirtying dcache_lock
+ */
+static void dput_single(struct dentry *dentry)
+{
+ struct inode *inode;
+
+ if (!atomic_dec_and_test(&dentry->d_count))
+ return;
+ inode = dentry->d_inode;
+ if (inode)
+ iput(inode);
+ d_free(dentry);
+}
+
+/*
* dput - release a dentry
* @dentry: dentry to release
*
@@ -234,6 +251,11 @@ void dput(struct dentry *dentry)
{
if (!dentry)
return;
+ /*
+ * single dentries (sockets/pipes/anon) fast path
+ */
+ if (dentry->d_flags & DCACHE_SINGLE)
+ return dput_single(dentry);

repeat:
if (atomic_read(&dentry->d_count) == 1)
@@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode)
return res;
}

+/**
+ * d_alloc_single - allocate SINGLE dentry
+ * @name: dentry name, given in a qstr structure
+ * @inode: inode to allocate the dentry for
+ *
+ * Allocate an SINGLE dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory.
+ * - SINGLE dentries have themselves as a parent.
+ * - SINGLE dentries are not hashed into global hash table
+ * - their d_alias list is empty
+ */
+struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
+{
+ struct dentry *entry;
+
+ entry = d_alloc(NULL, name);
+ if (entry) {
+ entry->d_sb = inode->i_sb;
+ entry->d_parent = entry;
+ entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
+ entry->d_inode = inode;
+ fsnotify_d_instantiate(entry, inode);
+ security_d_instantiate(entry, inode);
+ }
+ return entry;
+}
+
+
static inline struct hlist_head *d_hash(struct dentry *parent,
unsigned long hash)
{
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4de6dd5 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode)
}

static struct vfsmount *pipe_mnt __read_mostly;
-static int pipefs_delete_dentry(struct dentry *dentry)
-{
- /*
- * At creation time, we pretended this dentry was hashed
- * (by clearing DCACHE_UNHASHED bit in d_flags)
- * At delete time, we restore the truth : not hashed.
- * (so that dput() can proceed correctly)
- */
- dentry->d_flags |= DCACHE_UNHASHED;
- return 0;
-}

/*
* pipefs_dname() is called from d_path().
@@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
}

static struct dentry_operations pipefs_dentry_operations = {
- .d_delete = pipefs_delete_dentry,
.d_dname = pipefs_dname,
};

@@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
struct inode *inode;
struct file *f;
struct dentry *dentry;
- struct qstr name = { .name = "" };
+ static const struct qstr name = { .name = "" };

err = -ENFILE;
inode = get_pipe_inode();
@@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags)
goto err;

err = -ENOMEM;
- dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+ dentry = d_alloc_single(&name, inode);
if (!dentry)
goto err_inode;

dentry->d_op = &pipefs_dentry_operations;
- /*
- * We dont want to publish this dentry into global dentry hash table.
- * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
- * This permits a working /proc/$pid/fd/XXX on pipes
- */
- dentry->d_flags &= ~DCACHE_UNHASHED;
- d_instantiate(dentry, inode);

err = -ENFILE;
f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..ca8d269 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -176,6 +176,14 @@ d_iput: no no no yes
#define DCACHE_UNHASHED 0x0010

#define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched */
+#define DCACHE_SINGLE 0x0040
+ /*
+ * socket, pipe or anonymous fd dentry
+ * - SINGLE dentries have themselves as a parent.
+ * - SINGLE dentries are not hashed into global hash table
+ * - Their d_alias list is empty
+ * - They dont need dcache_lock synchronization
+ */

extern spinlock_t dcache_lock;
extern seqlock_t rename_lock;
@@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *);
extern void shrink_dcache_parent(struct dentry *);
extern void shrink_dcache_for_umount(struct super_block *);
extern int d_invalidate(struct dentry *);
+extern struct dentry *d_alloc_single(const struct qstr *, struct inode *);

/* only used at mount-time */
extern struct dentry * d_alloc_root(struct inode *);
diff --git a/net/socket.c b/net/socket.c
index 92764d8..353c928 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -308,18 +308,6 @@ static struct file_system_type sock_fs_type = {
.kill_sb = kill_anon_super,
};

-static int sockfs_delete_dentry(struct dentry *dentry)
-{
- /*
- * At creation time, we pretended this dentry was hashed
- * (by clearing DCACHE_UNHASHED bit in d_flags)
- * At delete time, we restore the truth : not hashed.
- * (so that dput() can proceed correctly)
- */
- dentry->d_flags |= DCACHE_UNHASHED;
- return 0;
-}
-
/*
* sockfs_dname() is called from d_path().
*/
@@ -330,7 +318,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
}

static struct dentry_operations sockfs_dentry_operations = {
- .d_delete = sockfs_delete_dentry,
.d_dname = sockfs_dname,
};

@@ -372,20 +359,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
{
struct dentry *dentry;
- struct qstr name = { .name = "" };
+ static const struct qstr name = { .name = "" };

- dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+ dentry = d_alloc_single(&name, SOCK_INODE(sock));
if (unlikely(!dentry))
return -ENOMEM;

dentry->d_op = &sockfs_dentry_operations;
- /*
- * We dont want to push this dentry into global dentry hash table.
- * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
- * This permits a working /proc/$pid/fd/XXX on sockets
- */
- dentry->d_flags &= ~DCACHE_UNHASHED;
- d_instantiate(dentry, SOCK_INODE(sock));

sock->file = file;
init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

2008-12-11 22:41:18

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by providing each cpu with a per_cpu variable,
fed by the shared last_ino, but refilled only once every 1024 allocations.

This reduces contention on the shared last_ino, and gives the same
spread of inode numbers as before.
(same wraparound after 2^32 allocations)

Signed-off-by: Eric Dumazet <[email protected]>
---
fs/inode.c | 35 ++++++++++++++++++++++++++++++++---
1 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index f94f889..dc8e72a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -556,6 +556,36 @@ repeat:
return node ? inode : NULL;
}

+#ifdef CONFIG_SMP
+/*
+ * Each cpu owns a range of 1024 numbers.
+ * 'shared_last_ino' is dirtied only once out of 1024 allocations,
+ * to renew the exhausted range.
+ */
+static DEFINE_PER_CPU(int, last_ino);
+
+static int last_ino_get(void)
+{
+ static atomic_t shared_last_ino;
+ int *p = &get_cpu_var(last_ino);
+ int res = *p;
+
+ if (unlikely((res & 1023) == 0))
+ res = atomic_add_return(1024, &shared_last_ino) - 1024;
+
+ *p = ++res;
+ put_cpu_var(last_ino);
+ return res;
+}
+#else
+static int last_ino_get(void)
+{
+ static int last_ino;
+
+ return ++last_ino;
+}
+#endif
+
/**
* new_inode - obtain an inode
* @sb: superblock
@@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb)
* error if st_ino won't fit in target struct field. Use 32bit counter
* here to attempt to avoid that.
*/
- static unsigned int last_ino;
struct inode * inode;

spin_lock_prefetch(&inode_lock);
@@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb)
inode = alloc_inode(sb);
if (inode) {
percpu_counter_inc(&nr_inodes);
+ inode->i_state = 0;
+ inode->i_ino = last_ino_get();
spin_lock(&inode_lock);
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_sb_list, &sb->s_inodes);
- inode->i_ino = ++last_ino;
- inode->i_state = 0;
spin_unlock(&inode_lock);
}
return inode;

2008-12-11 22:42:15

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v3 0/7] fs: Scalability of sockets/pipes allocation/deallocation on SMP

Hi Andrew

Since v2 of this patch series got no new feedback, maybe it's time for mm
inclusion for a while?

In this third version I added the last two patches, one initially from Christoph
Lameter, and one to avoid dirtying mnt->mnt_count on hardwired fs.

Many thanks to Christoph and Paul for the SLAB_DESTROY_BY_RCU work done
on "struct file".

Thank you

Short summary: nice speedups for allocation/deallocation of sockets/pipes
(from 27.5 seconds to 1.62 s, on an 8-cpu machine)

Long version:

To allocate a socket or a pipe, we:

0) Do the usual file table manipulation (pretty scalable these days,
but it would be faster if 'struct file' used SLAB_DESTROY_BY_RCU
and avoided the call_rcu() cache killer). This point is addressed by
the 6th patch.

1) allocate an inode with new_inode()
This function:
- locks inode_lock,
- dirties the nr_inodes counter
- dirties the inode_in_use list (for sockets/pipes, this is useless)
- dirties the superblock's s_inodes list
- dirties the last_ino counter
All of these are in different cache lines, unfortunately.

2) allocate a dentry
d_alloc() takes dcache_lock,
inserts the dentry on its parent's list (dirtying sock_mnt->mnt_sb->s_root),
and dirties nr_dentry

3) d_instantiate() dentry (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount


At close() time, we must undo all of this. It's even more expensive because
of the _atomic_dec_and_lock(), which causes a lot of stress, and because of
the two cache lines that are touched when an element is deleted from a list
(in the previous and next items).
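
(For reference, list deletion boils down to __list_del() from
include/linux/list.h; the comments are mine:

	static inline void __list_del(struct list_head *prev,
				      struct list_head *next)
	{
		next->prev = prev;	/* dirties the next element's line */
		prev->next = next;	/* dirties the prev element's line */
	}

The two stores are exactly the two extra cache lines.)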

This is really bad, since sockets/pipes don't need to be visible in the
dcache or in a per-superblock inode list.

This patch series gets rid of all but one of the contended cache lines for
sockets, pipes and anonymous fds (signalfd, timerfd, ...)

socketallocbench is a very simple program (attached to this mail) that runs
a loop:

for (i = 0; i < 1000000; i++)
close(socket(AF_INET, SOCK_STREAM, 0));

Cost if one cpu runs the program:

real 1.561s
user 0.092s
sys 1.469s

Cost if 8 processes are launched on an 8 CPU machine
(socketallocbench -n 8):

real 27.496s <<<< !!!! >>>>
user 0.657s
sys 3m39.092s

Oprofile results (for the 8-process run, repeated 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
3347352 3347352 28.0232 28.0232 _atomic_dec_and_lock
3301428 6648780 27.6388 55.6620 d_instantiate
2971130 9619910 24.8736 80.5355 d_alloc
241318 9861228 2.0203 82.5558 init_file
146190 10007418 1.2239 83.7797 __slab_free
144149 10151567 1.2068 84.9864 inotify_d_instantiate
143971 10295538 1.2053 86.1917 inet_create
137168 10432706 1.1483 87.3401 new_inode
117549 10550255 0.9841 88.3242 add_partial
110795 10661050 0.9275 89.2517 generic_drop_inode
107137 10768187 0.8969 90.1486 kmem_cache_alloc
94029 10862216 0.7872 90.9358 tcp_close
82837 10945053 0.6935 91.6293 dput
67486 11012539 0.5650 92.1943 dentry_iput
57751 11070290 0.4835 92.6778 iput
54327 11124617 0.4548 93.1326 tcp_v4_init_sock
49921 11174538 0.4179 93.5505 sysenter_past_esp
47616 11222154 0.3986 93.9491 kmem_cache_free
30792 11252946 0.2578 94.2069 clear_inode
27540 11280486 0.2306 94.4375 copy_from_user
26509 11306995 0.2219 94.6594 init_timer
26363 11333358 0.2207 94.8801 discard_slab
25284 11358642 0.2117 95.0918 __fput
22482 11381124 0.1882 95.2800 __percpu_counter_add
20369 11401493 0.1705 95.4505 sock_alloc
18501 11419994 0.1549 95.6054 inet_csk_destroy_sock
17923 11437917 0.1500 95.7555 sys_close


This patch series avoids all contended cache lines and makes this "bench"
pretty fast.


New cost if run on one cpu:

real 1.245s (instead of 1.561s)
user 0.074s
sys 1.161s


If run on 8 CPUs:

real 1.624s
user 0.580s
sys 12.296s


In oprofile, we can finally see network code at the front of the
expensive items (with the exception of kmem_cache_[z]alloc(): it has to
clear the 192 bytes of each file structure, which takes half of its time).

CPU: Core 2, speed 3000.09 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum. % symbol name
176586 176586 10.9376 10.9376 kmem_cache_alloc
169838 346424 10.5196 21.4572 tcp_close
105331 451755 6.5241 27.9813 tcp_v4_init_sock
105146 556901 6.5126 34.4939 tcp_v4_destroy_sock
83307 640208 5.1600 39.6539 sysenter_past_esp
80241 720449 4.9701 44.6239 inet_csk_destroy_sock
74263 794712 4.5998 49.2237 kmem_cache_free
56806 851518 3.5185 52.7422 __percpu_counter_add
48619 900137 3.0114 55.7536 copy_from_user
44803 944940 2.7751 58.5287 init_timer
28539 973479 1.7677 60.2964 d_alloc
27795 1001274 1.7216 62.0180 alloc_fd
26747 1028021 1.6567 63.6747 __fput
24312 1052333 1.5059 65.1805 sys_close
24205 1076538 1.4992 66.6798 inet_create
22409 1098947 1.3880 68.0677 alloc_inode
21359 1120306 1.3230 69.3907 release_sock
19865 1140171 1.2304 70.6211 fd_install
19472 1159643 1.2061 71.8272 lock_sock_nested
18956 1178599 1.1741 73.0013 sock_init_data
17301 1195900 1.0716 74.0729 drop_file_write_access
17113 1213013 1.0600 75.1329 inotify_d_instantiate
16384 1229397 1.0148 76.1477 dput
15173 1244570 0.9398 77.0875 local_bh_enable_ip
15017 1259587 0.9301 78.0176 local_bh_enable
13354 1272941 0.8271 78.8448 __sock_create
13139 1286080 0.8138 79.6586 inet_release
13062 1299142 0.8090 80.4676 sysenter_do_call
11935 1311077 0.7392 81.2069 iput_single


This patch series contains 7 patches against the linux-2.6 tree,
plus one patch in mm (fs: filp_cachep can be static in fs/file_table.c).

[PATCH 1/7] fs: Use a percpu_counter to track nr_dentry

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no longer needed to protect dentry_stat.nr_dentry

We centralize nr_dentry updates at the right place:
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if parent is NULL

("socketallocbench -n 8" bench result : 27.5s to 25s)

[PATCH 2/7] fs: Use a percpu_counter to track nr_inodes

Avoids cache line ping pongs between cpus and prepares the next patch,
because updates of nr_inodes don't need inode_lock anymore.

("socketallocbench -n 8" bench result: no difference at this point)

[PATCH 3/7] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by providing each cpu with a per_cpu variable,
fed by the shared last_ino, but refilled only once every 1024 allocations.

This reduces contention on the shared last_ino, and gives the same
spread of inode numbers as before.
(same wraparound after 2^32 allocations)

("socketallocbench -n 8" result: no difference)


[PATCH 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into the dcache,
since there is no way someone can look up such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries is expensive,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.


Differences between a SINGLE dentry and a normal one are:

1) SINGLE dentry has the DCACHE_SINGLE flag
2) A SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
This is to avoid taking a reference on the sb 'root' dentry, which is
shared by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

(socket8 bench result: from 25s to 19.9s)

[PATCH 5/7] fs: new_inode_single() and iput_single()

The goal of this patch is to avoid touching inode_lock for socket/pipe/anonfd
inode allocation/freeing.

SINGLE dentries are attached to inodes that don't need to be linked
into a list of inodes, be it "inode_in_use" or "sb->s_inodes".
As inode_lock was taken only to protect these lists, we avoid taking it as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying three
contended cache lines in new_inode(), and five cache lines in iput().

("socketallocbench -n 8" result : from 19.9s to 3.01s)


[PATH 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

From: Christoph Lameter <[email protected]>

Currently we schedule RCU frees for each file we free separately. That has
several drawbacks against the earlier file handling (in 2.6.5 f.e.), which
did not require RCU callbacks:

1. Excessive number of RCU callbacks can be generated causing long RCU
queues that in turn cause long latencies. We hit SLUB page allocation
more often than necessary.

2. The cache hot object is not preserved between free and realloc. A close
followed by another open is very fast with the RCUless approach because
the last freed object is returned by the slab allocator that is
still cache hot. RCU free means that the object is not immediately
available again. The new object is cache cold and therefore open/close
performance tests show a significant degradation with the RCU
implementation.

One solution to this problem is to move the RCU freeing into the Slab
allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
time. The slab allocator will do RCU frees only when it is necessary
to dispose of slabs of objects (rare). So with that approach we can cut
out the RCU overhead significantly.

However, the slab allocator may return the object for another use even
before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
there is the (unlikely) possibility that the object is going to be
switched under us in sections protected by rcu_read_lock() and
rcu_read_unlock(). So we need to verify that we have acquired the correct
object after establishing a stable object reference (incrementing the
refcounter does that).


Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

("socketallocbench -n 8" result : from 3.01s to 2.20s)

[PATCH 7/7] fs: MS_NOREFCOUNT

Some filesystems are hardwired into the kernel, and mntput()/mntget() hit a
contended cache line. We define a new superblock flag, MS_NOREFCOUNT, that is
set on the socket, pipe and anonymous fd superblocks. mntput()/mntget() become
no-ops on these fs.

("socketallocbench -n 8" result : from 2.20s to 1.64s)

cat socketallocbench.c
/*
 * socketallocbench benchmark
 *
 * Usage: socketallocbench [-n procs] [-l loops]
 */
#include <sys/socket.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/wait.h>

void dowork(int loops)
{
	int i;

	for (i = 0; i < loops; i++)
		close(socket(AF_INET, SOCK_STREAM, 0));
}

int main(int argc, char *argv[])
{
	int i;
	int n = 1;		/* number of worker processes */
	int loops = 1000000;	/* socket()/close() pairs per process */
	pid_t *pidtable;

	while ((i = getopt(argc, argv, "n:l:")) != EOF) {
		if (i == 'n')
			n = atoi(optarg);
		if (i == 'l')
			loops = atoi(optarg);
	}
	pidtable = malloc(n * sizeof(pid_t));
	/* fork n - 1 children; the parent acts as the n-th worker */
	for (i = 1; i < n; i++) {
		pidtable[i] = fork();
		if (pidtable[i] == 0) {
			dowork(loops);
			_exit(0);
		}
		if (pidtable[i] == -1) {
			perror("fork");
			n = i;
			break;
		}
	}
	dowork(loops);
	for (i = 1; i < n; i++) {
		int status;

		wait(&status);
	}
	return 0;
}
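
The benchmark needs no special build flags; the figures above correspond
to invocations along the lines of

	gcc -O2 -o socketallocbench socketallocbench.c
	./socketallocbench -n 8

(-l 1000000 is the default loop count).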

2008-12-11 22:42:53

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

From: Christoph Lameter <[email protected]>

[PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

Currently we schedule RCU frees for each file we free separately. That has
several drawbacks against the earlier file handling (in 2.6.5 f.e.), which
did not require RCU callbacks:

1. Excessive number of RCU callbacks can be generated causing long RCU
queues that in turn cause long latencies. We hit SLUB page allocation
more often than necessary.

2. The cache hot object is not preserved between free and realloc. A close
followed by another open is very fast with the RCUless approach because
the last freed object is returned by the slab allocator that is
still cache hot. RCU free means that the object is not immediately
available again. The new object is cache cold and therefore open/close
performance tests show a significant degradation with the RCU
implementation.

One solution to this problem is to move the RCU freeing into the Slab
allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
time. The slab allocator will do RCU frees only when it is necessary
to dispose of slabs of objects (rare). So with that approach we can cut
out the RCU overhead significantly.

However, the slab allocator may return the object for another use even
before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
there is the (unlikely) possibility that the object is going to be
switched under us in sections protected by rcu_read_lock() and
rcu_read_unlock(). So we need to verify that we have acquired the correct
object after establishing a stable object reference (incrementing the
refcounter does that).


Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/filesystems/files.txt | 21 ++++++++++++++--
fs/file_table.c | 33 ++++++++++++++++++--------
include/linux/fs.h | 5 ---
3 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt
index ac2facc..6916baa 100644
--- a/Documentation/filesystems/files.txt
+++ b/Documentation/filesystems/files.txt
@@ -78,13 +78,28 @@ the fdtable structure -
that look-up may race with the last put() operation on the
file structure. This is avoided using atomic_long_inc_not_zero()
on ->f_count :
+ As file structures are allocated with SLAB_DESTROY_BY_RCU,
+ they can also be freed before a RCU grace period, and reused,
+ but still as a struct file.
+ It is necessary to check again after getting
+ a stable reference (ie after atomic_long_inc_not_zero()),
+ that fcheck_files(files, fd) points to the same file.

rcu_read_lock();
file = fcheck_files(files, fd);
if (file) {
- if (atomic_long_inc_not_zero(&file->f_count))
+ if (atomic_long_inc_not_zero(&file->f_count)) {
*fput_needed = 1;
- else
+ /*
+ * Now we have a stable reference to an object.
+ * Check if other threads freed file and reallocated it.
+ */
+ if (file != fcheck_files(files, fd)) {
+ *fput_needed = 0;
+ put_filp(file);
+ file = NULL;
+ }
+ } else
/* Didn't get the reference, someone's freed */
file = NULL;
}
@@ -95,6 +110,8 @@ the fdtable structure -
atomic_long_inc_not_zero() detects if refcounts is already zero or
goes to zero during increment. If it does, we fail
fget()/fget_light().
+ The second call to fcheck_files(files, fd) checks that this filp
+ was not freed, then reused by an other thread.

6. Since both fdtable and file structures can be looked up
lock-free, they must be installed using rcu_assign_pointer()
diff --git a/fs/file_table.c b/fs/file_table.c
index a46e880..3e9259d 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;

static struct percpu_counter nr_files __cacheline_aligned_in_smp;

-static inline void file_free_rcu(struct rcu_head *head)
-{
- struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
- kmem_cache_free(filp_cachep, f);
-}
-
static inline void file_free(struct file *f)
{
percpu_counter_dec(&nr_files);
file_check_state(f);
- call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
+ kmem_cache_free(filp_cachep, f);
}

/*
@@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
rcu_read_unlock();
return NULL;
}
+ /*
+ * Now we have a stable reference to an object.
+ * Check if other threads freed file and re-allocated it.
+ */
+ if (unlikely(file != fcheck_files(files, fd))) {
+ put_filp(file);
+ file = NULL;
+ }
}
rcu_read_unlock();

@@ -333,9 +335,19 @@ struct file *fget_light(unsigned int fd, int *fput_needed)
rcu_read_lock();
file = fcheck_files(files, fd);
if (file) {
- if (atomic_long_inc_not_zero(&file->f_count))
+ if (atomic_long_inc_not_zero(&file->f_count)) {
*fput_needed = 1;
- else
+ /*
+ * Now we have a stable reference to an object.
+ * Check if other threads freed this file and
+ * re-allocated it.
+ */
+ if (unlikely(file != fcheck_files(files, fd))) {
+ *fput_needed = 0;
+ put_filp(file);
+ file = NULL;
+ }
+ } else
/* Didn't get the reference, someone's freed */
file = NULL;
}
@@ -402,7 +414,8 @@ void __init files_init(unsigned long mempages)
int n;

filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
- SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+ SLAB_HWCACHE_ALIGN | SLAB_DESTROY_BY_RCU | SLAB_PANIC,
+ NULL);

/*
* One file with associated inode and dcache is very roughly 1K.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a702d81..a1f56d4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -811,13 +811,8 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
#define FILE_MNT_WRITE_RELEASED 2

struct file {
- /*
- * fu_list becomes invalid after file_free is called and queued via
- * fu_rcuhead for RCU freeing
- */
union {
struct list_head fu_list;
- struct rcu_head fu_rcuhead;
} f_u;
struct path f_path;
#define f_dentry f_path.dentry

2008-12-11 22:42:34

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v3 5/7] fs: new_inode_single() and iput_single()

The goal of this patch is to avoid touching inode_lock for socket/pipe/anonfd
inode allocation/freeing.

SINGLE dentries are attached to inodes that don't need to be linked
into a list of inodes such as "inode_in_use" or "sb->s_inodes".
As inode_lock was taken only to protect these lists, we can avoid
taking it as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying
three contended cache lines in new_inode(), and five cache lines in iput()

("socketallocbench -n 8" result : from 19.9s to 3.01s)

Signed-off-by: Eric Dumazet <[email protected]>
---
fs/anon_inodes.c | 2 +-
fs/dcache.c | 2 +-
fs/inode.c | 29 ++++++++++++++++++++---------
fs/pipe.c | 2 +-
include/linux/fs.h | 12 +++++++++++-
net/socket.c | 2 +-
6 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 8bf83cb..89fd36d 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd);
*/
static struct inode *anon_inode_mkinode(void)
{
- struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
+ struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb);

if (!inode)
return ERR_PTR(-ENOMEM);
diff --git a/fs/dcache.c b/fs/dcache.c
index af3bfb3..3363853 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry)
return;
inode = dentry->d_inode;
if (inode)
- iput(inode);
+ iput_single(inode);
d_free(dentry);
}

diff --git a/fs/inode.c b/fs/inode.c
index dc8e72a..0fdfe1b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode)
kmem_cache_free(inode_cachep, (inode));
}

+void iput_single(struct inode *inode)
+{
+ if (atomic_dec_and_test(&inode->i_count)) {
+ destroy_inode(inode);
+ percpu_counter_dec(&nr_inodes);
+ }
+}

/*
* These are initializations that only need to be done
@@ -587,8 +594,9 @@ static int last_ino_get(void)
#endif

/**
- * new_inode - obtain an inode
+ * __new_inode - obtain an inode
* @sb: superblock
+ * @single: if true, dont link new inode in a list
*
* Allocates a new inode for given superblock. The default gfp_mask
* for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
@@ -598,7 +606,7 @@ static int last_ino_get(void)
* newly created inode's mapping
*
*/
-struct inode *new_inode(struct super_block *sb)
+struct inode *__new_inode(struct super_block *sb, int single)
{
/*
* On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
@@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb)
*/
struct inode * inode;

- spin_lock_prefetch(&inode_lock);
-
inode = alloc_inode(sb);
if (inode) {
percpu_counter_inc(&nr_inodes);
inode->i_state = 0;
inode->i_ino = last_ino_get();
- spin_lock(&inode_lock);
- list_add(&inode->i_list, &inode_in_use);
- list_add(&inode->i_sb_list, &sb->s_inodes);
- spin_unlock(&inode_lock);
+ if (single) {
+ INIT_LIST_HEAD(&inode->i_list);
+ INIT_LIST_HEAD(&inode->i_sb_list);
+ } else {
+ spin_lock(&inode_lock);
+ list_add(&inode->i_list, &inode_in_use);
+ list_add(&inode->i_sb_list, &sb->s_inodes);
+ spin_unlock(&inode_lock);
+ }
}
return inode;
}

-EXPORT_SYMBOL(new_inode);
+EXPORT_SYMBOL(__new_inode);

void unlock_new_inode(struct inode *inode)
{
diff --git a/fs/pipe.c b/fs/pipe.c
index 4de6dd5..8c51a0d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = {

static struct inode * get_pipe_inode(void)
{
- struct inode *inode = new_inode(pipe_mnt->mnt_sb);
+ struct inode *inode = new_inode_single(pipe_mnt->mnt_sb);
struct pipe_inode_info *pipe;

if (!inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a789346..a702d81 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1899,7 +1899,17 @@ extern void __iget(struct inode * inode);
extern void iget_failed(struct inode *);
extern void clear_inode(struct inode *);
extern void destroy_inode(struct inode *);
-extern struct inode *new_inode(struct super_block *);
+extern struct inode *__new_inode(struct super_block *, int);
+static inline struct inode *new_inode(struct super_block *sb)
+{
+ return __new_inode(sb, 0);
+}
+static inline struct inode *new_inode_single(struct super_block *sb)
+{
+ return __new_inode(sb, 1);
+}
+extern void iput_single(struct inode *);
+
extern int should_remove_suid(struct dentry *);
extern int file_remove_suid(struct file *);

diff --git a/net/socket.c b/net/socket.c
index 353c928..4017409 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -464,7 +464,7 @@ static struct socket *sock_alloc(void)
struct inode *inode;
struct socket *sock;

- inode = new_inode(sock_mnt->mnt_sb);
+ inode = new_inode_single(sock_mnt->mnt_sb);
if (!inode)
return NULL;

2008-12-11 22:43:29

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH v3 7/7] fs: MS_NOREFCOUNT

Some filesystems are hardwired into the kernel, and mntput()/mntget()
hit a contended cache line. We define a new superblock flag, MS_NOREFCOUNT,
that is set on the socket, pipe and anonymous-fd superblocks.
mntput()/mntget() become no-ops on these filesystems.

("socketallocbench -n 8" result : from 2.20s to 1.64s)

Signed-off-by: Eric Dumazet <[email protected]>
---
fs/anon_inodes.c | 1 +
fs/pipe.c | 3 ++-
include/linux/fs.h | 2 ++
include/linux/mount.h | 8 +++-----
net/socket.c | 1 +
5 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 89fd36d..de0ec3b 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -158,6 +158,7 @@ static int __init anon_inode_init(void)
error = PTR_ERR(anon_inode_mnt);
goto err_unregister_filesystem;
}
+ anon_inode_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT;
anon_inode_inode = anon_inode_mkinode();
if (IS_ERR(anon_inode_inode)) {
error = PTR_ERR(anon_inode_inode);
diff --git a/fs/pipe.c b/fs/pipe.c
index 8c51a0d..f547432 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1078,7 +1078,8 @@ static int __init init_pipe_fs(void)
if (IS_ERR(pipe_mnt)) {
err = PTR_ERR(pipe_mnt);
unregister_filesystem(&pipe_fs_type);
- }
+ } else
+ pipe_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT;
}
return err;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a1f56d4..11b0452 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -137,6 +137,8 @@ extern int dir_notify_enable;
#define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */
#define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */
#define MS_I_VERSION (1<<23) /* Update inode I_version field */
+
+#define MS_NOREFCOUNT (1<<29) /* kernel static mnt : no refcounting needed */
#define MS_ACTIVE (1<<30)
#define MS_NOUSER (1<<31)

diff --git a/include/linux/mount.h b/include/linux/mount.h
index cab2a85..51418b5 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -14,10 +14,8 @@
#include <linux/nodemask.h>
#include <linux/spinlock.h>
#include <asm/atomic.h>
+#include <linux/fs.h>

-struct super_block;
-struct vfsmount;
-struct dentry;
struct mnt_namespace;

#define MNT_NOSUID 0x01
@@ -73,7 +71,7 @@ struct vfsmount {

static inline struct vfsmount *mntget(struct vfsmount *mnt)
{
- if (mnt)
+ if (mnt && !(mnt->mnt_sb->s_flags & MS_NOREFCOUNT))
atomic_inc(&mnt->mnt_count);
return mnt;
}
@@ -87,7 +85,7 @@ extern int __mnt_is_readonly(struct vfsmount *mnt);

static inline void mntput(struct vfsmount *mnt)
{
- if (mnt) {
+ if (mnt && !(mnt->mnt_sb->s_flags & MS_NOREFCOUNT)) {
mnt->mnt_expiry_mark = 0;
mntput_no_expire(mnt);
}
diff --git a/net/socket.c b/net/socket.c
index 4017409..2534dbc 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2206,6 +2206,7 @@ static int __init sock_init(void)
init_inodecache();
register_filesystem(&sock_fs_type);
sock_mnt = kern_mount(&sock_fs_type);
+ sock_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT;

/* The real protocol initialization is performed in later initcalls.
*/

2008-12-12 02:50:38

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

On Tuesday 24 July 2007 11:13, Nick Piggin wrote:
> On Friday 12 December 2008 09:40, Eric Dumazet wrote:
> > [...]
> > @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
> > rcu_read_unlock();
> > return NULL;
> > }
> > + /*
> > + * Now we have a stable reference to an object.
> > + * Check if other threads freed file and re-allocated it.
> > + */
> > + if (unlikely(file != fcheck_files(files, fd))) {
> > + put_filp(file);
> > + file = NULL;
> > + }
>
> This is a non-trivial change, because that put_filp may drop the last
> reference to the file. So now we have the case where we free the file
> from a context in which it had never been allocated.
>
> From a quick glance through the callchains, I can't see an obvious
> problem. But it needs documentation in put_filp, or at least
> a mention in the changelog, and should also be cc'ed to the security lists.
>
> Also, it adds code and cost to the get/put path in return for
> improvement in the free path. get/put is the more common path, but
> it is a small loss for a big improvement. So it might be worth it. But
> it is not justified by your microbenchmark. Do we have a more useful
> case where it helps?

Sorry, my clock screwed up and I didn't notice :(

2008-12-12 04:46:44

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

Nick Piggin wrote:
> On Friday 12 December 2008 09:40, Eric Dumazet wrote:
>> [...]
>> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
>> rcu_read_unlock();
>> return NULL;
>> }
>> + /*
>> + * Now we have a stable reference to an object.
>> + * Check if other threads freed file and re-allocated it.
>> + */
>> + if (unlikely(file != fcheck_files(files, fd))) {
>> + put_filp(file);
>> + file = NULL;
>> + }
>
> This is a non-trivial change, because that put_filp may drop the last
> reference to the file. So now we have the case where we free the file
> from a context in which it had never been allocated.

If we got to this point, we:

Found a non-NULL pointer in our fd table.
Then another thread came and closed the file before we had added our reference.
This file was freed (kmem_cache_free(filp_cachep, file)).
This file was reused and inserted into another thread's fd table.
We added our reference to the refcount.
We checked whether this file is still ours (in our fd table).
We found this file is no longer the file we wanted.
Calling put_filp() here is our only choice to safely remove the reference on
a truly allocated file. At this point the file is
a truly allocated file, but no longer ours.
Unfortunately we added a reference to it: we must release it.
If the other thread already called put_filp() because it wanted to close its new file,
we must see f_count go to zero, and we must call __fput() to perform
all the relevant file cleanup ourselves.
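
To make the pattern concrete, here is a minimal sketch of the lockless
lookup being discussed (illustrative only, not the patch itself;
lookup_file_get() and "slot" are made-up names, "slot" standing for the
RCU-published fd table entry that fget() uses):

static struct file *lookup_file_get(struct file **slot)
{
	struct file *file;

	rcu_read_lock();
	file = rcu_dereference(*slot);
	if (file) {
		if (!atomic_long_inc_not_zero(&file->f_count)) {
			/* Raced with the final put : object is being freed */
			file = NULL;
		} else if (file != rcu_dereference(*slot)) {
			/*
			 * The object was freed and reused (possibly by
			 * another process) between the first dereference
			 * and our increment : drop the reference we took.
			 */
			put_filp(file);
			file = NULL;
		}
	}
	rcu_read_unlock();
	return file;
}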


>
> From a quick glance through the callchains, I can't see an obvious
> problem. But it needs documentation in put_filp, or at least
> a mention in the changelog, and should also be cc'ed to the security lists.

I see your point. But currently, any thread can be the one "releasing the last
reference on a file"; that is not always the thread that called close(fd).
We extend this to "any thread of any process", so it might have
a security effect; you are absolutely right.

>
> Also, it adds code and cost to the get/put path in return for
> improvement in the free path. get/put is the more common path, but
> it is a small loss for a big improvement. So it might be worth it. But
> it is not justified by your microbenchmark. Do we have a more useful
> case where it helps?

Any real-world program that opens and closes files, or better said,
that closes and opens files :)

sizeof(struct file) is 192 bytes. That's three cache lines.
Being able to reuse a hot "struct file" avoids three cache line misses.

That's about 120 ns.

Then, using call_rcu() is also a latency killer, since we explicitly say:
I don't want to free this file right now, I delegate this job to another layer
two or three milliseconds (or more) from now.

A final point is that SLUB doesn't need to allocate or free a slab in many cases.
(This is probably why Christoph needed this patch in 2006 :) )
In my case, I need all these patches to speed up http servers.
They obviously open and close many files per second.

The added code has a cost of less than 3 ns, but I suspect we can cut it
to less than 1 ns. Christoph, Paul and I preferred to keep the patch as short
as possible, to focus on the essential points.

:c0287656: mov -0x14(%ebp),%esi
:c0287659: mov -0x24(%ebp),%edi
:c028765c: mov 0x4(%esi),%eax
:c028765f: cmp (%eax),%edi
:c0287661: jb c0287678 <fget+0xc8>
:c0287663: mov %ebx,%eax
:c0287665: xor %ebx,%ebx
:c0287667: call c0287450 <put_filp>
:c028766c: jmp c02875ec <fget+0x3c>
:c0287671: lea 0x0(%esi,%eiz,1),%esi
:c0287678: mov 0x4(%eax),%edi
:c028767b: add %edi,-0x10(%ebp)
:c028767e: mov -0x10(%ebp),%edx
1 8.8e-05 :c0287681: mov (%edx),%eax
:c0287683: cmp %eax,%ebx
:c0287685: je c02875ec <fget+0x3c>
:c028768b: jmp c0287663 <fget+0xb3>

We could avoid redoing the full test, because there is no way files->max_fds
could become lower under us, nor could fdt itself, or fdt->fd, change under us.

So instead of calling this function twice:

static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd)
{
	struct file * file = NULL;
	struct fdtable *fdt = files_fdtable(files);

	if (fd < fdt->max_fds)
		file = rcu_dereference(fdt->fd[fd]);
	return file;
}

We could use the attached patch instead.


The check then becomes a matter of three instructions, including a 99.99%-predicted branch:

c0287646: 8b 03 mov (%ebx),%eax
c0287648: 39 45 e4 cmp %eax,-0x1c(%ebp)
c028764b: 74 a1 je c02875ee <fget+0x3e>

c028764d: 8b 45 e4 mov -0x1c(%ebp),%eax
c0287650: e8 fb fd ff ff call c0287450 <put_filp>
c0287655: 31 c0 xor %eax,%eax
c0287657: eb 98 jmp c02875f1 <fget+0x41>


At the time Christoph sent his patch (in 2006), nobody cared, because
we had no benchmark or real-world workload that demonstrated the gain
of his patch, only intuitions.
We had too many contended cache lines slowing down the whole process.

SLAB_DESTROY_BY_RCU is a must on current hardware, where memory cache line
miss costs have become really problematic. This patch series clearly
demonstrates it.

Thanks Nick for your feedback and comments.

Eric

[PATCH] fs: optimize fget() & fget_light()

Instead of calling fcheck_files() a second time, we can take into account
that we already did part of the job inside the RCU read-locked section. We keep
a struct file **filp pointer to the fd slot, so we only need to dereference it again.

Signed-off-by: Eric Dumazet <[email protected]>
---
fs/file_table.c | 23 +++++++++++++++++------
1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 3e9259d..4bc019f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -289,11 +289,16 @@ void __fput(struct file *file)

struct file *fget(unsigned int fd)
{
- struct file *file;
+ struct file *file = NULL, **filp;
struct files_struct *files = current->files;
+ struct fdtable *fdt;

rcu_read_lock();
- file = fcheck_files(files, fd);
+ fdt = files_fdtable(files);
+ if (likely(fd < fdt->max_fds)) {
+ filp = &fdt->fd[fd];
+ file = rcu_dereference(*filp);
+ }
if (file) {
if (!atomic_long_inc_not_zero(&file->f_count)) {
/* File object ref couldn't be taken */
@@ -304,7 +309,7 @@ struct file *fget(unsigned int fd)
* Now we have a stable reference to an object.
* Check if other threads freed file and re-allocated it.
*/
- if (unlikely(file != fcheck_files(files, fd))) {
+ if (unlikely(file != rcu_dereference(*filp))) {
put_filp(file);
file = NULL;
}
@@ -325,15 +330,21 @@ EXPORT_SYMBOL(fget);
*/
struct file *fget_light(unsigned int fd, int *fput_needed)
{
- struct file *file;
+ struct file *file, **filp;
struct files_struct *files = current->files;
+ struct fdtable *fdt;

*fput_needed = 0;
if (likely((atomic_read(&files->count) == 1))) {
file = fcheck_files(files, fd);
} else {
rcu_read_lock();
- file = fcheck_files(files, fd);
+ fdt = files_fdtable(files);
+ file = NULL;
+ if (likely(fd < fdt->max_fds)) {
+ filp = &fdt->fd[fd];
+ file = rcu_dereference(*filp);
+ }
if (file) {
if (atomic_long_inc_not_zero(&file->f_count)) {
*fput_needed = 1;
@@ -342,7 +353,7 @@ struct file *fget_light(unsigned int fd, int *fput_needed)
* Check if other threads freed this file and
* re-allocated it.
*/
- if (unlikely(file != fcheck_files(files, fd))) {
+ if (unlikely(file != rcu_dereference(*filp))) {
*fput_needed = 0;
put_filp(file);
file = NULL;

2008-12-12 05:12:28

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes

Nick Piggin wrote:
> On Friday 12 December 2008 09:39, Eric Dumazet wrote:
>> Avoids cache line ping-pongs between cpus, and prepares the next patch,
>> because updates of nr_inodes don't need inode_lock anymore.
>>
>> (socket8 bench result : no difference at this point)
>
> Looks good.
>
> But.... If we never actually need fast access to the approximate
> total, (which seems to apply to this and the previous patch) we
> could use something much simpler which does not have the spinlock
> or all this batching stuff that percpu counters have. I'd prefer
> that because it will be faster in a straight line...

Well, using a non-batching mode could be really easy: just
call __percpu_counter_add(&counter, inc, 1<<30);

Or define a new percpu_counter_fastadd(&counter, inc);

percpu_counters are nice because they handle the CPU hotplug problem,
if we want to use for_each_online_cpu() instead of
for_each_possible_cpu().
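
Such a helper could look like this (a sketch of the suggestion above;
percpu_counter_fastadd() is a hypothetical name, __percpu_counter_add()
is the existing batched primitive):

/*
 * Hypothetical helper : pass a huge batch so the per-cpu delta
 * (almost) never spills into the shared count, avoiding the
 * spinlock in the common case.
 */
static inline void percpu_counter_fastadd(struct percpu_counter *fbc,
					  s64 amount)
{
	__percpu_counter_add(fbc, amount, 1 << 30);
}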

>
> (BTW. percpu counters can't be used in interrupt context? That's
> nice.)
>
>

Not sure why you said this.

I would like to have an irqsafe percpu_counter; I was preparing such a
patch, because we need it for net-next.


2008-12-12 16:49:19

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

Eric Dumazet wrote:
> Nick Piggin wrote:
>> On Friday 12 December 2008 09:40, Eric Dumazet wrote:
>>> [...]
>>> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
>>> rcu_read_unlock();
>>> return NULL;
>>> }
>>> + /*
>>> + * Now we have a stable reference to an object.
>>> + * Check if other threads freed file and re-allocated it.
>>> + */
>>> + if (unlikely(file != fcheck_files(files, fd))) {
>>> + put_filp(file);
>>> + file = NULL;
>>> + }
>> This is a non-trivial change, because that put_filp may drop the last
>> reference to the file. So now we have the case where we free the file
>> from a context in which it had never been allocated.
>
> If we got to this point, we:
>
> Found a non-NULL pointer in our fd table.
> Then another thread came and closed the file before we had added our reference.
> This file was freed (kmem_cache_free(filp_cachep, file)).
> This file was reused and inserted into another thread's fd table.
> We added our reference to the refcount.
> We checked whether this file is still ours (in our fd table).
> We found this file is no longer the file we wanted.
> Calling put_filp() here is our only choice to safely remove the reference on
> a truly allocated file. At this point the file is
> a truly allocated file, but no longer ours.
> Unfortunately we added a reference to it: we must release it.
> If the other thread already called put_filp() because it wanted to close its new file,
> we must see f_count go to zero, and we must call __fput() to perform
> all the relevant file cleanup ourselves.

Reading this mail again, I realise we call put_filp(file), while it should
be either fput(file) or put_filp(file); we don't know which.

Damned, this patch is wrong as is.

Christoph, Paul, do you see the problem?

In fget()/fget_light() we don't know whether the other thread (the one that
re-allocated the file, and tried to close it while we held a reference on the file)
had to call put_filp() or fput() to release its own reference. So when we call
atomic_long_dec_and_test() we cannot take the appropriate action (calling the
full __fput() version, or the small one that is used to 'close' a file that was
never really opened).

void put_filp(struct file *file)
{
	if (atomic_long_dec_and_test(&file->f_count)) {
		security_file_free(file);
		file_kill(file);
		file_free(file);
	}
}

void fput(struct file *file)
{
	if (atomic_long_dec_and_test(&file->f_count))
		__fput(file);
}

I believe put_filp() is only called on the slow path (error cases).

Should we just zap it and always call fput()?
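
If so, the failing re-check in fget()/fget_light() would simply become
(a sketch, assuming fput() copes with files in any intermediate state):

	/*
	 * Sketch : release via fput() so the full __fput() cleanup runs
	 * if our reference turned out to be the last one.
	 */
	if (unlikely(file != fcheck_files(files, fd))) {
		fput(file);		/* instead of put_filp(file) */
		file = NULL;
	}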


2008-12-13 01:43:24

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

On Fri, 12 Dec 2008, Eric Dumazet wrote:


> > This is a non-trivial change, because that put_filp may drop the last
> > reference to the file. So now we have the case where we free the file
> > from a context in which it had never been allocated.
>
> If we got to this point, we:
>
> Found a non-NULL pointer in our fd table.
> Then another thread came and closed the file before we had added our reference.
> This file was freed (kmem_cache_free(filp_cachep, file)).
> This file was reused and inserted into another thread's fd table.
> We added our reference to the refcount.
> We checked whether this file is still ours (in our fd table).
> We found this file is no longer the file we wanted.
> Calling put_filp() here is our only choice to safely remove the reference on
> a truly allocated file. At this point the file is
> a truly allocated file, but no longer ours.
> Unfortunately we added a reference to it: we must release it.
> If the other thread already called put_filp() because it wanted to close its new file,
> we must see f_count go to zero, and we must call __fput() to perform
> all the relevant file cleanup ourselves.

Correct. That was the idea.

> A final point is that SLUB doesn't need to allocate or free a slab in many cases.
> (This is probably why Christoph needed this patch in 2006 :) )

We needed this patch in 2006 because the AIM9 creat-clo test showed
regressions after the RCU free was put in (discovered during the SLES11
verification cycle). All slab allocators at least defer frees until all
objects in the page are freed, if not longer.

> In my case, I need all these patches to speed up http servers.
> They obviously open and close many files per second.

Run AIM9 creat-close tests....

> SLAB_DESTROY_BY_RCU is a must on current hardware, where memory cache line
> miss costs have become really problematic. This patch series clearly
> demonstrates it.

Well the issue becomes more severe as accesses to cold memory become more
extensive. Thanks for your work on this.

2008-12-13 02:09:21

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

On Fri, 12 Dec 2008, Eric Dumazet wrote:

> > a truly allocated file. At this point the file is
> > a truly allocated file, but no longer ours.

It's a valid file. Does ownership matter here?

> Reading this mail again, I realise we call put_filp(file), while it should
> be either fput(file) or put_filp(file); we don't know which.
>
> Damned, this patch is wrong as is.
>
> Christoph, Paul, do you see the problem?

Yes.

> In fget()/fget_light() we don't know whether the other thread (the one that
> re-allocated the file, and tried to close it while we held a reference on the file)
> had to call put_filp() or fput() to release its own reference. So when we call
> atomic_long_dec_and_test() we cannot take the appropriate action (calling the
> full __fput() version, or the small one that is used to 'close' a file that was
> never really opened).

The difference is mainly that fput() does full processing whereas
put_filp() is used when we know that the file was not fully operational.
If the checks in __fput() are able to handle the put_filp() situation by not
releasing resources that were never allocated, then we should be fine.

> I believe put_filp() is only called on the slow path (error cases).

Looks like it. It seems to assume that no dentry is associated.

> Should we just zap it and always call fput()?

Only if fput() can handle partially setup files.

2008-12-16 21:04:53

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry

On Thu, Dec 11, 2008 at 11:38:56PM +0100, Eric Dumazet wrote:
> Adding a percpu_counter nr_dentry avoids cache line ping-pongs
> between cpus to maintain this metric, and dcache_lock is
> no longer needed to protect dentry_stat.nr_dentry.
>
> We centralize nr_dentry updates at the right places :
> - increments in d_alloc()
> - decrements in d_free()
>
> d_alloc() can avoid taking dcache_lock if parent is NULL
>
> ("socketallocbench -n8" result : 27.5s to 25s)

Looks good! (At least once I realised that nr_dentry was global rather
than per-dentry!!!)

Reviewed-by: Paul E. McKenney <[email protected]>

> Signed-off-by: Eric Dumazet <[email protected]>
> ---
> fs/dcache.c | 49 +++++++++++++++++++++++++------------------
> include/linux/fs.h | 2 +
> kernel/sysctl.c | 2 -
> 3 files changed, 32 insertions(+), 21 deletions(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index fa1ba03..f463a81 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly;
> static unsigned int d_hash_mask __read_mostly;
> static unsigned int d_hash_shift __read_mostly;
> static struct hlist_head *dentry_hashtable __read_mostly;
> +static struct percpu_counter nr_dentry;
>
> /* Statistics gathering. */
> struct dentry_stat_t dentry_stat = {
> .age_limit = 45,
> };
>
> +/*
> + * Handle nr_dentry sysctl
> + */
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> +int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry);
> + return proc_dointvec(table, write, filp, buffer, lenp, ppos);
> +}
> +#else
> +int proc_nr_dentry(ctl_table *table, int write, struct file *filp,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + return -ENOSYS;
> +}
> +#endif
> +
> static void __d_free(struct dentry *dentry)
> {
> WARN_ON(!list_empty(&dentry->d_alias));
> @@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head)
> }
>
> /*
> - * no dcache_lock, please. The caller must decrement dentry_stat.nr_dentry
> - * inside dcache_lock.
> + * no dcache_lock, please.
> */
> static void d_free(struct dentry *dentry)
> {
> @@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry)
> __d_free(dentry);
> else
> call_rcu(&dentry->d_u.d_rcu, d_callback);
> + percpu_counter_dec(&nr_dentry);
> }
>
> /*
> @@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry)
> struct dentry *parent;
>
> list_del(&dentry->d_u.d_child);
> - dentry_stat.nr_dentry--; /* For d_free, below */
> /*drops the locks, at that point nobody can reach this dentry */
> dentry_iput(dentry);
> if (IS_ROOT(dentry))
> @@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb)
> static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
> {
> struct dentry *parent;
> - unsigned detached = 0;
>
> BUG_ON(!IS_ROOT(dentry));
>
> @@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
> }
>
> list_del(&dentry->d_u.d_child);
> - detached++;
>
> inode = dentry->d_inode;
> if (inode) {
> @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
> * otherwise we ascend to the parent and move to the
> * next sibling if there is one */
> if (!parent)
> - goto out;
> + return;
>
> dentry = parent;
>
> @@ -705,11 +721,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
> dentry = list_entry(dentry->d_subdirs.next,
> struct dentry, d_u.d_child);
> }
> -out:
> - /* several dentries were freed, need to correct nr_dentry */
> - spin_lock(&dcache_lock);
> - dentry_stat.nr_dentry -= detached;
> - spin_unlock(&dcache_lock);
> }
>
> /*
> @@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
> dentry->d_flags = DCACHE_UNHASHED;
> spin_lock_init(&dentry->d_lock);
> dentry->d_inode = NULL;
> - dentry->d_parent = NULL;
> - dentry->d_sb = NULL;
> dentry->d_op = NULL;
> dentry->d_fsdata = NULL;
> dentry->d_mounted = 0;
> @@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name)
> if (parent) {
> dentry->d_parent = dget(parent);
> dentry->d_sb = parent->d_sb;
> + spin_lock(&dcache_lock);
> + list_add(&dentry->d_u.d_child, &parent->d_subdirs);
> + spin_unlock(&dcache_lock);
> } else {
> + dentry->d_parent = NULL;
> + dentry->d_sb = NULL;
> INIT_LIST_HEAD(&dentry->d_u.d_child);
> }
> -
> - spin_lock(&dcache_lock);
> - if (parent)
> - list_add(&dentry->d_u.d_child, &parent->d_subdirs);
> - dentry_stat.nr_dentry++;
> - spin_unlock(&dcache_lock);
> -
> + percpu_counter_inc(&nr_dentry);
> return dentry;
> }
>
> @@ -2282,6 +2290,7 @@ static void __init dcache_init(void)
> {
> int loop;
>
> + percpu_counter_init(&nr_dentry, 0);
> /*
> * A constructor could be added for stable state like the lists,
> * but it is probably not worth it because of the cache nature
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4a853ef..114cb65 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2217,6 +2217,8 @@ static inline void free_secdata(void *secdata)
> struct ctl_table;
> int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
> void __user *buffer, size_t *lenp, loff_t *ppos);
> +int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
> + void __user *buffer, size_t *lenp, loff_t *ppos);
>
> int get_filesystem_list(char * buf);
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 3d56fe7..777bee7 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1246,7 +1246,7 @@ static struct ctl_table fs_table[] = {
> .data = &dentry_stat,
> .maxlen = 6*sizeof(int),
> .mode = 0444,
> - .proc_handler = &proc_dointvec,
> + .proc_handler = &proc_nr_dentry,
> },
> {
> .ctl_name = FS_OVERFLOWUID,

2008-12-16 21:11:19

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes

On Thu, Dec 11, 2008 at 11:39:10PM +0100, Eric Dumazet wrote:
> Avoids cache line ping-pongs between cpus, and prepares the next patch,
> because updates of nr_inodes don't need inode_lock anymore.
>
> (socket8 bench result : no difference at this point)

I do like this per-CPU counter infrastructure!

One small comment change noted below. Other than that:

Reviewed-by: Paul E. McKenney <[email protected]>

> Signed-off-by: Eric Dumazet <[email protected]>
> ---
> fs/fs-writeback.c | 2 +-
> fs/inode.c | 39 +++++++++++++++++++++++++++++++--------
> include/linux/fs.h | 3 +++
> kernel/sysctl.c | 4 ++--
> mm/page-writeback.c | 2 +-
> 5 files changed, 38 insertions(+), 12 deletions(-)
>
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d0ff0b8..b591cdd 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait)
> unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
>
> wbc.nr_to_write = nr_dirty + nr_unstable +
> - (inodes_stat.nr_inodes - inodes_stat.nr_unused) +
> + (get_nr_inodes() - inodes_stat.nr_unused) +
> nr_dirty + nr_unstable;
> wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */
> sync_sb_inodes(sb, &wbc);
> diff --git a/fs/inode.c b/fs/inode.c
> index 0487ddb..f94f889 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -96,9 +96,33 @@ static DEFINE_MUTEX(iprune_mutex);
> * Statistics gathering..
> */
> struct inodes_stat_t inodes_stat;
> +static struct percpu_counter nr_inodes;
>
> static struct kmem_cache * inode_cachep __read_mostly;
>
> +int get_nr_inodes(void)
> +{
> + return percpu_counter_sum_positive(&nr_inodes);
> +}
> +
> +/*
> + * Handle nr_dentry sysctl

That would be "nr_inode", right?

> + */
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
> +int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + inodes_stat.nr_inodes = get_nr_inodes();
> + return proc_dointvec(table, write, filp, buffer, lenp, ppos);
> +}
> +#else
> +int proc_nr_inodes(ctl_table *table, int write, struct file *filp,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + return -ENOSYS;
> +}
> +#endif
> +
> static void wake_up_inode(struct inode *inode)
> {
> /*
> @@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head)
> destroy_inode(inode);
> nr_disposed++;
> }
> - spin_lock(&inode_lock);
> - inodes_stat.nr_inodes -= nr_disposed;
> - spin_unlock(&inode_lock);
> + percpu_counter_sub(&nr_inodes, nr_disposed);
> }
>
> /*
> @@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb)
>
> inode = alloc_inode(sb);
> if (inode) {
> + percpu_counter_inc(&nr_inodes);
> spin_lock(&inode_lock);
> - inodes_stat.nr_inodes++;
> list_add(&inode->i_list, &inode_in_use);
> list_add(&inode->i_sb_list, &sb->s_inodes);
> inode->i_ino = ++last_ino;
> @@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h
> if (set(inode, data))
> goto set_failed;
>
> - inodes_stat.nr_inodes++;
> + percpu_counter_inc(&nr_inodes);
> list_add(&inode->i_list, &inode_in_use);
> list_add(&inode->i_sb_list, &sb->s_inodes);
> hlist_add_head(&inode->i_hash, head);
> @@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he
> old = find_inode_fast(sb, head, ino);
> if (!old) {
> inode->i_ino = ino;
> - inodes_stat.nr_inodes++;
> + percpu_counter_inc(&nr_inodes);
> list_add(&inode->i_list, &inode_in_use);
> list_add(&inode->i_sb_list, &sb->s_inodes);
> hlist_add_head(&inode->i_hash, head);
> @@ -1042,8 +1064,8 @@ void generic_delete_inode(struct inode *inode)
> list_del_init(&inode->i_list);
> list_del_init(&inode->i_sb_list);
> inode->i_state |= I_FREEING;
> - inodes_stat.nr_inodes--;
> spin_unlock(&inode_lock);
> + percpu_counter_dec(&nr_inodes);
>
> security_inode_delete(inode);
>
> @@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode)
> list_del_init(&inode->i_list);
> list_del_init(&inode->i_sb_list);
> inode->i_state |= I_FREEING;
> - inodes_stat.nr_inodes--;
> spin_unlock(&inode_lock);
> + percpu_counter_dec(&nr_inodes);
> if (inode->i_data.nrpages)
> truncate_inode_pages(&inode->i_data, 0);
> clear_inode(inode);
> @@ -1394,6 +1416,7 @@ void __init inode_init(void)
> {
> int loop;
>
> + percpu_counter_init(&nr_inodes, 0);
> /* inode slab cache */
> inode_cachep = kmem_cache_create("inode_cache",
> sizeof(struct inode),
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 114cb65..a789346 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -47,6 +47,7 @@ struct inodes_stat_t {
> int dummy[5]; /* padding for sysctl ABI compatibility */
> };
> extern struct inodes_stat_t inodes_stat;
> +extern int get_nr_inodes(void);
>
> extern int leases_enable, lease_break_time;
>
> @@ -2219,6 +2220,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
> void __user *buffer, size_t *lenp, loff_t *ppos);
> int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp,
> void __user *buffer, size_t *lenp, loff_t *ppos);
> +int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp,
> + void __user *buffer, size_t *lenp, loff_t *ppos);
>
> int get_filesystem_list(char * buf);
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 777bee7..b705f3a 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1205,7 +1205,7 @@ static struct ctl_table fs_table[] = {
> .data = &inodes_stat,
> .maxlen = 2*sizeof(int),
> .mode = 0444,
> - .proc_handler = &proc_dointvec,
> + .proc_handler = &proc_nr_inodes,
> },
> {
> .ctl_name = FS_STATINODE,
> @@ -1213,7 +1213,7 @@ static struct ctl_table fs_table[] = {
> .data = &inodes_stat,
> .maxlen = 7*sizeof(int),
> .mode = 0444,
> - .proc_handler = &proc_dointvec,
> + .proc_handler = &proc_nr_inodes,
> },
> {
> .procname = "file-nr",
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 2970e35..a71a922 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg)
> next_jif = start_jif + dirty_writeback_interval;
> nr_to_write = global_page_state(NR_FILE_DIRTY) +
> global_page_state(NR_UNSTABLE_NFS) +
> - (inodes_stat.nr_inodes - inodes_stat.nr_unused);
> + (get_nr_inodes() - inodes_stat.nr_unused);
> while (nr_to_write > 0) {
> wbc.more_io = 0;
> wbc.encountered_congestion = 0;

2008-12-16 21:26:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator

On Thu, Dec 11, 2008 at 11:39:18PM +0100, Eric Dumazet wrote:
> new_inode() dirties a contended cache line to get increasing
> inode numbers.
>
> Solve this problem by providing each cpu with a per_cpu variable,
> fed from the shared last_ino, but only once every 1024 allocations.
>
> This reduces contention on the shared last_ino, and gives the same
> spread of inode numbers as before.
> (same wraparound after 2^32 allocations)

One question below, but just a clarification. Works correctly as is,
though a bit strangely.

Reviewed-by: Paul E. McKenney <[email protected]>

> Signed-off-by: Eric Dumazet <[email protected]>
> ---
> fs/inode.c | 35 ++++++++++++++++++++++++++++++++---
> 1 files changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index f94f889..dc8e72a 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -556,6 +556,36 @@ repeat:
> return node ? inode : NULL;
> }
>
> +#ifdef CONFIG_SMP
> +/*
> + * Each cpu owns a range of 1024 numbers.
> + * 'shared_last_ino' is dirtied only once out of 1024 allocations,
> + * to renew the exhausted range.
> + */
> +static DEFINE_PER_CPU(int, last_ino);
> +
> +static int last_ino_get(void)
> +{
> + static atomic_t shared_last_ino;
> + int *p = &get_cpu_var(last_ino);
> + int res = *p;
> +
> + if (unlikely((res & 1023) == 0))
> + res = atomic_add_return(1024, &shared_last_ino) - 1024;
> +
> + *p = ++res;

So the first CPU gets the range [1:1024], the second [1025:2048], and
so on, eventually wrapping to [4294966273:0]. Is that the intent?

(I don't see a problem with this, just seems a bit strange.)

> + put_cpu_var(last_ino);
> + return res;
> +}
> +#else
> +static int last_ino_get(void)
> +{
> + static int last_ino;
> +
> + return ++last_ino;
> +}
> +#endif
> +
> /**
> * new_inode - obtain an inode
> * @sb: superblock
> @@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb)
> * error if st_ino won't fit in target struct field. Use 32bit counter
> * here to attempt to avoid that.
> */
> - static unsigned int last_ino;
> struct inode * inode;
>
> spin_lock_prefetch(&inode_lock);
> @@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb)
> inode = alloc_inode(sb);
> if (inode) {
> percpu_counter_inc(&nr_inodes);
> + inode->i_state = 0;
> + inode->i_ino = last_ino_get();
> spin_lock(&inode_lock);
> list_add(&inode->i_list, &inode_in_use);
> list_add(&inode->i_sb_list, &sb->s_inodes);
> - inode->i_ino = ++last_ino;
> - inode->i_state = 0;
> spin_unlock(&inode_lock);
> }
> return inode;

2008-12-16 21:41:33

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v3 5/7] fs: new_inode_single() and iput_single()

On Thu, Dec 11, 2008 at 11:40:07PM +0100, Eric Dumazet wrote:
> The goal of this patch is to avoid touching inode_lock for socket/pipe/anonfd
> inode allocation/freeing.
>
> SINGLE dentries are attached to inodes that don't need to be linked
> into a list of inodes such as "inode_in_use" or "sb->s_inodes".
> As inode_lock was taken only to protect these lists, we can avoid
> taking it as well.
>
> Using iput_single() from dput_single() avoids taking inode_lock
> at freeing time.
>
> This patch has a very noticeable effect, because we avoid dirtying
> three contended cache lines in new_inode(), and five cache lines in iput()
>
> ("socketallocbench -n 8" result : from 19.9s to 3.01s)

Nice!

Acked-by: Paul E. McKenney <[email protected]>

> Signed-off-by: Eric Dumazet <[email protected]>
> ---
> fs/anon_inodes.c | 2 +-
> fs/dcache.c | 2 +-
> fs/inode.c | 29 ++++++++++++++++++++---------
> fs/pipe.c | 2 +-
> include/linux/fs.h | 12 +++++++++++-
> net/socket.c | 2 +-
> 6 files changed, 35 insertions(+), 14 deletions(-)
>
> diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
> index 8bf83cb..89fd36d 100644
> --- a/fs/anon_inodes.c
> +++ b/fs/anon_inodes.c
> @@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd);
> */
> static struct inode *anon_inode_mkinode(void)
> {
> - struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
> + struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb);
>
> if (!inode)
> return ERR_PTR(-ENOMEM);
> diff --git a/fs/dcache.c b/fs/dcache.c
> index af3bfb3..3363853 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry)
> return;
> inode = dentry->d_inode;
> if (inode)
> - iput(inode);
> + iput_single(inode);
> d_free(dentry);
> }
>
> diff --git a/fs/inode.c b/fs/inode.c
> index dc8e72a..0fdfe1b 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode)
> kmem_cache_free(inode_cachep, (inode));
> }
>
> +void iput_single(struct inode *inode)
> +{
> + if (atomic_dec_and_test(&inode->i_count)) {
> + destroy_inode(inode);
> + percpu_counter_dec(&nr_inodes);
> + }
> +}
>
> /*
> * These are initializations that only need to be done
> @@ -587,8 +594,9 @@ static int last_ino_get(void)
> #endif
>
> /**
> - * new_inode - obtain an inode
> + * __new_inode - obtain an inode
> * @sb: superblock
> + * @single: if true, don't link the new inode into any list
> *
> * Allocates a new inode for given superblock. The default gfp_mask
> * for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
> @@ -598,7 +606,7 @@ static int last_ino_get(void)
> * newly created inode's mapping
> *
> */
> -struct inode *new_inode(struct super_block *sb)
> +struct inode *__new_inode(struct super_block *sb, int single)
> {
> /*
> * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
> @@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb)
> */
> struct inode * inode;
>
> - spin_lock_prefetch(&inode_lock);
> -
> inode = alloc_inode(sb);
> if (inode) {
> percpu_counter_inc(&nr_inodes);
> inode->i_state = 0;
> inode->i_ino = last_ino_get();
> - spin_lock(&inode_lock);
> - list_add(&inode->i_list, &inode_in_use);
> - list_add(&inode->i_sb_list, &sb->s_inodes);
> - spin_unlock(&inode_lock);
> + if (single) {
> + INIT_LIST_HEAD(&inode->i_list);
> + INIT_LIST_HEAD(&inode->i_sb_list);
> + } else {
> + spin_lock(&inode_lock);
> + list_add(&inode->i_list, &inode_in_use);
> + list_add(&inode->i_sb_list, &sb->s_inodes);
> + spin_unlock(&inode_lock);
> + }
> }
> return inode;
> }
>
> -EXPORT_SYMBOL(new_inode);
> +EXPORT_SYMBOL(__new_inode);
>
> void unlock_new_inode(struct inode *inode)
> {
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 4de6dd5..8c51a0d 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = {
>
> static struct inode * get_pipe_inode(void)
> {
> - struct inode *inode = new_inode(pipe_mnt->mnt_sb);
> + struct inode *inode = new_inode_single(pipe_mnt->mnt_sb);
> struct pipe_inode_info *pipe;
>
> if (!inode)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index a789346..a702d81 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1899,7 +1899,17 @@ extern void __iget(struct inode * inode);
> extern void iget_failed(struct inode *);
> extern void clear_inode(struct inode *);
> extern void destroy_inode(struct inode *);
> -extern struct inode *new_inode(struct super_block *);
> +extern struct inode *__new_inode(struct super_block *, int);
> +static inline struct inode *new_inode(struct super_block *sb)
> +{
> + return __new_inode(sb, 0);
> +}
> +static inline struct inode *new_inode_single(struct super_block *sb)
> +{
> + return __new_inode(sb, 1);
> +}
> +extern void iput_single(struct inode *);
> +
> extern int should_remove_suid(struct dentry *);
> extern int file_remove_suid(struct file *);
>
> diff --git a/net/socket.c b/net/socket.c
> index 353c928..4017409 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -464,7 +464,7 @@ static struct socket *sock_alloc(void)
> struct inode *inode;
> struct socket *sock;
>
> - inode = new_inode(sock_mnt->mnt_sb);
> + inode = new_inode_single(sock_mnt->mnt_sb);
> if (!inode)
> return NULL;
>
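
[NOTE: "socketallocbench -n 8", quoted in the numbers above, is Eric's
socket-allocation microbenchmark; its exact source is not reproduced
here. A minimal program in the same spirit (a hypothetical
reconstruction: n processes each allocating and closing sockets in a
tight loop) could look like this:]

#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/wait.h>

#define LOOPS 100000

int main(int argc, char **argv)
{
	int nproc = (argc > 1) ? atoi(argv[1]) : 8; /* loosely models -n */
	int i, j;

	for (i = 0; i < nproc; i++) {
		if (fork() == 0) {
			/*
			 * Each child hammers sock_alloc()/sock_release(),
			 * i.e. the new_inode()/iput() paths these patches
			 * optimize.
			 */
			for (j = 0; j < LOOPS; j++) {
				int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
				if (fd >= 0)
					close(fd);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;                               /* reap all children */
	return 0;
}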

2008-12-16 21:40:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd

On Thu, Dec 11, 2008 at 11:39:38PM +0100, Eric Dumazet wrote:
> Sockets, pipes and anonymous fds have interesting properties.
>
> Like other files, they use a dentry and an inode.
>
> But dentries for these kinds of files are not hashed into the dcache,
> since there is no way someone can look up such a file in the vfs tree.
> (/proc/{pid}/fd/{number} uses a different mechanism)
>
> Still, allocating and freeing such dentries is expensive, because we
> currently take dcache_lock inside d_alloc(), d_instantiate(), and dput().
> This lock is very contended on SMP machines.
>
> This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
> a single one (for sockets, pipes, anonymous fd), and a new
> d_alloc_single(const struct qstr *name, struct inode *inode)
> method, called by the three subsystems.
>
> Internally, dput() can take a fast path to dput_single() for
> SINGLE dentries. No more atomic_dec_and_lock()
> for such dentries.
>
>
> The differences between a SINGLE dentry and a normal one are:
>
> 1) A SINGLE dentry has the DCACHE_SINGLE flag
> 2) A SINGLE dentry's parent is itself (DCACHE_DISCONNECTED).
> This avoids taking a reference on the sb 'root' dentry, which is
> shared by too many dentries.
> 3) They are not hashed into the global hash table (DCACHE_UNHASHED)
> 4) Their d_alias list is empty
>
> ("socketallocbench -n 8" bench result : from 25s to 19.9s)

Acked-by: Paul E. McKenney <[email protected]>

> Signed-off-by: Eric Dumazet <[email protected]>
> ---
> fs/anon_inodes.c | 16 ------------
> fs/dcache.c | 51 +++++++++++++++++++++++++++++++++++++++
> fs/pipe.c | 23 +----------------
> include/linux/dcache.h | 9 ++++++
> net/socket.c | 24 +-----------------
> 5 files changed, 65 insertions(+), 58 deletions(-)
>
> diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
> index 3662dd4..8bf83cb 100644
> --- a/fs/anon_inodes.c
> +++ b/fs/anon_inodes.c
> @@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
> mnt);
> }
>
> -static int anon_inodefs_delete_dentry(struct dentry *dentry)
> -{
> - /*
> - * We faked vfs to believe the dentry was hashed when we created it.
> - * Now we restore the flag so that dput() will work correctly.
> - */
> - dentry->d_flags |= DCACHE_UNHASHED;
> - return 1;
> -}
> -
> static struct file_system_type anon_inode_fs_type = {
> .name = "anon_inodefs",
> .get_sb = anon_inodefs_get_sb,
> .kill_sb = kill_anon_super,
> };
> static struct dentry_operations anon_inodefs_dentry_operations = {
> - .d_delete = anon_inodefs_delete_dentry,
> };
>
> /**
> @@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
> this.name = name;
> this.len = strlen(name);
> this.hash = 0;
> - dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
> + dentry = d_alloc_single(&this, anon_inode_inode);
> if (!dentry)
> goto err_put_unused_fd;
>
> @@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
> atomic_inc(&anon_inode_inode->i_count);
>
> dentry->d_op = &anon_inodefs_dentry_operations;
> - /* Do not publish this dentry inside the global dentry hash table */
> - dentry->d_flags &= ~DCACHE_UNHASHED;
> - d_instantiate(dentry, anon_inode_inode);
>
> error = -ENFILE;
> file = alloc_file(anon_inode_mnt, dentry,
> diff --git a/fs/dcache.c b/fs/dcache.c
> index f463a81..af3bfb3 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry)
> */
>
> /*
> + * special version of dput() for pipes/sockets/anon.
> + * These dentries are not present in the hash table, so we can avoid
> + * taking/dirtying dcache_lock
> + */
> +static void dput_single(struct dentry *dentry)
> +{
> + struct inode *inode;
> +
> + if (!atomic_dec_and_test(&dentry->d_count))
> + return;
> + inode = dentry->d_inode;
> + if (inode)
> + iput(inode);
> + d_free(dentry);
> +}
> +
> +/*
> * dput - release a dentry
> * @dentry: dentry to release
> *
> @@ -234,6 +251,11 @@ void dput(struct dentry *dentry)
> {
> if (!dentry)
> return;
> + /*
> + * single dentries (sockets/pipes/anon) fast path
> + */
> + if (dentry->d_flags & DCACHE_SINGLE)
> + return dput_single(dentry);
>
> repeat:
> if (atomic_read(&dentry->d_count) == 1)
> @@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode)
> return res;
> }
>
> +/**
> + * d_alloc_single - allocate SINGLE dentry
> + * @name: dentry name, given in a qstr structure
> + * @inode: inode to allocate the dentry for
> + *
> + * Allocate a SINGLE dentry for the inode given. The inode is
> + * instantiated and returned. %NULL is returned if there is insufficient
> + * memory.
> + * - SINGLE dentries have themselves as a parent.
> + * - SINGLE dentries are not hashed into global hash table
> + * - their d_alias list is empty
> + */
> +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
> +{
> + struct dentry *entry;
> +
> + entry = d_alloc(NULL, name);
> + if (entry) {
> + entry->d_sb = inode->i_sb;
> + entry->d_parent = entry;
> + entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
> + entry->d_inode = inode;
> + fsnotify_d_instantiate(entry, inode);
> + security_d_instantiate(entry, inode);
> + }
> + return entry;
> +}
> +
> +
> static inline struct hlist_head *d_hash(struct dentry *parent,
> unsigned long hash)
> {
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 7aea8b8..4de6dd5 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode)
> }
>
> static struct vfsmount *pipe_mnt __read_mostly;
> -static int pipefs_delete_dentry(struct dentry *dentry)
> -{
> - /*
> - * At creation time, we pretended this dentry was hashed
> - * (by clearing DCACHE_UNHASHED bit in d_flags)
> - * At delete time, we restore the truth : not hashed.
> - * (so that dput() can proceed correctly)
> - */
> - dentry->d_flags |= DCACHE_UNHASHED;
> - return 0;
> -}
>
> /*
> * pipefs_dname() is called from d_path().
> @@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
> }
>
> static struct dentry_operations pipefs_dentry_operations = {
> - .d_delete = pipefs_delete_dentry,
> .d_dname = pipefs_dname,
> };
>
> @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
> struct inode *inode;
> struct file *f;
> struct dentry *dentry;
> - struct qstr name = { .name = "" };
> + static const struct qstr name = { .name = "" };
>
> err = -ENFILE;
> inode = get_pipe_inode();
> @@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags)
> goto err;
>
> err = -ENOMEM;
> - dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
> + dentry = d_alloc_single(&name, inode);
> if (!dentry)
> goto err_inode;
>
> dentry->d_op = &pipefs_dentry_operations;
> - /*
> - * We dont want to publish this dentry into global dentry hash table.
> - * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
> - * This permits a working /proc/$pid/fd/XXX on pipes
> - */
> - dentry->d_flags &= ~DCACHE_UNHASHED;
> - d_instantiate(dentry, inode);
>
> err = -ENFILE;
> f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index a37359d..ca8d269 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -176,6 +176,14 @@ d_iput: no no no yes
> #define DCACHE_UNHASHED 0x0010
>
> #define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched */
> +#define DCACHE_SINGLE 0x0040
> + /*
> + * socket, pipe or anonymous fd dentry
> + * - SINGLE dentries have themselves as a parent.
> + * - SINGLE dentries are not hashed into global hash table
> + * - Their d_alias list is empty
> + * - They don't need dcache_lock synchronization
> + */
>
> extern spinlock_t dcache_lock;
> extern seqlock_t rename_lock;
> @@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *);
> extern void shrink_dcache_parent(struct dentry *);
> extern void shrink_dcache_for_umount(struct super_block *);
> extern int d_invalidate(struct dentry *);
> +extern struct dentry *d_alloc_single(const struct qstr *, struct inode *);
>
> /* only used at mount-time */
> extern struct dentry * d_alloc_root(struct inode *);
> diff --git a/net/socket.c b/net/socket.c
> index 92764d8..353c928 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -308,18 +308,6 @@ static struct file_system_type sock_fs_type = {
> .kill_sb = kill_anon_super,
> };
>
> -static int sockfs_delete_dentry(struct dentry *dentry)
> -{
> - /*
> - * At creation time, we pretended this dentry was hashed
> - * (by clearing DCACHE_UNHASHED bit in d_flags)
> - * At delete time, we restore the truth : not hashed.
> - * (so that dput() can proceed correctly)
> - */
> - dentry->d_flags |= DCACHE_UNHASHED;
> - return 0;
> -}
> -
> /*
> * sockfs_dname() is called from d_path().
> */
> @@ -330,7 +318,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
> }
>
> static struct dentry_operations sockfs_dentry_operations = {
> - .d_delete = sockfs_delete_dentry,
> .d_dname = sockfs_dname,
> };
>
> @@ -372,20 +359,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
> static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
> {
> struct dentry *dentry;
> - struct qstr name = { .name = "" };
> + static const struct qstr name = { .name = "" };
>
> - dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
> + dentry = d_alloc_single(&name, SOCK_INODE(sock));
> if (unlikely(!dentry))
> return -ENOMEM;
>
> dentry->d_op = &sockfs_dentry_operations;
> - /*
> - * We dont want to push this dentry into global dentry hash table.
> - * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
> - * This permits a working /proc/$pid/fd/XXX on sockets
> - */
> - dentry->d_flags &= ~DCACHE_UNHASHED;
> - d_instantiate(dentry, SOCK_INODE(sock));
>
> sock->file = file;
> init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

2008-12-17 20:27:37

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

Christoph Lameter wrote:
> On Fri, 12 Dec 2008, Eric Dumazet wrote:
>
>>> a truly allocated file. At this point the file is
>>> a truly allocated file but no longer ours.
>
> It's a valid file. Does ownership matter here?
>
>> Reading this mail again, I realise we call put_filp(file), while this
>> should be either fput(file) or put_filp(file); we don't know which.
>>
>> Damned, this patch is wrong as is.
>>
>> Christoph, Paul, do you see the problem?
>
> Yes.
>
>> In fget()/fget_light() we don't know if the other thread (the one that
>> re-allocated the file and tried to close it while we held a reference on
>> it) had to call put_filp() or fput() to release its own reference. So we
>> call atomic_long_dec_and_test() and cannot take the appropriate action
>> (calling the full __fput() version, or the small one that some subsystems
>> use to 'close' a file that was never really opened).
>
> The difference is mainly that fput() does full processing whereas
> put_filp() is used when we know that the file was not fully operational.
> If the checks in __fput() are able to handle the put_filp() situation by
> not releasing resources that were not allocated, then we should be fine.
>
>> I believe put_filp() is only called on the slow path (error cases).
>
> Looks like it. It seems to assume that no dentry is associated.
>
>> Should we just zap it and always call fput()?
>
> Only if fput() can handle partially setup files.

It can do that if we add a check for a NULL dentry in __fput(), so put_filp() can disappear.

But there is one remaining place where we do an atomic_long_dec_and_test(&...->f_count):
in fs/aio.c, in function __aio_put_req(). This one is tricky :(
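
[NOTE: For reference, the lockless lookup that creates this fput() vs
put_filp() ambiguity follows the classic SLAB_DESTROY_BY_RCU pattern.
The sketch below is a simplified illustration written against the
2.6.28-era VFS helpers (fcheck_files(), atomic_long_inc_not_zero()),
not the actual patch text:]

static struct file *fget_rcu_sketch(unsigned int fd)
{
	struct files_struct *files = current->files;
	struct file *file;

	rcu_read_lock();
	file = fcheck_files(files, fd);
	if (file) {
		if (!atomic_long_inc_not_zero(&file->f_count)) {
			/* Raced with the final reference drop. */
			file = NULL;
		} else if (fcheck_files(files, fd) != file) {
			/*
			 * The slab object was freed and re-allocated under
			 * us: we pinned somebody else's file. This is the
			 * reference Eric is worried about above: we must
			 * release it, but we cannot tell whether the
			 * concurrent closer would have used fput() or
			 * put_filp().
			 */
			fput(file);
			file = NULL;
		}
	}
	rcu_read_unlock();
	return file;
}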