Discussion:
ixg(4) performances
Emmanuel Dreyfus
2014-08-26 12:17:28 UTC
Permalink
Hi

ixg(4) has poor performance, even on latest -current. Here is the
dmesg output:
ixg1 at pci5 dev 0 function 1: Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.3.10
ixg1: clearing prefetchable bit
ixg1: interrupting at ioapic0 pin 9
ixg1: PCI Express Bus: Speed 2.5Gb/s Width x8

The interface is configured with:
ifconfig ixg1 mtu 9000 tso4 ip4csum tcp4csum-tx udp4csum-tx

And sysctl:
kern.sbmax = 67108864
kern.somaxkva = 67108864
net.inet.udp.sendspace = 2097152
net.inet.udp.recvspace = 2097152
net.inet.tcp.sendspace = 2097152
net.inet.tcp.recvspace = 2097152
net.inet.tcp.recvbuf_auto = 0
net.inet.tcp.sendbuf_auto = 0

netperf shows a maximum throughput of 2.3 Gb/s. That leaves me with
the feeling that only one PCI lane is used. Is that possible?

I also found this page that tackles the same problem on Linux:
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe

They tweak the PCI MMRBC. Does anyone have an idea of how this could be
done on NetBSD? I thought about borrowing code from src/sys/dev/pci/if_dge.c,
but I am not sure which pci_conf_read()/pci_conf_write() calls should be used.
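For illustration, the kind of read/modify/write I have in mind, as a rough
sketch only (REG, MASK and VALUE are placeholders, not values taken from the
82599 documentation; pa is the usual struct pci_attach_args):

/* dword-aligned read/modify/write of a PCI config register,
 * e.g. from a driver's attach function */
pcireg_t reg;

reg = pci_conf_read(pa->pa_pc, pa->pa_tag, REG);
reg = (reg & ~MASK) | VALUE;
pci_conf_write(pa->pa_pc, pa->pa_tag, REG, reg);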

Any other ideas on how to improve performance?
--
Emmanuel Dreyfus
***@netbsd.org
Christos Zoulas
2014-08-26 12:57:37 UTC
Permalink
Post by Emmanuel Dreyfus
Hi
ixgb(4) has poor performances, even on latest -current. Here is the
ixg1 at pci5 dev 0 function 1: Intel(R) PRO/10GbE PCI-Express Network
Driver, Version - 2.3.10
ixg1: clearing prefetchable bit
ixg1: interrupting at ioapic0 pin 9
ixg1: PCI Express Bus: Speed 2.5Gb/s Width x8
ifconfig ixg1 mtu 9000 tso4 ip4csum tcp4csum-tx udp4csum-tx
kern.sbmax = 67108864
kern.somaxkva = 67108864
net.inet.udp.sendspace = 2097152
net.inet.udp.recvspace = 2097152
net.inet.tcp.sendspace = 2097152
net.inet.tcp.recvspace = 2097152
net.inet.tcp.recvbuf_auto = 0
net.inet.tcp.sendbuf_auto = 0
netperfs shows a maximum performance of 2.3 Gb/s. That let me with
the feeling that only a PCI lane is used. Is it possible?
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe
They tweak the PCI MMRBC. Anyone has an idea of how it could be
done on NetBSD? I thought about borrowing code from src/sys/dec/pci/if_dge.c
but I am not sure what pci_conf_read/pci_conf_write commands should be used.
Any other idea on how to improve performance?
ftp://ftp.supermicro.com/CDR-C2_1.20_for_Intel_C2_platform/Intel/LAN/v15.5/PROXGB/DOCS/SERVER/prform10.htm#Setting_MMRBC
Emmanuel Dreyfus
2014-08-26 14:23:19 UTC
Permalink
Post by Christos Zoulas
ftp://ftp.supermicro.com/CDR-C2_1.20_for_Intel_C2_platform/Intel/LAN/v15.5/PROXGB/DOCS/SERVER/prform10.htm#Setting_MMRBC
Right, but NetBSD has no tool like Linux's setpci to tweak MMRBC, and if
the BIOS has no setting for it, NetBSD is screwed.

I see <dev/pci/pciio.h> has a PCI_IOC_CFGREAD / PCI_IOC_CFGWRITE ioctl,
does that mean Linux's setpci can be easily reproduced?
--
Emmanuel Dreyfus
***@netbsd.org
Christos Zoulas
2014-08-26 14:25:52 UTC
Permalink
On Aug 26, 2:23pm, ***@netbsd.org (Emmanuel Dreyfus) wrote:
-- Subject: Re: ixg(4) performances

| On Tue, Aug 26, 2014 at 12:57:37PM +0000, Christos Zoulas wrote:
| > ftp://ftp.supermicro.com/CDR-C2_1.20_for_Intel_C2_platform/Intel/LAN/v15.5/PROXGB/DOCS/SERVER/prform10.htm#Setting_MMRBC
|
| Right, but NetBSD has no tool like Linux's setpci to tweak MMRBC, and if
| the BIOS has no setting for it, NetBSD is screwed.
|
| I see <dev/pci/pciio.h> has a PCI_IOC_CFGREAD / PCI_IOC_CFGWRITE ioctl,
| does that means Linux's setpci can be easily reproduced?

I would probably extend pcictl with cfgread and cfgwrite commands.

christos
Emmanuel Dreyfus
2014-08-26 14:42:55 UTC
Permalink
Post by Christos Zoulas
I would probably extend pcictl with cfgread and cfgwrite commands.
Sure, once it works I can do that, but a first attempt just
gets EINVAL; any idea what could be wrong?

int fd;
struct pciio_bdf_cfgreg pbcr;

if ((fd = open("/dev/pci5", O_RDWR, 0)) == -1)
err(EX_OSERR, "open /dev/pci5 failed");

pbcr.bus = 5;
pbcr.device = 0;
pbcr.function = 0;
pbcr.cfgreg.reg = 0xe6b;
pbcr.cfgreg.val = 0x2e;

if (ioctl(fd, PCI_IOC_BDF_CFGWRITE, &pbcr) == -1)
err(EX_OSERR, "ioctl failed");

Inside the kernel, the only EINVAL is here:
if (bdfr->bus > 255 || bdfr->device >= sc->sc_maxndevs ||
bdfr->function > 7)
return EINVAL;
--
Emmanuel Dreyfus
***@netbsd.org
Christos Zoulas
2014-08-26 15:13:50 UTC
Permalink
On Aug 26, 2:42pm, ***@netbsd.org (Emmanuel Dreyfus) wrote:
-- Subject: Re: ixg(4) performances

| On Tue, Aug 26, 2014 at 10:25:52AM -0400, Christos Zoulas wrote:
| > I would probably extend pcictl with cfgread and cfgwrite commands.
|
| Sure, once it works I can do that, but a first attempt just
| ets EINVAL, any idea what can be wrong?
|
| int fd;
| struct pciio_bdf_cfgreg pbcr;
|
| if ((fd = open("/dev/pci5", O_RDWR, 0)) == -1)
| err(EX_OSERR, "open /dev/pci5 failed");
|
| pbcr.bus = 5;
| pbcr.device = 0;
| pbcr.function = 0;
| pbcr.cfgreg.reg = 0xe6b;
| pbcr.cfgreg.val = 0x2e;

I think in the example that was 0xe6. I think the .b means byte access
(I am guessing). I think that we are only doing word accesses, thus
we probably need to read, mask, modify and write the byte. I have not
verified any of that, these are guesses... Look at the pcictl source
code.

|
| if (ioctl(fd, PCI_IOC_BDF_CFGWRITE, &pbcr) == -1)
| err(EX_OSERR, "ioctl failed");
|
| Inside the kernel, the only EINVAL is here:
| if (bdfr->bus > 255 || bdfr->device >= sc->sc_maxndevs ||
| bdfr->function > 7)
| return EINVAL;
|
| --
| Emmanuel Dreyfus
| ***@netbsd.org
-- End of excerpt from Emmanuel Dreyfus
Emmanuel Dreyfus
2014-08-26 15:43:05 UTC
Permalink
Post by Christos Zoulas
I think in the example that was 0xe6. I think the .b means byte access
(I am guessing).
Yes, I came to that conclusion reading pciutils sources. I discovered
they also had a man page explaining that :-)
Post by Christos Zoulas
I think that we are only doing word accesses, thus
we probably need to read, mask modify write the byte. I have not
verified any of that, these are guesses... Look at the pcictl source
code.
I tried writing at register 0xe4, but when reading it back it is still 0.

if (pcibus_conf_read(fd, 5, 0, 1, 0x00e4, &val) != 0)
err(EX_OSERR, "pcibus_conf_read failed");

printf("reg = 0x00e4, val = 0x%08x\n", val);

val = (val & 0xff00ffff) | 0x002e0000;

if (pcibus_conf_write(fd, 5, 0, 1, 0x00e4, val) != 0)
err(EX_OSERR, "pcibus_conf_write failed");
--
Emmanuel Dreyfus
***@netbsd.org
Taylor R Campbell
2014-08-26 15:51:24 UTC
Permalink
Date: Tue, 26 Aug 2014 14:42:55 +0000
Post by Christos Zoulas
I would probably extend pcictl with cfgread and cfgwrite commands.
Sure, once it works I can do that, but a first attempt just
ets EINVAL, any idea what can be wrong?
...
pbcr.bus = 5;
pbcr.device = 0;
pbcr.function = 0;
pbcr.cfgreg.reg = 0xe6b;
pbcr.cfgreg.val = 0x2e;

Can't do unaligned register reads/writes. If you need other than
32-bit access, you need to select subwords for reads or do R/M/W for
writes.
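For the byte at 0xe6, that would look roughly like this (untested, and
assuming the matching PCI_IOC_BDF_CFGREAD ioctl returns the current value
in cfgreg.val):

/* the byte at 0xe6 lives in bits 23:16 of the aligned dword at 0xe4 */
pbcr.cfgreg.reg = 0xe4;
if (ioctl(fd, PCI_IOC_BDF_CFGREAD, &pbcr) == -1)
	err(EX_OSERR, "cfgread failed");
pbcr.cfgreg.val = (pbcr.cfgreg.val & ~(0xffU << 16)) | (0x2eU << 16);
if (ioctl(fd, PCI_IOC_BDF_CFGWRITE, &pbcr) == -1)
	err(EX_OSERR, "cfgwrite failed");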

Inside the kernel, the only EINVAL is here:
if (bdfr->bus > 255 || bdfr->device >= sc->sc_maxndevs ||
bdfr->function > 7)
return EINVAL;

Old kernel sources? I added a check recently for 32-bit alignment --
without which you'd hit a kassert or hardware trap shortly afterward.
Taylor R Campbell
2014-08-26 15:40:41 UTC
Permalink
Date: Tue, 26 Aug 2014 10:25:52 -0400
From: ***@zoulas.com (Christos Zoulas)

On Aug 26, 2:23pm, ***@netbsd.org (Emmanuel Dreyfus) wrote:
-- Subject: Re: ixg(4) performances

| I see <dev/pci/pciio.h> has a PCI_IOC_CFGREAD / PCI_IOC_CFGWRITE ioctl,
| does that means Linux's setpci can be easily reproduced?

I would probably extend pcictl with cfgread and cfgwrite commands.

How about the attached patch? I've been sitting on this for months.
Taylor R Campbell
2014-08-26 16:40:25 UTC
Permalink
Date: Tue, 26 Aug 2014 15:40:41 +0000
From: Taylor R Campbell <***@NetBSD.org>

How about the attached patch? I've been sitting on this for months.

New version with some changes suggested by ***@.
Emmanuel Dreyfus
2014-08-27 07:48:31 UTC
Permalink
Post by Taylor R Campbell
How about the attached patch? I've been sitting on this for months.
Both changes seem fine, but the board does not behave as described by the
Linux crowd. At 0xe6 there is a null value where we should have 0x22,
and attempts to change it do not seem to have any effect.
--
Emmanuel Dreyfus
***@netbsd.org
Emmanuel Dreyfus
2014-08-28 07:28:32 UTC
Permalink
Does anyone have an objection to this change being committed and pulled up
to netbsd-7?
--
Emmanuel Dreyfus
***@netbsd.org
Christos Zoulas
2014-08-28 14:44:09 UTC
Permalink
Post by Emmanuel Dreyfus
Anyone has objection to this change being committed and pulled up to
netbsd-7?
Not me.

christos
David Young
2014-08-26 17:44:43 UTC
Permalink
Post by Christos Zoulas
-- Subject: Re: ixg(4) performances
| > ftp://ftp.supermicro.com/CDR-C2_1.20_for_Intel_C2_platform/Intel/LAN/v15.5/PROXGB/DOCS/SERVER/prform10.htm#Setting_MMRBC
|
| Right, but NetBSD has no tool like Linux's setpci to tweak MMRBC, and if
| the BIOS has no setting for it, NetBSD is screwed.
|
| I see <dev/pci/pciio.h> has a PCI_IOC_CFGREAD / PCI_IOC_CFGWRITE ioctl,
| does that means Linux's setpci can be easily reproduced?
I would probably extend pcictl with cfgread and cfgwrite commands.
Emmanuel,

Most (all?) configuration registers are read/write. Have you read the
MMRBC and found that it's improperly configured?

Are you sure that you don't have to program the MMRBC at every bus
bridge between the NIC and RAM? I'm not too familiar with PCI Express,
so I really don't know.

Have you verified the information at
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe with the 82599
manual? I have tried to corroborate the information both with my PCI
Express book and with the 82599 manual, but I cannot make a match.
PCI-X != PCI Express; maybe ixgb != ixgbe? (It sure looks like they're
writing about an 82599, but maybe they don't know what they're writing
about!)


Finally, adding cfgread/cfgwrite commands to pcictl seems like a step in
the wrong direction. I know that this is UNIX and we're duty-bound to
give everyone enough rope, but may we reconsider our assisted-suicide
policy just this one time? :-)

How well has blindly poking configuration registers worked for us in
the past? I can think of a couple of instances where a knowledgeable
developer thought that they were writing a helpful value to a useful
register and getting a desirable result, but in the end it turned out to
be a no-op. In one case, it was an Atheros WLAN adapter where somebody
added to Linux some code that wrote to a mysterious PCI configuration
register, and then some of the *BSDs copied it. In the other case, I
think that somebody used pci_conf_write() to write a magic value to a
USB host controller register that wasn't on a 32-bit boundary. ISTR
that some incorrect value was written, instead.

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981
matthew green
2014-08-26 18:50:58 UTC
Permalink
Post by David Young
Finally, adding cfgread/cfgwrite commands to pcictl seems like a step in
the wrong direction. I know that this is UNIX and we're duty-bound to
give everyone enough rope, but may we reconsider our assisted-suicide
policy just this one time? :-)
How well has blindly poking configuration registers worked for us in
the past? I can think of a couple of instances where an knowledgeable
developer thought that they were writing a helpful value to a useful
register and getting a desirable result, but in the end it turned out to
be a no-op. In one case, it was an Atheros WLAN adapter where somebody
added to Linux some code that wrote to a mysterious PCI configuration
register, and then some of the *BSDs copied it. In the other case, I
think that somebody used pci_conf_write() to write a magic value to a
USB host controller register that wasn't on a 32-bit boundary. ISTR
that some incorrect value was written, instead.
pciutils' "setpci" utility has exposed this for lots of systems for
years. i don't see any value in keeping pcictl from being as usable
as other tools, and as you say, this is unix - rope and all.


.mrg.
Hisashi T Fujinaka
2014-08-26 21:36:55 UTC
Permalink
Post by David Young
How well has blindly poking configuration registers worked for us in
the past?
Well, with the part he's using (the 82599, I think) it shouldn't be that
blind. The datasheet has all the registers listed, which is the case for
most of Intel's Ethernet controllers.
--
Hisashi T Fujinaka - ***@twofifty.com
BSEE(6/86) + BSChem(3/95) + BAEnglish(8/95) + MSCS(8/03) + $2.50 = latte
Taylor R Campbell
2014-08-27 03:16:14 UTC
Permalink
Date: Tue, 26 Aug 2014 12:44:43 -0500
From: David Young <***@pobox.com>

Finally, adding cfgread/cfgwrite commands to pcictl seems like a step in
the wrong direction. I know that this is UNIX and we're duty-bound to
give everyone enough rope, but may we reconsider our assisted-suicide
policy just this one time? :-)

It's certainly wrong to rely on pcictl to read and write config
registers, but it's useful as a debugging tool and for driver
development -- just like the rest of pcictl.
Emmanuel Dreyfus
2014-08-28 07:26:44 UTC
Permalink
Post by Emmanuel Dreyfus
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe
It seems that page describes a slightly different model.
The Intel 82599 datasheet is available here:
http://www.intel.fr/content/www/fr/fr/ethernet-controllers/82599-10-gbe-controller-datasheet.html

There is no reference to MMRBC in this document, but I understand "Max Read
Request Size" is the same thing. Page 765 tells us about register A8, bits
12-14, which should be set to 100.
pcictl /dev/pci5 read -d 0 -f 1 0x18 tells me the value 0x00092810

I tried this command:
pcictl /dev/pci5 write -d 0 -f 1 0x18 0x00094810

A further pcictl read suggests it worked, as the new value is returned.
However, it gives no performance improvement. This means that I
misunderstood what this register is about, or how to change it (byte order?).

Or the performance is constrained by something unrelated. In the blog
post cited above, the poster achieved more than 5 Gb/s before touching
MMRBC, while I am stuck at 2.7 Gb/s. Any new ideas welcome.
--
Emmanuel Dreyfus
***@netbsd.org
Stephan
2014-08-28 08:25:41 UTC
Permalink
What is your test setup? Do you have 2 identical boxes?

Does it perform better e.g. on Linux or FreeBSD? If so, you could
check how the config registers get set by that particular OS.
Post by Emmanuel Dreyfus
Post by Emmanuel Dreyfus
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe
It seems that page describe a slightly different model.
http://www.intel.fr/content/www/fr/fr/ethernet-controllers/82599-10-gbe-controller-datasheet.html
No reference to MMRBC in this document, but I understand "Max Read Request
Size" is the same thing. Page 765 tells us about register A8, bits 12-14
that should be set to 100.
pcictl /dev/pci5 read -d 0 -f 1 0x18 tells me the value 0x00092810
pcictl /dev/pci5 write -d 0 -f 1 0x18 0x00094810
Further pcictl read suggests it works as the new value is returned.
However it gives no performance improvement. This means that I
misunderstood what this register is about, or how to change it (byte order?).
Or the performance are constrained by something unrelated. In the blog
post cited above, the poster acheived more than 5 Gb/s before touching
MMRBC, while I am stuck at 2,7 GB/s. Any new idea welcome.
--
Emmanuel Dreyfus
Terry Moore
2014-08-28 11:48:37 UTC
Permalink
Post by Emmanuel Dreyfus
Or the performance are constrained by something unrelated. In the blog
post cited above, the poster acheived more than 5 Gb/s before touching
MMRBC, while I am stuck at 2,7 GB/s. Any new idea welcome.
The blog post refers to PCI-X, I'm more familiar with PCIe, but the concepts
are similar.

There are several possibilities, all revolving around differences between the
blog poster's base system and yours.

1) the test case is using a platform that has better PCI performance (in the
PCIe world this could be: Gen3 versus Gen2 support in the slot being used;
more lanes in the slot being used)

2) the test case has a root complex with a PCI controller with better
performance than the one in your system;

3) the test case system has a different PCI configuration, in particular
different bridging. For example, a PCI bridge or switch on your platform
can change basic capabilities compared to the reference.

4) related to 3: one of the bridges on your system (between ixg and root
complex) is not configured for 4K reads, and so the setting on the ixg board
won't help [whereas this wasn't the case on the blog system].

5) related to 4: one of the bridges in your system (between ixg and root
complex) is not capable of 4K reads... (see 4).

And of course you have to consider:

6) the writer has something else different than you have, for example
silicon rev, BIOS, PCI-X where you have PCIe, etc.
7) the problem is completely unrelated to PCIe.

You're in a tough situation, experimentally, because you can't take a
working (5 Gbps) system and directly compare to the non-working (2.7 Gbps)
situation.

--Terry
Emmanuel Dreyfus
2014-08-29 03:54:53 UTC
Permalink
Post by Terry Moore
There are several possibilities, all revolving about differences between the
blog poster's base system and yorus.
Is there a way to investigate whether the PCI setup is appropriate? Here is
what dmesg says about it:

pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
ppb4 at pci0 dev 14 function 0: vendor 0x10de product 0x005d (rev. 0xa3)
ppb4: PCI Express 1.0 <Root Port of PCI-E Root Complex>
pci5 at ppb4 bus 5
pci5: i/o space, memory space enabled, rd/line, wr/inv ok
ixg1 at pci5 dev 0 function 1: Intel(R) PRO/10GbE PCI-Express Network
Driver, Version - 2.3.10
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
***@netbsd.org
Terry Moore
2014-08-29 12:48:51 UTC
Permalink
-----Original Message-----
Behalf Of Emmanuel Dreyfus
Sent: Thursday, August 28, 2014 23:55
To: Terry Moore; 'Christos Zoulas'
Subject: Re: ixg(4) performances
Post by Terry Moore
There are several possibilities, all revolving about differences
between the blog poster's base system and yorus.
Do I have a way to investigate for appropriate PCI setup? Here is what
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
ppb4 at pci0 dev 14 function 0: vendor 0x10de product 0x005d (rev. 0xa3)
ppb4: PCI Express 1.0 <Root Port of PCI-E Root Complex>
pci5 at ppb4 bus 5
pci5: i/o space, memory space enabled, rd/line, wr/inv ok
ixg1 at pci5 dev 0 function 1: Intel(R) PRO/10GbE PCI-Express Network
Driver, Version - 2.3.10
I don't do PCIe on NetBSD -- these days we use it exclusively as VM guests --
so I don't know what tools are available. Normally when doing this kind of
thing I poke around with a debugger or the equivalent of pcictl.

The dmesg output tells us that your ixg is directly connected to an Nvidia
root complex. So there are no bridges, but this might be a relevant
difference to the benchmark system. It's more common to be connected to an
Intel southbridge chip of some kind.

Next step would be to check the documentation on, and the configuration of,
the root complex -- it must also be configured for 4K read ahead (because the
read will launch from the ixg, be buffered in the root complex, and forwarded
to the memory controller, and then the answers will come back).

(PCIe is very fast at the bit transfer level, but pretty slow in terms of
read transfers per second. Read transfer latency is on the order of 2.5
microseconds / operation. This is why 4K transfers are so important in this
application.)
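To put rough numbers on that (a back-of-envelope sketch, assuming a single
outstanding read and the ~2.5 us figure above):

#include <stdio.h>

/* throughput of back-to-back reads with one request in flight */
int main(void)
{
	double latency = 2.5e-6;	/* seconds per read operation */

	printf("128-byte reads: %.1f Gb/s\n",  128 / latency * 8 / 1e9);
	printf("4K-byte reads:  %.1f Gb/s\n", 4096 / latency * 8 / 1e9);
	return 0;
}

With 4K reads a single request stream is already in the 10G ballpark; with
small reads it is nowhere close.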

Anyway, there are multiple vendors involved (Intel, Nvidia, your BIOS maker,
because the BIOS is responsible for setting things like the read size to the
maximum across the bus -- I'm speaking loosely, but basically config
software has to set things up because the individual devices don't have
enough knowledge). So generally that may explain things.

Still, you should check whether you have the right number of the right
generation of PCIe lanes connected to the ixg. If you look at the manual,
normally there's an obscure register that tells you how many lanes are
connected, and what generation. On the motherboards we use, each slot is
different, and it's not always obvious how the slots differ. Rather than
depending on documentation and the good intentions of the motherboard
developers, I always feel better looking at what the problem chip in
question thinks about number of lanes and speeds.

Hope this helps,
--Terry
Emmanuel Dreyfus
2014-08-29 15:51:14 UTC
Permalink
Post by Terry Moore
Still, you should check whether you have the right number of the right
generation of PCIe lanes connected to the ixg.
I found this, but the result does not make sense: negotiated > max ...

Link Capabilities Register (0xAC): 0x00027482
bits 3:0 Supported Link speed: 0010 = 5 GbE and 2.5 GbE speed supported
bits 9:4 Max link width: 001000 = x4
bits 14:12 L0s exit latency: 101 = 1 µs - 2 µs
bits 17:15 L1 Exit latency: 011 = 4 µs - 8 µs

Link Status Register (0xB2): 0x1081
bits 3:0 Current Link speed: 0001 = 2.5 GbE PCIe link
bits 9:4 Negotiated link width: 001000 = x8
--
Emmanuel Dreyfus
***@netbsd.org
Terry Moore
2014-08-29 16:22:31 UTC
Permalink
Post by Emmanuel Dreyfus
Post by Terry Moore
Still, you should check whether you have the right number of the right
generation of PCIe lanes connected to the ixg.
I found this, but the result does not make sense: negociated > max ...
Link Capabilities Ragister (0xAC): 0x00027482
bits 3:0 Supprted Link speed: 0010 = 5 GbE and 2.5 GbE speed supported
bits 9:4 Max link width: 001000 = x4
bits 14:12 L0s exit lattency: 101 = 1 µs - 2 µs bits 17:15 L1 Exit
lattency: 011 = 4 µs - 8 µs
Link Status Register (0xB2): 0x1081
bits 3:0 Current Link speed: 0001 = 2.5 GbE PCIe link
bits 9:4 Negociated link width: 001000 = x8
I think there's a typo in the docs. In the PCIe spec, it says (for Link
Capabilities Register, table 7-15): 001000b is x8.

But it's running at gen1. I strongly suspect that the benchmark case was
gen2 (since the ixg is capable of it).

Is the ixg in an expansion slot or integrated onto the main board?

--Terry
Emmanuel Dreyfus
2014-08-29 17:10:44 UTC
Permalink
Post by Terry Moore
But it's running at gen1. I strongly suspect that the benchmark case was
gen2 (since the ixg is capable of it).
gen1 vs gen2 is 2.5 Gb/s vs 5 Gb/s?
Post by Terry Moore
Is the ixg in an expansion slot or integrated onto the main board?
In a slot.
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
***@netbsd.org
Hisashi T Fujinaka
2014-08-29 17:11:50 UTC
Permalink
Post by Emmanuel Dreyfus
Post by Terry Moore
But it's running at gen1. I strongly suspect that the benchmark case was
gen2 (since the ixg is capable of it).
gen1 vs gen2 is 2.5 Gb.s bs 5 Gb/s?
Gen 1 is capable of only 2.5GT/s (gigatransfers per second). Gen 2 is
capable of up to 5, but isn't guaranteed to be 5. Depending on how
chatty the device is on the PCIe bus, I think 2.5GT/s is enough for
something much closer to line rate than you're getting.
--
Hisashi T Fujinaka - ***@twofifty.com
BSEE + BSChem + BAEnglish + MSCS + $2.50 = coffee
Terry Moore
2014-08-30 23:46:19 UTC
Permalink
Forgot to cc the list.

-----Original Message-----
From: Terry Moore [mailto:***@mcci.com]
Sent: Friday, August 29, 2014 15:13
To: 'Emmanuel Dreyfus'
Subject: RE: ixg(4) performances
Post by Terry Moore
But it's running at gen1. I strongly suspect that the benchmark case was
gen2 (since the ixg is capable of it).
gen1 vs gen2 is 2.5 Gb.s vs 5 Gb/s?
Yes. Actually, 2.5Gbps is symbol rate -- it's 8/10 encoded, so one lane is
really 2Gbps. So 8 lanes is 16Gbps which *should* be enough, but... there's
overhead and a variety of sources of wastage.

I just saw today a slide that says that 8 lanes gen 1 is just barely enough
for 10Gb Eth
(http://www.eetimes.com/document.asp?doc_id=1323695&page_number=4).

It's possible that the benchmark system was using 8 lanes of Gen2.
No reference to MMRBC in this document, but I understand "Max Read Request
Size"
is the same thing. Page 765 tells us about register A8, bits 12-14
that should be set to 100.
pcictl /dev/pci5 read -d 0 -f 1 0x18 tells me the value 0x00092810
pcictl /dev/pci5 write -d 0 -f 1 0x18 0x00094810
In the PCIe spec, this is controlled via the Device Control register which
is a 16-bit value. You want to set *two* fields, to different values. 0x18
looks like the wrong offset. The PCIe spec says offset 0x08, but that's
relative to the base of the capability structure, the offset of which is in
the low byte of the dword at 0x34. I am running NetBSD 5; my pcictl doesn't
support write as one of its options, but I'd expect that to be relative to
the base of the function config space, and *that's not the device
capabilities register*. It's a read/write register, which is one of the Base
Address Registers.

In any case, in the Device Control Register, bits 7:5 are the max payload
size *for writes* by the ixg to the system. These must be set to 101b for
4K max payload size.

Similarly, bits 14:12 are the max read request size. These must also be set
to 101b for 4K max read request.
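In C terms, the update would be something like this (a sketch only, using the
bit positions above; read_device_control()/write_device_control() are
placeholders for however you do the 16-bit config access):

/* set both fields to 101b in the Device Control value */
uint16_t devctl;

devctl = read_device_control();			/* placeholder: 16-bit read */
devctl &= ~((0x7U << 5) | (0x7U << 12));	/* clear bits 7:5 and 14:12 */
devctl |= (0x5U << 5) | (0x5U << 12);		/* 101b = 4K in both fields */
write_device_control(devctl);			/* placeholder: write back */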

Since you did a dword read, the extra 0x9.... is the device status register.
This makes me suspicious as the device status register is claiming that you
have "unsupported request detected)" [bit 3] and "correctable error
detected" [bit 0]. Further, this register is RW1C for all these bits -- so
when you write 94810, it should have cleared the 9 (so a subsequent read
should have returned 4810).

Please check.

Might be good to post a "pcictl dump" of your device, just to expose all the
details.
Post by Terry Moore
Is the ixg in an expansion slot or integrated onto the main board?
In a slot.
Check the manual on the main board and find out whether other slots have 8
lanes of Gen2.

If so, move the board.

Best regards,
--Terry
Hisashi T Fujinaka
2014-08-31 01:29:19 UTC
Permalink
Doesn't anyone read my posts or, more important, the PCIe spec?

2.5 Giga TRANSFERS per second.
Post by Terry Moore
Forgot to cc the list.
-----Original Message-----
Sent: Friday, August 29, 2014 15:13
To: 'Emmanuel Dreyfus'
Subject: RE: ixg(4) performances
Post by Terry Moore
But it's running at gen1. I strongly suspect that the benchmark case was
gen2 (since the ixg is capable of it).
gen1 vs gen2 is 2.5 Gb.s vs 5 Gb/s?
Yes. Actually, 2.5Gbps is symbol rate -- it's 8/10 encoded, so one lane is
really 2Gbps. So 8 lanes is 16Gbps which *should* be enough, but... there's
overhead and a variety of sources of wastage.
I just saw today a slide that says that 8 lanes gen 1 is just barely enough
for 10Gb Eth
(http://www.eetimes.com/document.asp?doc_id=1323695&page_number=4).
It's possible that the benchmark system was using 8 lanes of Gen2.
No reference to MMRBC in this document, but I understand "Max Read Request
Size"
is the same thing. Page 765 tells us about register A8, bits 12-14
that should be set to 100.
pcictl /dev/pci5 read -d 0 -f 1 0x18 tells me the value 0x00092810
pcictl /dev/pci5 write -d 0 -f 1 0x18 0x00094810
In the PCIe spec, this is controlled via the Device Control register which
is a 16-bit value. You want to set *two* fields, to different values. 0x18
looks like the wrong offset. The PCIe spec says offset 0x08, but that's
relative to the base of the capability structure, the offset of which is in
the low byte of the dword at 0x34. I am running NetBSD 5; my pcictl doesn't
support write as one of its options, but I'd expect that to be relative to
the base of the function config space, and *that's not the device
capabilities register*. It's a read/write register, which is one of the Base
Address Registers.
In any case, in the Device Control Register, bits 7:5 are the max payload
size *for writes* by the igx to the system. These must be set to 101b for
4K max payload size.
Similarly, bits 14:12 are the max read request size. These must also be set
to 101b for 4K max read request.
Since you did a dword read, the extra 0x9.... is the device status register.
This makes me suspicious as the device status register is claiming that you
have "unsupported request detected)" [bit 3] and "correctable error
detected" [bit 0]. Further, this register is RW1C for all these bits -- so
when you write 94810, it should have cleared the 9 (so a subsequent read
should have returned 4810).
Please check.
Might be good to post a "pcictl dump" of your device, just to expose all the
details.
Post by Terry Moore
Is the ixg in an expansion slot or integrated onto the main board?
In a slot.
Check the manual on the main board and find out whether other slots have 8
lanes of Gen2.
If so, move the board.
Best regards,
--Terry
--
Hisashi T Fujinaka - ***@twofifty.com
BSEE + BSChem + BAEnglish + MSCS + $2.50 = coffee
Terry Moore
2014-08-31 16:07:38 UTC
Permalink
-----Original Message-----
Sent: Saturday, August 30, 2014 21:29
To: Terry Moore
Subject: Re: FW: ixg(4) performances
Doesn't anyone read my posts or, more important, the PCIe spec?
2.5 Giga TRANSFERS per second.
I'm not sure I understand what you're saying.
"Signaling rate - Once initialized, each Link must only operate at one of
the supported signaling
levels. For the first generation of PCI Express technology, there is only
one signaling rate
defined, which provides an effective 2.5 Gigabits/second/Lane/direction of
raw bandwidth.
The second generation provides an effective 5.0
Gigabits/second/Lane/direction of raw
bandwidth. The third generation provides an effective 8.0
Gigabits/second/Lane/direction of raw bandwidth. The data rate is expected
to increase with technology advances in the future."

This is not 2.5G Transfers per second. PCIe talks about transactions rather
than transfers; one transaction requires either 12 bytes (for 32-bit
systems) or 16 bytes (for 64-bit systems) of overhead at the transaction
layer, plus 7 bytes at the link layer.

The maximum number of transactions per second paradoxically transfers the
fewest number of bytes; a 4K write takes 16+4096+5+2 byte times, and so only
about 60,000 such transactions are possible per second (moving about
248,000,000 bytes). [Real systems don't see this, quite -- Wikipedia claims,
for example 95% efficiency is typical for storage controllers.]

A 4-byte write takes 16+4+5+2 byte times, and so roughly 9 million
transactions are possible per second, but those 9 million transactions can
only move 36 million bytes.

Multiple lanes scale things fairly linearly. But there has to be one byte
per lane; a x8 configuration says that physical transfers are padded so that
each 4-byte write (which takes 27 bytes on the bus) will have to take 32
bytes. Instead of getting 72 million transactions per second, you get 62.5
million transactions/second, so it doesn't scale as nicely.
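For anyone who wants to redo the per-lane arithmetic (same assumptions: gen1
signaling, 8/10 encoding, and the per-transaction byte counts above):

#include <stdio.h>

int main(void)
{
	double lane = 2.5e9 / 10;	/* 250e6 bytes/s per lane after 8/10 */

	printf("4K writes:     %.0f/s\n", lane / (16 + 4096 + 5 + 2)); /* ~60700 */
	printf("4-byte writes: %.0f/s\n", lane / (16 + 4 + 5 + 2));    /* ~9.3e6 */
	return 0;
}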

Reads are harder to analyze, because they depend on the speed and design of
both ends of the link. The reader sends a read request packet, and the
read-responder (some time later) sends back the response.

As far as I can see, even at gen3 with lots of lanes, PCIe doesn't scale to
2.5 G transfers per second.

Best regards,
--Terry
Hisashi T Fujinaka
2014-08-31 16:38:44 UTC
Permalink
I may be wrong in the transactions/transfers. However, I think you're
reading the page incorrectly. The signalling rate is the physical speed
of the link. On top of that is the 8/10 encoding (the Ethernet
controller we're talking about is only Gen 2), the framing, etc, and
the spec discusses the data rate in GT/s. Gb/s means nothing.

It's like talking about the frequency of the Ethernet link, which we
never do. We talk about how much data can be transferred.

I'm also not sure if you've looked at an actual trace before, but a PCIe
link is incredibly chatty, and every transfer only has a payload of
64/128/256b (especially regarding the actual controller again).

So, those two coupled together (GT/s & chatty link with small packets)
means talking about things in Gb/s is not something used by people who
talk about PCIe every day (my day job). The signalling rate is not used
when talking about the max data transfer rate.
Post by Terry Moore
-----Original Message-----
Sent: Saturday, August 30, 2014 21:29
To: Terry Moore
Subject: Re: FW: ixg(4) performances
Doesn't anyone read my posts or, more important, the PCIe spec?
2.5 Giga TRANSFERS per second.
I'm not sure I understand what you're saying.
"Signaling rate - Once initialized, each Link must only operate at one of
the supported signaling
levels. For the first generation of PCI Express technology, there is only
one signaling rate
defined, which provides an effective 2.5 Gigabits/second/Lane/direction of
raw bandwidth.
The second generation provides an effective 5.0
Gigabits/second/Lane/direction of raw
bandwidth. The third generation provides an effective 8.0
Gigabits/second/Lane/direction of
10 raw bandwidth. The data rate is expected to increase with technology
advances in the future."
This is not 2.5G Transfers per second. PCIe talks about transactions rather
than transfers; one transaction requires either 12 bytes (for 32-bit
systems) or 16 bytes (for 64-bit systems) of overhead at the transaction
layer, plus 7 bytes at the link layer.
The maximum number of transactions per second paradoxically transfers the
fewest number of bytes; a 4K write takes 16+4096+5+2 byte times, and so only
about 60,000 such transactions are possible per second (moving about
248,000,000 bytes). [Real systems don't see this, quite -- Wikipedia claims,
for example 95% efficiency is typical for storage controllers.]
A 4-byte write takes 16+4+5+2 byte times, and so roughly 9 million
transactions are possible per second, but those 9 million transactions can
only move 36 million bytes.
Multiple lanes scale things fairly linearly. But there has to be one byte
per lane; a x8 configuration says that physical transfers are padded so that
each the 4-byte write (which takes 27 bytes on the bus) will have to take 32
bytes. Instead of getting 72 million transactions per second, you get 62.5
million transactions/second, so it doesn't scale as nicely.
Reads are harder to analyze, because they depend on the speed and design of
both ends of the link. The reader sends a read request packet, and the
read-responder (some time later) sends back the response.
As far as I can see, even at gen3 with lots of lanes, PCIe doesn't scale to
2.5 G transfers per second.
Best regards,
--Terry
--
Hisashi T Fujinaka - ***@twofifty.com
BSEE + BSChem + BAEnglish + MSCS + $2.50 = coffee
Hisashi T Fujinaka
2014-08-31 18:44:30 UTC
Permalink
Oh, and to answer the actual first, relevant question, I can try finding
out if we (day job, 82599) can do line rate at 2.5GT/s. I think we can
get a lot closer than you're getting but we don't test with NetBSD.
--
Hisashi T Fujinaka - ***@twofifty.com
BSEE + BSChem + BAEnglish + MSCS + $2.50 = coffee
Terry Moore
2014-09-03 17:58:30 UTC
Permalink
-----Original Message-----
Sent: Sunday, August 31, 2014 12:39
To: Terry Moore
Subject: RE: FW: ixg(4) performances
I may be wrong in the transactions/transfers. However, I think you're
reading the page incorrectly. The signalling rate is the physical speed of
the link. On top of that is the 8/10 encoding (the Ethernet controller
we're talking about is only Gen 2), the framing, etc, and the spec
discusses the data rate in GT/s. Gb/s means nothing.
Hi,

I see what the dispute is. The PCIe 3.0 spec *nowhere* uses "transfers" in
the sense of UNIT INTERVALs (it uses UNIT INTERVAL). The word "transfer" is
not used in that way in the spec. Transfer is used, mostly in the sense of
a larger message. It's not in the glossary, etc.

However, you are absolutely right, many places in the industry (Intel,
Wikipedia, etc.) refer to 2.5GT/s to mean 2.5G Unit Intervals/Second; and
it's common enough that it's the de-facto standard terminology. I only deal
with the spec, and don't pay that much attention to the ancillary material.

We still disagree about something: GT/s does mean *something* in terms of
the raw throughput of the link. It tells the absolute upper limit of the
channel (ignoring protocol overhead). If the upper limit is too low, you can
stop worrying about fine points. If you're only trying to push 10% (for
example) and you're not getting it, you look for protocol problems. At 50%,
you say "hmm, we could be hitting the channel capacity".

I think we were violently agreeing, because the technical content of what I
was writing was (modulo possible typos) identical to 2.5GT/s. 2.5GT/s is
(after 8/10 encoding) a max raw data rate of 250e6 bytes/sec, and when you
go through things and account for overhead, it's very possible that 8 lanes
(max 2e9 bytes/sec) of gen1 won't be fast enough for 10G Eth.

Best regards,
--Terry
David Laight
2014-10-01 19:59:37 UTC
Permalink
Post by Terry Moore
This is not 2.5G Transfers per second. PCIe talks about transactions rather
than transfers; one transaction requires either 12 bytes (for 32-bit
systems) or 16 bytes (for 64-bit systems) of overhead at the transaction
layer, plus 7 bytes at the link layer.
The maximum number of transactions per second paradoxically transfers the
fewest number of bytes; a 4K write takes 16+4096+5+2 byte times, and so only
about 60,000 such transactions are possible per second (moving about
248,000,000 bytes). [Real systems don't see this, quite -- Wikipedia claims,
for example 95% efficiency is typical for storage controllers.]
The gain for large transfer requests is probably minimal.
There can be multiple requests outstanding at any one time (the limit
is negotiated, I'm guessing that 8 and 16 are typical values).
A typical PCIe dma controller will generate multiple concurrent transfer
requests, so even if the requests are only 128 bytes you can get a
reasonable overall throughput.
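A rough feel for the numbers (a sketch only, reusing the ~2.5us read latency
quoted earlier in the thread and fixed 128-byte requests):

#include <stdio.h>

/* throughput vs. number of outstanding 128-byte read requests */
int main(void)
{
	double latency = 2.5e-6;
	int tags;

	for (tags = 1; tags <= 16; tags *= 2)
		printf("%2d outstanding: %4.1f Gb/s\n",
		    tags, tags * 128 / latency * 8 / 1e9);
	return 0;
}

So with the 8 or 16 outstanding requests guessed above you already get into
the multi-Gb/s range even with 128-byte requests.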
Post by Terry Moore
A 4-byte write takes 16+4+5+2 byte times, and so roughly 9 million
transactions are possible per second, but those 9 million transactions can
only move 36 million bytes.
Except that nothing will generate adequately overlapped short transfers.

The real performance killer is cpu pio cycles.
Every one that the driver does will hit the throughput - the cpu will
be spinning for a long, long time (think ISA bus speeds).

A side effect of this is that PCI-PCIe bridges (either way) are doomed
to be very inefficient.
Post by Terry Moore
Multiple lanes scale things fairly linearly. But there has to be one byte
per lane; a x8 configuration says that physical transfers are padded so that
each the 4-byte write (which takes 27 bytes on the bus) will have to take 32
bytes. Instead of getting 72 million transactions per second, you get 62.5
million transactions/second, so it doesn't scale as nicely.
I think that individual PCIe transfer requests always use a single lane.
Multiple lanes help if you have multiple concurrent transfers.
So different chunks of an Ethernet frame can be transferred in parallel
over multiple lanes, with the transfer not completing until all the
individual parts complete.
So the ring status transfer can't be scheduled until all the other
data fragment transfers have completed.

I also believe that the PCIe transfers are inherently 64bit.
There are byte-enables indicating which bytes of the first and last
64bit words are actually required.

The real thing to remember about PCIe is that it is a comms protocol,
not a bus protocol.
It is high throughput, high latency.

I've had 'fun' getting even moderate PCIe throughput into an fpga.

David
--
David Laight: ***@l8s.co.uk
Emmanuel Dreyfus
2014-09-01 02:10:17 UTC
Permalink
Post by Terry Moore
Since you did a dword read, the extra 0x9.... is the device status register.
This makes me suspicious as the device status register is claiming that you
have "unsupported request detected)" [bit 3] and "correctable error
detected" [bit 0]. Further, this register is RW1C for all these bits -- so
when you write 94810, it should have cleared the 9 (so a subsequent read
should have returned 4810).
Please check.
You are right;
# pcictl /dev/pci5 read -d 0 -f 1 0xa8
00092810
# pcictl /dev/pci5 write -d 0 -f 1 0xa8 0x00094810
# pcictl /dev/pci5 read -d 0 -f 1 0xa8
00004810
Post by Terry Moore
Might be good to post a "pcictl dump" of your device, just to expose all the
details.
It explicitly says 2.5 Gb/s x 8 lanes:

# pcictl /dev/pci5 dump -d0 -f 1
PCI configuration registers:
Common header:
0x00: 0x10fb8086 0x00100107 0x02000001 0x00800010

Vendor Name: Intel (0x8086)
Device Name: 82599 (SFI/SFP+) 10 GbE Controller (0x10fb)
Command register: 0x0107
I/O space accesses: on
Memory space accesses: on
Bus mastering: on
Special cycles: off
MWI transactions: off
Palette snooping: off
Parity error checking: off
Address/data stepping: off
System error (SERR): on
Fast back-to-back transactions: off
Interrupt disable: off
Status register: 0x0010
Interrupt status: inactive
Capability List support: on
66 MHz capable: off
User Definable Features (UDF) support: off
Fast back-to-back capable: off
Data parity error detected: off
DEVSEL timing: fast (0x0)
Slave signaled Target Abort: off
Master received Target Abort: off
Master received Master Abort: off
Asserted System Error (SERR): off
Parity error detected: off
Class Name: network (0x02)
Subclass Name: ethernet (0x00)
Interface: 0x00
Revision ID: 0x01
BIST: 0x00
Header Type: 0x00+multifunction (0x80)
Latency Timer: 0x00
Cache Line Size: 0x10

Type 0 ("normal" device) header:
0x10: 0xdfe8000c 0x00000000 0x0000bc01 0x00000000
0x20: 0xdfe7c00c 0x00000000 0x00000000 0x00038086
0x30: 0x00000000 0x00000040 0x00000000 0x00000209

Base address register at 0x10
type: 64-bit prefetchable memory
base: 0x00000000dfe80000, not sized
Base address register at 0x18
type: i/o
base: 0x0000bc00, not sized
Base address register at 0x1c
not implemented(?)
Base address register at 0x20
type: 64-bit prefetchable memory
base: 0x00000000dfe7c000, not sized
Cardbus CIS Pointer: 0x00000000
Subsystem vendor ID: 0x8086
Subsystem ID: 0x0003
Expansion ROM Base Address: 0x00000000
Capability list pointer: 0x40
Reserved @ 0x38: 0x00000000
Maximum Latency: 0x00
Minimum Grant: 0x00
Interrupt pin: 0x02 (pin B)
Interrupt line: 0x09

Capability register at 0x40
type: 0x01 (Power Management, rev. 1.0)
Capability register at 0x50
type: 0x05 (MSI)
Capability register at 0x70
type: 0x11 (MSI-X)
Capability register at 0xa0
type: 0x10 (PCI Express)

PCI Message Signaled Interrupt
Message Control register: 0x0180
MSI Enabled: no
Multiple Message Capable: no (1 vector)
Multiple Message Enabled: off (1 vector)
64 Bit Address Capable: yes
Per-Vector Masking Capable: yes
Message Address (lower) register: 0x00000000
Message Address (upper) register: 0x00000000
Message Data register: 0x00000000
Vector Mask register: 0x00000000
Vector Pending register: 0x00000000

PCI Power Management Capabilities Register
Capabilities register: 0x4823
Version: 1.2
PME# clock: off
Device specific initialization: on
3.3V auxiliary current: self-powered
D1 power management state support: off
D2 power management state support: off
PME# support: 0x09
Control/status register: 0x2000
Power state: D0
PCI Express reserved: off
No soft reset: off
PME# assertion disabled
PME# status: off

PCI Express Capabilities Register
Capability version: 2
Device type: PCI Express Endpoint device
Interrupt Message Number: 0
Link Capabilities Register: 0x00027482
Maximum Link Speed: unknown 2 value
Maximum Link Width: x8 lanes
Port Number: 0
Link Status Register: 0x1081
Negotiated Link Speed: 2.5Gb/s
Negotiated Link Width: x8 lanes

Device-dependent header:
0x40: 0x48235001 0x2b002000 0x00000000 0x00000000
0x50: 0x01807005 0x00000000 0x00000000 0x00000000
0x60: 0x00000000 0x00000000 0x00000000 0x00000000
0x70: 0x003fa011 0x00000004 0x00002004 0x00000000
0x80: 0x00000000 0x00000000 0x00000000 0x00000000
0x90: 0x00000000 0x00000000 0x00000000 0x00000000
0xa0: 0x00020010 0x10008cc2 0x00004810 0x00027482
0xb0: 0x10810000 0x00000000 0x00000000 0x00000000
0xc0: 0x00000000 0x0000001f 0x00000000 0x00000000
0xd0: 0x00000000 0x00000000 0x00000000 0x00000000
0xe0: 0x00000000 0x00000000 0x00000000 0x00000000
0xf0: 0x00000000 0x00000000 0x00000000 0x00000000
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
***@netbsd.org
Masanobu SAITOH
2014-09-01 05:31:04 UTC
Permalink
Hi, Emmanuel.
Post by Emmanuel Dreyfus
Post by Terry Moore
Since you did a dword read, the extra 0x9.... is the device status register.
This makes me suspicious as the device status register is claiming that you
have "unsupported request detected)" [bit 3] and "correctable error
detected" [bit 0]. Further, this register is RW1C for all these bits -- so
when you write 94810, it should have cleared the 9 (so a subsequent read
should have returned 4810).
Please check.
You are right;
# pcictl /dev/pci5 read -d 0 -f 1 0xa8
00092810
# pcictl /dev/pci5 write -d 0 -f 1 0xa8 0x00094810
# pcictl /dev/pci5 read -d 0 -f 1 0xa8
00004810
Post by Terry Moore
Might be good to post a "pcictl dump" of your device, just to expose all the
details.
It explicitely says 2.5 Gb/s x 8 lanes
# pcictl /dev/pci5 dump -d0 -f 1
0x00: 0x10fb8086 0x00100107 0x02000001 0x00800010
Vendor Name: Intel (0x8086)
Device Name: 82599 (SFI/SFP+) 10 GbE Controller (0x10fb)
Command register: 0x0107
I/O space accesses: on
Memory space accesses: on
Bus mastering: on
Special cycles: off
MWI transactions: off
Palette snooping: off
Parity error checking: off
Address/data stepping: off
System error (SERR): on
Fast back-to-back transactions: off
Interrupt disable: off
Status register: 0x0010
Interrupt status: inactive
Capability List support: on
66 MHz capable: off
User Definable Features (UDF) support: off
Fast back-to-back capable: off
Data parity error detected: off
DEVSEL timing: fast (0x0)
Slave signaled Target Abort: off
Master received Target Abort: off
Master received Master Abort: off
Asserted System Error (SERR): off
Parity error detected: off
Class Name: network (0x02)
Subclass Name: ethernet (0x00)
Interface: 0x00
Revision ID: 0x01
BIST: 0x00
Header Type: 0x00+multifunction (0x80)
Latency Timer: 0x00
Cache Line Size: 0x10
0x10: 0xdfe8000c 0x00000000 0x0000bc01 0x00000000
0x20: 0xdfe7c00c 0x00000000 0x00000000 0x00038086
0x30: 0x00000000 0x00000040 0x00000000 0x00000209
Base address register at 0x10
type: 64-bit prefetchable memory
base: 0x00000000dfe80000, not sized
Base address register at 0x18
type: i/o
base: 0x0000bc00, not sized
Base address register at 0x1c
not implemented(?)
Base address register at 0x20
type: 64-bit prefetchable memory
base: 0x00000000dfe7c000, not sized
Cardbus CIS Pointer: 0x00000000
Subsystem vendor ID: 0x8086
Subsystem ID: 0x0003
Expansion ROM Base Address: 0x00000000
Capability list pointer: 0x40
Maximum Latency: 0x00
Minimum Grant: 0x00
Interrupt pin: 0x02 (pin B)
Interrupt line: 0x09
Capability register at 0x40
type: 0x01 (Power Management, rev. 1.0)
Capability register at 0x50
type: 0x05 (MSI)
Capability register at 0x70
type: 0x11 (MSI-X)
Capability register at 0xa0
type: 0x10 (PCI Express)
PCI Message Signaled Interrupt
Message Control register: 0x0180
MSI Enabled: no
Multiple Message Capable: no (1 vector)
Multiple Message Enabled: off (1 vector)
64 Bit Address Capable: yes
Per-Vector Masking Capable: yes
Message Address (lower) register: 0x00000000
Message Address (upper) register: 0x00000000
Message Data register: 0x00000000
Vector Mask register: 0x00000000
Vector Pending register: 0x00000000
PCI Power Management Capabilities Register
Capabilities register: 0x4823
Version: 1.2
PME# clock: off
Device specific initialization: on
3.3V auxiliary current: self-powered
D1 power management state support: off
D2 power management state support: off
PME# support: 0x09
Control/status register: 0x2000
Power state: D0
PCI Express reserved: off
No soft reset: off
PME# assertion disabled
PME# status: off
PCI Express Capabilities Register
Capability version: 2
Device type: PCI Express Endpoint device
Interrupt Message Number: 0
Link Capabilities Register: 0x00027482
Maximum Link Speed: unknown 2 value
Maximum Link Width: x8 lanes
Port Number: 0
Link Status Register: 0x1081
Negotiated Link Speed: 2.5Gb/s
*

Which version of NetBSD are you using?

I committed some changes fixing Gb/s to GT/s in pci_subr.c.
That was in April 2013. I suspect you are using netbsd-6, or
you are using -current with an old /usr/lib/libpci.so.
Post by Emmanuel Dreyfus
Negotiated Link Width: x8 lanes
0x40: 0x48235001 0x2b002000 0x00000000 0x00000000
0x50: 0x01807005 0x00000000 0x00000000 0x00000000
0x60: 0x00000000 0x00000000 0x00000000 0x00000000
0x70: 0x003fa011 0x00000004 0x00002004 0x00000000
0x80: 0x00000000 0x00000000 0x00000000 0x00000000
0x90: 0x00000000 0x00000000 0x00000000 0x00000000
0xa0: 0x00020010 0x10008cc2 0x00004810 0x00027482
0xb0: 0x10810000 0x00000000 0x00000000 0x00000000
0xc0: 0x00000000 0x0000001f 0x00000000 0x00000000
0xd0: 0x00000000 0x00000000 0x00000000 0x00000000
0xe0: 0x00000000 0x00000000 0x00000000 0x00000000
0xf0: 0x00000000 0x00000000 0x00000000 0x00000000
--
-----------------------------------------------
SAITOH Masanobu (***@execsw.org
***@netbsd.org)
Terry Moore
2014-09-03 18:12:04 UTC
Permalink
From Emmanuel Dreyfus
You are right;
# pcictl /dev/pci5 read -d 0 -f 1 0xa8
00092810
# pcictl /dev/pci5 write -d 0 -f 1 0xa8 0x00094810
# pcictl /dev/pci5 read -d 0 -f 1 0xa8
00004810
That's reassuring. The dump confirms that we're looking at the right
registers, thank you.

As I read the spec, 0x4810 in the Device Control Register means:

Max_Read_Request_Size: 100b -> 4096 bytes
Enable_No_Snoop: 1
Max_Payload_Size: 000b --> 128 bytes
Enable_Relaxed_Ordering: 1

All other options turned off.

I think you should try:

Max_Read_Request_Size: 100b -> 4096 bytes
Enable_No_Snoop: 1
Max_Payload_Size: 100b --> 4096 bytes
Enable_Relaxed_Ordering: 1

This would give 0x4890 as the value, not 0x4810.
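A quick decode helper for those two values (field positions per the standard
Device Control layout: bits 7:5 payload, bit 11 no-snoop, bits 14:12 read
request, bit 4 relaxed ordering; the byte-count interpretation is left out
on purpose):

#include <stdio.h>

static void decode(unsigned int v)
{
	printf("0x%04x: MaxPayload=%u MaxReadReq=%u NoSnoop=%u RelaxedOrd=%u\n",
	    v, (v >> 5) & 7, (v >> 12) & 7, (v >> 11) & 1, (v >> 4) & 1);
}

int main(void)
{
	decode(0x4810);		/* current value */
	decode(0x4890);		/* suggested value */
	return 0;
}

The only difference between the two values is the Max_Payload_Size field.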

It's odd that the BIOS set the max_payload_size to 000b. It's possible that
this indicates that the root complex has some limitations. Or it could be a
buggy or excessively conservative BIOS. ("It's safer to program Add-In
boards conservatively -- fewer support calls due to dead systems." Or
something like that.)

So you may have to experiment. This would explain why you saw 2.5 Gb/sec
before, and 2.7 Gb/sec after -- you increased the max *read* size, but not
the max *write* size. Increasing from 2048 to 4096 would improve read
throughput but not enormously. Depends, of course, on your benchmark.

--Terry
Thor Lancelot Simon
2014-08-30 07:22:54 UTC
Permalink
Post by Terry Moore
Is the ixg in an expansion slot or integrated onto the main board?
If you know where to get a mainboard with an integrated ixg, I wouldn't
mind hearing about it.

Thor
Justin Cormack
2014-08-30 09:24:52 UTC
Permalink
Post by Thor Lancelot Simon
Post by Terry Moore
Is the ixg in an expansion slot or integrated onto the main board?
If you know where to get a mainboard with an integrated ixg, I wouldn't
mind hearing about it.
They are starting to appear, eg
http://www.supermicro.co.uk/products/motherboard/Xeon/C600/X9SRH-7TF.cfm

Justin
Bert Kiers
2014-09-03 14:11:29 UTC
Permalink
Post by Justin Cormack
Post by Thor Lancelot Simon
Post by Terry Moore
Is the ixg in an expansion slot or integrated onto the main board?
If you know where to get a mainboard with an integrated ixg, I wouldn't
mind hearing about it.
They are starting to appear, eg
http://www.supermicro.co.uk/products/motherboard/Xeon/C600/X9SRH-7TF.cfm
We have some SuperMicro X9DRW with 10 GbE Intel NICs on board. NetBSD
current does not configure them.

NetBSD 6.1 says:

vendor 0x8086 product 0x1528 (ethernet network, revision 0x01) at pci1 dev 0 function 0 not configured

Complete messages: http://netbsd.itsx.net/hw/x9drw.dmesg

NetBSD current from today also does not configure them.

(I boot from a USB stick with a current kernel and 6.1 userland. It wants
to mount sda0 but there is only dk0, dk1. So I end up with a read-only
disk. I have to sort that out before I can save kernel messages.)

Grtnx,
--
B*E*R*T
Bert Kiers
2014-09-03 15:13:44 UTC
Permalink
Post by Bert Kiers
vendor 0x8086 product 0x1528 (ethernet network, revision 0x01) at pci1 dev 0 function 0 not configured
Complete messages: http://netbsd.itsx.net/hw/x9drw.dmesg
NetBSD current from today also does not configure them.
Btw, FreeBSD 9.3-RELEASE says:

ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.15> port 0x8020-0x803f mem 0xde200000-0xde3fffff,0xde404000-0xde407fff irq 26 at device 0.0 on pci1
ix0: Using MSIX interrupts with 9 vectors
ix0: Ethernet address: 00:25:90:f9:49:20
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
--
B*E*R*T
Emmanuel Dreyfus
2014-09-03 15:40:52 UTC
Permalink
Post by Bert Kiers
vendor 0x8086 product 0x1528 (ethernet network, revision 0x01) at pci1 dev 0 function 0 not configured
In src/sys/dev/pci/ixgbe/ we know about product IDs 0x1529 and 0x152A but
not 0x1528. But this can probably be easily borrowed from FreeBSD:
http://svnweb.freebsd.org/base/head/sys/dev/ixgbe/

They call it IXGBE_DEV_ID_X540T. You can try adding, in ixgbe_type.h:
#define IXGBE_DEV_ID_82599_X540T 0x1528

Then in ixgbe.c add an IXGBE_DEV_ID_82599_X540T line in
ixgbe_vendor_info_array[]

In ixgbe_82599.c you need a case IXGBE_DEV_ID_82599_X540T
next to case IXGBE_DEV_ID_82599_BACKPLANE_FCOE if media
is indeed backplane. Otherwise add it at the appropriate place
in the switch statement.

And finally you need to add a case IXGBE_DEV_ID_82599_X540T
next to case IXGBE_DEV_ID_82599_BACKPLANE_FCOE in ixgbe_api.c
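Roughly, the kind of additions I mean (a sketch from memory; the array entry
layout and the IXGBE_INTEL_VENDOR_ID name should be double-checked against
the existing lines in the driver):

/* ixgbe_type.h */
#define IXGBE_DEV_ID_82599_X540T	0x1528

/* ixgbe.c, in ixgbe_vendor_info_array[], next to the other 82599 entries */
{IXGBE_INTEL_VENDOR_ID, IXGBE_DEV_ID_82599_X540T, 0, 0, 0},

/* ixgbe_82599.c and ixgbe_api.c, in the switch statements */
case IXGBE_DEV_ID_82599_X540T: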
--
Emmanuel Dreyfus
***@netbsd.org
Masanobu SAITOH
2014-09-04 02:24:33 UTC
Permalink
Post by Emmanuel Dreyfus
Post by Bert Kiers
vendor 0x8086 product 0x1528 (ethernet network, revision 0x01) at pci1 dev 0 function 0 not configured
In src/sys/dev/pci/ixgbe/ we know about producct Id 0x1529 and 0x152A but
http://svnweb.freebsd.org/base/head/sys/dev/ixgbe/
#define IXGBE_DEV_ID_82599_X540T 0x1528
Then in ixgbe.c add a IXGBE_DEV_ID_82599_X540T line in
ixgbe_vendor_info_array[]
In ixgbe_82599.c you need a case IXGBE_DEV_ID_82599_X540T
next to case IXGBE_DEV_ID_82599_BACKPLANE_FCOE if media
is indeed backplane. Otherwise add it at the appropriate place
in the switch statement.
And finally you need to add a case IXGBE_DEV_ID_82599_X540T
next to case IXGBE_DEV_ID_82599_BACKPLANE_FCOE in ixgbe_api.c
Our ixg(4) driver doesn't support X520. At least, there is no
counterpart to FreeBSD's ixgbe/ixgbe_x540.c file.


              Bus     FreeBSD   NetBSD
 82597        PCI-X   ixgb      dge
 82598        PCIe    ixgbe     ixg
 82599(X520)  PCIe    ixgbe     ixg
 X540         PCIe    ixgbe     (ixg) (not yet)
--
-----------------------------------------------
SAITOH Masanobu (***@execsw.org
***@netbsd.org)
Masanobu SAITOH
2014-09-04 02:31:07 UTC
Permalink
Post by Masanobu SAITOH
Post by Emmanuel Dreyfus
Post by Bert Kiers
vendor 0x8086 product 0x1528 (ethernet network, revision 0x01) at pci1 dev 0 function 0 not configured
In src/sys/dev/pci/ixgbe/ we know about producct Id 0x1529 and 0x152A but
http://svnweb.freebsd.org/base/head/sys/dev/ixgbe/
#define IXGBE_DEV_ID_82599_X540T 0x1528
Then in ixgbe.c add a IXGBE_DEV_ID_82599_X540T line in
ixgbe_vendor_info_array[]
In ixgbe_82599.c you need a case IXGBE_DEV_ID_82599_X540T
next to case IXGBE_DEV_ID_82599_BACKPLANE_FCOE if media
is indeed backplane. Otherwise add it at the appropriate place
in the switch statement.
And finally you need to add a case IXGBE_DEV_ID_82599_X540T
next to case IXGBE_DEV_ID_82599_BACKPLANE_FCOE in ixgbe_api.c
Our ixg(4) driver doesn't support X520. At least there is no
s/X520/X540/
Post by Masanobu SAITOH
file ixgbe/ixgbe_x540.c of FreeBSD.
Bus FreeBSD NetBSD
82597 PCI-X ixgb dge
82598 PCIe ixgbe ixg
82599(X520) PCIe ixgbe ixg
X540 PCIe ixgbe (ixg)(not yet)
--
-----------------------------------------------
SAITOH Masanobu (***@execsw.org
***@netbsd.org)
Matthias Drochner
2014-08-30 13:09:09 UTC
Permalink
On Fri, 29 Aug 2014 15:51:14 +0000
Post by Emmanuel Dreyfus
I found this, but the result does not make sense: negociated > max ...
Link Capabilities Ragister (0xAC): 0x00027482
bits 3:0 Supprted Link speed: 0010 = 5 GbE and 2.5 GbE speed
supported bits 9:4 Max link width: 001000 = x4
Wrong -- this means x8.
Post by Emmanuel Dreyfus
bits 14:12 L0s exit lattency: 101 = 1 µs - 2 µs
bits 17:15 L1 Exit lattency: 011 = 4 µs - 8 µs
Link Status Register (0xB2): 0x1081
bits 3:0 Current Link speed: 0001 = 2.5 GbE PCIe link
bits 9:4 Negociated link width: 001000 = x8
So it makes sense.

best regards
Matthias


Emmanuel Dreyfus
2014-08-30 16:02:20 UTC
Permalink
Post by Matthias Drochner
Post by Emmanuel Dreyfus
Link Capabilities Ragister (0xAC): 0x00027482
bits 3:0 Supprted Link speed: 0010 = 5 GbE and 2.5 GbE speed
supported bits 9:4 Max link width: 001000 = x4
Wrong -- this means x8.
Post by Emmanuel Dreyfus
bits 14:12 L0s exit lattency: 101 = 1 µs - 2 µs
bits 17:15 L1 Exit lattency: 011 = 4 µs - 8 µs
Link Status Register (0xB2): 0x1081
bits 3:0 Current Link speed: 0001 = 2.5 GbE PCIe link
bits 9:4 Negociated link width: 001000 = x8
So it makes sense.
Right, hence the ethernet board can do 5 GbE x 8, but something in front
of it can only do 2.5 GbE x 8.

But 2.5 GbE x 8 means 20 Gb/s, which is much larger than the 2.7 Gb/s I
get.
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
***@netbsd.org
Hisashi T Fujinaka
2014-08-28 15:37:06 UTC
Permalink
Post by Emmanuel Dreyfus
Post by Emmanuel Dreyfus
http://dak1n1.com/blog/7-performance-tuning-intel-10gbe
It seems that page describe a slightly different model.
http://www.intel.fr/content/www/fr/fr/ethernet-controllers/82599-10-gbe-controller-datasheet.html
No reference to MMRBC in this document, but I understand "Max Read Request
Size" is the same thing. Page 765 tells us about register A8, bits 12-14
that should be set to 100.
pcictl /dev/pci5 read -d 0 -f 1 0x18 tells me the value 0x00092810
pcictl /dev/pci5 write -d 0 -f 1 0x18 0x00094810
Further pcictl read suggests it works as the new value is returned.
However it gives no performance improvement. This means that I
misunderstood what this register is about, or how to change it (byte order?).
Or the performance are constrained by something unrelated. In the blog
post cited above, the poster acheived more than 5 Gb/s before touching
MMRBC, while I am stuck at 2,7 GB/s. Any new idea welcome.
Isn't your PCIe slot constrained? I thought I remembered that you're
only getting 2.5GT/s and I forget what test you're running.
--
Hisashi T Fujinaka - ***@twofifty.com
BSEE + BSChem + BAEnglish + MSCS + $2.50 = coffee
Emmanuel Dreyfus
2014-08-28 15:48:23 UTC
Permalink
Post by Hisashi T Fujinaka
Isn't your PCIe slot constrained? I thought I remembered that you're
only getting 2.5GT/s and I forget what test you're running.
I use netperf, and I now get 2.7 Gb/s.
--
Emmanuel Dreyfus
***@netbsd.org
Thor Lancelot Simon
2014-08-27 01:56:40 UTC
Permalink
Post by Emmanuel Dreyfus
Hi
ixgb(4) has poor performances, even on latest -current. Here is the
ixg1 at pci5 dev 0 function 1: Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.3.10
ixg1: clearing prefetchable bit
ixg1: interrupting at ioapic0 pin 9
ixg1: PCI Express Bus: Speed 2.5Gb/s Width x8
ifconfig ixg1 mtu 9000 tso4 ip4csum tcp4csum-tx udp4csum-tx
MTU 9000 considered harmful. Use something that fits in 8K with the headers.
It's a minor piece of the puzzle but nonetheless, it's a piece.

Thor
Thor Lancelot Simon
2014-08-27 02:58:05 UTC
Permalink
Thor,
The NetBSD TCP stack can't handle 8K payload by page-flipping the payload and prepending an mbuf for XDR/NFS/TCP/IP headers? Or is the issue the extra page-mapping for the prepended mbuf?
The issue is allocating the extra page for a milligram of data. It is almost
always a lose. Better to choose the MTU so that the whole packet fits neatly
in 8192 bytes.

It is helpful to understand where MTU 9000 came from: SGI was trying to
optimise UDP NFS performance, for NFSv2 with 8K maximum RPC size, on
systems that had 16K pages. You can't fit two of that kind of NFS request
in a 16K page, so you might as well allocate something a little bigger than
8K but that happens to leave your memory allocator some useful-sized chunks
to hand out to other callers.

I am a little hazy on the details, but I believe they ended up at MTU 9024,
which is 8K + 768 + 64 (leaving a bunch of handy power-of-2 split sizes from
the 16K page as residuals: 16384 - 9024 = 7360 = 4096 + 2048 + 1024 + 128 + 64),
which just made no sense to
anyone else so everyone _else_ picked random sizes around 9000 that happened
to work for their hardware. But at the end of the day, if you do not have
16K pages or are not optimizing for 8K NFSv2 requests on UDP, an MTU that
fits in 8K is almost always better.

Thor
Jonathan Stone
2014-08-27 02:03:06 UTC
Permalink
Thor,

The NetBSD TCP stack can't handle 8K payload by page-flipping the payload and prepending an mbuf for XDR/NFS/TCP/IP headers? Or is the issue the extra page-mapping for the prepended mbuf?

--------------------------------------------
On Tue, 8/26/14, Thor Lancelot Simon <***@panix.com> wrote:

Subject: Re: ixg(4) performances
To: "Emmanuel Dreyfus" <***@netbsd.org>
Cc: tech-***@netbsd.org
Date: Tuesday, August 26, 2014, 6:56 PM

[...]

MTU 9000 considered harmful.  Use something that fits in 8K with the headers.
It's a minor piece of the puzzle but nonetheless, it's a piece.

Thor
Emmanuel Dreyfus
2014-08-27 04:54:10 UTC
Permalink
Post by Thor Lancelot Simon
MTU 9000 considered harmful. Use something that fits in 8K with the headers.
It's a minor piece of the puzzle but nonetheless, it's a piece.
mtu 8192 or 8000 does not bring any improvement over mtu 9000.
--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
***@netbsd.org