Discussion:
Only 8 MB/sec write throughput with NetBSD 5.1 AMD64
Ravishankar S
2011-10-09 08:52:14 UTC
Hi,

I am trying to set up an FTP/Samba file server for serving media files
at home, and I am seeing only 8 MB/sec disk write throughput with NetBSD
5.1 amd64. Previously, Windows 7 and Linux did 30+ MB/sec on the same
setup. I just did a simple dd to check:

dd if=/dev/zero of=/store/test.img bs=1000K count=1000

Setup details:

Asus M3N78-EM Mobo with NVIDIA GeForce 8300 chipset and Nvidia gigabit
ethernet (nfe driver in use)
AMD Athlon X2 2.6Ghz, 8GB DDR2
750GB WD 7400 rpm drive with 16MB hardware cache
Currently / is 40GB and I created a /store partition of 100GB size for testing
GENERIC kernel with no changes

How can I improve the disk write performance? What other data can I
collect to troubleshoot?
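
A minimal set of commands for collecting that data (a sketch; the exact
option sets are from memory and may vary between NetBSD versions; wd0 is
the disk from the dmesg below):

dmesg | egrep -i 'wd0|pciide|viaide|dma'   # how the kernel attached the disk and controller
dkctl wd0 getcache                         # whether the drive's write cache is enabled
iostat -x -w 1                             # per-disk throughput while the dd is running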

During installation I asked the installer to format /store as FFSv2 with
softdeps on (although I don't know why fstab shows just ffs?)

# NetBSD /etc/fstab
# See /usr/share/examples/fstab/ for more examples.
/dev/wd0a / ffs rw 1 1
/dev/wd0b none swap sw,dp 0 0
/dev/wd0e /store ffs rw,softdep 1 2
kernfs /kern kernfs rw
ptyfs /dev/pts ptyfs rw
procfs /proc procfs rw
/dev/cd0a /cdrom cd9660 ro,noauto

dmesg output:

Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
2006, 2007, 2008, 2009, 2010
The NetBSD Foundation, Inc. All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California. All rights reserved.

NetBSD 5.1 (GENERIC) #0: Sat Nov 6 13:19:33 UTC 2010
***@b6.netbsd.org:/home/builds/ab/netbsd-5-1-RELEASE/amd64/201011061943Z-obj/home/builds/ab/netbsd-5-1-RELEASE/src/sys/arch/amd64/compile/GENERIC
total memory = 8063 MB
avail memory = 7801 MB
timecounter: Timecounters tick every 10.000 msec
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
SMBIOS rev. 2.5 @ 0x9f000 (71 entries)
System manufacturer System Product Name (System Version)
mainbus0 (root)
cpu0 at mainbus0 apid 0: AMD 686-class, 2600MHz, id 0x60fb2
cpu1 at mainbus0 apid 1: AMD 686-class, 2600MHz, id 0x60fb2
ioapic0 at mainbus0 apid 2: pa 0xfec00000, version 11, 24 pins
acpi0 at mainbus0: Intel ACPICA 20080321
acpi0: X/RSDT: OemId <091010,XSDT1510,20100910>, AslId <MSFT,00000097>
acpi0: SCI interrupting at int 9
acpi0: fixed-feature power button present
timecounter: Timecounter "ACPI-Fast" frequency 3579545 Hz quality 1000
ACPI-Fast 24-bit timer
pcppi1 at acpi0 (SPKR, PNP0800): io 0x61
midi0 at pcppi1: PC speaker (CPU-intensive output)
sysbeep0 at pcppi1
FDC (PNP0700) at acpi0 not configured
LPTE (PNP0400) at acpi0 not configured
aiboost0 at acpi0 (ASOC, ATK0110-16843024)
aiboost0: ASUS AI Boost Hardware monitor
hpet0 at acpi0 (HPET, PNP0103-0): mem 0xfed00000-0xfed00fff irq 2,8
timecounter: Timecounter "hpet0" frequency 25000000 Hz quality 2000
attimer1 at acpi0 (TMR, PNP0100): io 0x40-0x43
UAR1 (PNP0501) at acpi0 not configured
WMI0 (pnp0c14) at acpi0 not configured
WMI1 (pnp0c14) at acpi0 not configured
acpibut0 at acpi0 (PWRB, PNP0C0C-170): ACPI Power Button
attimer1: attached to pcppi1
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
vendor 0x10de product 0x0754 (RAM memory, revision 0xa2) at pci0 dev 0
function 0 not configured
pcib0 at pci0 dev 1 function 0
pcib0: vendor 0x10de product 0x075c (rev. 0xa2)
vendor 0x10de product 0x0752 (SMBus serial bus, revision 0xa1) at pci0
dev 1 function 1 not configured
vendor 0x10de product 0x0751 (RAM memory, revision 0xa1) at pci0 dev 1
function 2 not configured
vendor 0x10de product 0x0753 (Co-processor processor, revision 0xa2)
at pci0 dev 1 function 3 not configured
vendor 0x10de product 0x0568 (RAM memory, revision 0xa1) at pci0 dev 1
function 4 not configured
ohci0 at pci0 dev 2 function 0: vendor 0x10de product 0x077b (rev. 0xa1)
LUB0: Picked IRQ 20 with weight 0
ohci0: interrupting at ioapic0 pin 20
ohci0: OHCI version 1.0, legacy support
usb0 at ohci0: USB revision 1.0
ehci0 at pci0 dev 2 function 1: vendor 0x10de product 0x077c (rev. 0xa1)
LUB2: Picked IRQ 21 with weight 0
ehci0: interrupting at ioapic0 pin 21
ehci0: BIOS has given up ownership
ehci0: EHCI version 1.0
ehci0: companion controller, 15 ports each: ohci0
usb1 at ehci0: USB revision 2.0
ohci1 at pci0 dev 4 function 0: vendor 0x10de product 0x077d (rev. 0xa1)
UB11: Picked IRQ 22 with weight 0
ohci1: interrupting at ioapic0 pin 22
ohci1: OHCI version 1.0, legacy support
usb2 at ohci1: USB revision 1.0
ehci1 at pci0 dev 4 function 1: vendor 0x10de product 0x077e (rev. 0xa1)
UB12: Picked IRQ 23 with weight 0
ehci1: interrupting at ioapic0 pin 23
ehci1: BIOS refuses to give up ownership, using force
ehci1: EHCI version 1.0
ehci1: companion controller, 15 ports each: ohci1
usb3 at ehci1: USB revision 2.0
viaide0 at pci0 dev 6 function 0
viaide0: NVIDIA MCP77 IDE Controller (rev. 0xa1)
viaide0: bus-master DMA support present
viaide0: primary channel configured to compatibility mode
viaide0: primary channel interrupting at ioapic0 pin 14
atabus0 at viaide0 channel 0
viaide0: secondary channel configured to compatibility mode
viaide0: secondary channel ignored (disabled)
azalia0 at pci0 dev 7 function 0: Generic High Definition Audio Controller
LAZA: Picked IRQ 20 with weight 1
azalia0: interrupting at ioapic0 pin 20
azalia0: host: 0x10de/0x0774 (rev. 161), HDA rev. 1.0
ppb0 at pci0 dev 8 function 0: vendor 0x10de product 0x075a (rev. 0xa1)
pci1 at ppb0 bus 1
pci1: no spaces enabled!
pciide0 at pci0 dev 9 function 0
pciide0: vendor 0x10de product 0x0ad0 (rev. 0xa2)
pciide0: bus-master DMA support present, but unused (no driver support)
pciide0: primary channel wired to native-PCI mode
LSA0: Picked IRQ 21 with weight 1
pciide0: using ioapic0 pin 21 for native-PCI interrupt
atabus1 at pciide0 channel 0
pciide0: secondary channel wired to native-PCI mode
atabus2 at pciide0 channel 1
nfe0 at pci0 dev 10 function 0: vendor 0x10de product 0x0760 (rev. 0xa2)
LMAC: Picked IRQ 22 with weight 1
nfe0: interrupting at ioapic0 pin 22
nfe0: Ethernet address 00:23:54:4f:d2:82
rgephy0 at nfe0 phy 3: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 2
rgephy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-FDX, auto
ppb1 at pci0 dev 11 function 0: vendor 0x10de product 0x0569 (rev. 0xa1)
pci2 at ppb1 bus 2
pci2: i/o space, memory space enabled, rd/line, wr/inv ok
vga0 at pci2 dev 0 function 0: vendor 0x10de product 0x0848 (rev. 0xa2)
wsdisplay0 at vga0 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
drm at vga0 not configured
ppb2 at pci0 dev 16 function 0: vendor 0x10de product 0x0778 (rev. 0xa1)
ppb2: unsupported PCI Express version
pci3 at ppb2 bus 3
pci3: no spaces enabled!
ppb3 at pci0 dev 18 function 0: vendor 0x10de product 0x075b (rev. 0xa1)
pci4 at ppb3 bus 4
pci4: no spaces enabled!
ppb4 at pci0 dev 19 function 0: vendor 0x10de product 0x077a (rev. 0xa1)
pci5 at ppb4 bus 5
pci5: memory space enabled, rd/line, wr/inv ok
fwohci0 at pci5 dev 0 function 0: vendor 0x197b product 0x2380 (rev. 0x00)
LN3A: Picked IRQ 16 with weight 0
fwohci0: interrupting at ioapic0 pin 16
fwohci0: OHCI version 1.10 (ROM=0)
fwohci0: No. of Isochronous channels is 4.
fwohci0: EUI64 00:1e:8c:00:01:98:8c:a1
fwohci0: Phy 1394a available S400, 2 ports.
fwohci0: fwphy_rddata: 0x5 loop=100, retry=100
fwohci0: fwphy_rddata: 0x2 loop=100, retry=100
fwohci0: Link S400, max_rec 2048 bytes.
ieee1394if0 at fwohci0: IEEE1394 bus
fwip0 at ieee1394if0: IP over IEEE1394
fwohci0: Initiate bus reset
fwohci0: fwphy_rddata: 0x1 loop=100, retry=100
fwohci0: fwphy_rddata: 0x1 loop=100, retry=100
pchb0 at pci0 dev 24 function 0
pchb0: vendor 0x1022 product 0x1100 (rev. 0x00)
pchb1 at pci0 dev 24 function 1
pchb1: vendor 0x1022 product 0x1101 (rev. 0x00)
pchb2 at pci0 dev 24 function 2
pchb2: vendor 0x1022 product 0x1102 (rev. 0x00)
amdtemp0 at pci0 dev 24 function 3
amdtemp0: AMD CPU Temperature Sensors (K8: core rev BH-G2, socket AM2)
isa0 at pcib0
lpt0 at isa0 port 0x378-0x37b irq 7
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
fdc0 at isa0 port 0x3f0-0x3f7 irq 6 drq 2
timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
azalia0: codec[0]: Realtek ALC888 (rev. 1.1), HDA rev. 1.0
azalia0: codec[3]: 0x10de/0x0002 (rev. 0.0), HDA rev. 1.0
audio0 at azalia0: full duplex, playback, capture, independent
fd0 at fdc0 drive 0: 1.44MB, 80 cyl, 2 head, 18 sec
uhub0 at usb0: vendor 0x10de OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 6 ports with 6 removable, self powered
uhub1 at usb1: vendor 0x10de EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub1: 6 ports with 6 removable, self powered
uhub2 at usb2: vendor 0x10de OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 6 ports with 6 removable, self powered
uhub3 at usb3: vendor 0x10de EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub3: 6 ports with 6 removable, self powered
wd0 at atabus1 drive 0: <ST3750640AS>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 698 GB, 1453518 cyl, 16 head, 63 sec, 512 bytes/sect x 1465147055 sectors
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1 at atabus1 drive 1: <WDC WD15EARS-00MVWB0>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 1397 GB, 2907021 cyl, 16 head, 63 sec, 512 bytes/sect x 2930277168 sectors
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd2 at atabus2 drive 0: <WDC WD10EARS-00MVWB0>
wd2: drive supports 16-sector PIO transfers, LBA48 addressing
wd2: 931 GB, 1938021 cyl, 16 head, 63 sec, 512 bytes/sect x 1953525168 sectors
wd2: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
umass0 at uhub3 port 3 configuration 1 interface 0
umass0: USB2.0 External Mass Storage Device, rev 2.00/1.60, addr 2
umass0: using SCSI over Bulk-Only
scsibus0 at umass0: 2 targets, 1 lun per target
cd0 at scsibus0 target 0 lun 0: <slimtype, eSAU208 3, YL09> cdrom removable
Kernelized RAIDframe activated
pad0: outputs: 44100Hz, 16-bit, stereo
audio1 at pad0: half duplex, playback, capture
boot device: wd0
root on wd0a dumps on wd0b
root file system type: ffs
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)

Thanks,
Ravi
Joerg Sonnenberger
2011-10-09 11:47:31 UTC
Post by Ravishankar S
I am trying to set up an FTP/Samba file server for serving media files
at home and I am seeing only 8MB/sec disk write throughput with NBSD
5.1 AMD64. Previously Windows 7 and Linux did 30 MB+/sec on the same
setup. I just did a simple dd to check
dd if=/dev/zero of=/store/test.img bs=1000K count=1000
What is the result of "dkctl wd0 getcache"?
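
For reference, dkctl(8) both reports and changes the drive's cache state,
e.g. (wd0 as in the dmesg above; whether enabling it is the fix here
depends on the answer to the question above):

dkctl wd0 getcache        # show whether read and write caching are enabled
dkctl wd0 setcache rw     # enable read and write caching, if the drive allows it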

Joerg
Manuel Bouyer
2011-10-09 12:08:48 UTC
Post by Ravishankar S
Hi,
I am trying to set up an FTP/Samba file server for serving media files
at home and I am seeing only 8MB/sec disk write throughput with NBSD
5.1 AMD64. Previously Windows 7 and Linux did 30 MB+/sec on the same
setup. I just did a simple dd to check
dd if=/dev/zero of=/store/test.img bs=1000K count=1000
pciide0 at pci0 dev 9 function 0
pciide0: vendor 0x10de product 0x0ad0 (rev. 0xa2)
pciide0: bus-master DMA support present, but unused (no driver support)
pciide0: primary channel wired to native-PCI mode
NetBSD 5.1 is missing the proper entry for your IDE controller;
as a result it's used in PIO mode.
A 5.1_STABLE kernel should probe it as an ahci controller and performance
will be much better.
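
A quick way to check which mode is actually in use, before and after
upgrading, is to look at how the controller and disks attached (the
"bus-master DMA support present, but unused" line above is the telltale):

dmesg | egrep 'pciide0|ahcisata|wd0'

Per the note above, with a 5.1_STABLE kernel the controller should attach
via the ahci driver rather than the bare pciide0, and the wd* lines should
show a DMA mode in use.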
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
Ravishankar S
2011-10-09 15:08:50 UTC
Post by Manuel Bouyer
Post by Ravishankar S
Hi,
I am trying to set up an FTP/Samba file server for serving media files
at home and I am seeing only 8MB/sec disk write throughput with NBSD
5.1 AMD64. Previously Windows 7 and Linux did 30 MB+/sec on the same
setup. I just did a simple dd to check
dd if=/dev/zero of=/store/test.img bs=1000K count=1000
pciide0 at pci0 dev 9 function 0
pciide0: vendor 0x10de product 0x0ad0 (rev. 0xa2)
pciide0: bus-master DMA support present, but unused (no driver support)
pciide0: primary channel wired to native-PCI mode
NetBSD 5.1 is missing the proper entry for your IDE controller,
as a result it's used in PIO mode.
A 5.1_STABLE kernel should probe is as an ahci controller and performances
will be much better.
--
    NetBSD: 26 ans d'experience feront toujours la difference
--
Thanks to all who replied. dd now shows 67 MB/sec! However, I have
another question: I can now FTP at 25 MB/sec and it never increases
beyond that point. I have a D-Link gigabit router and gigabit cards on
both ends; the client is an Ubuntu laptop.

Are there any other areas to explore to increase throughput? The MTU is
set to 1500 at both ends - is there any benefit to increasing it to 3000,
and will that enable jumbo frames automatically?
Manuel Bouyer
2011-10-09 16:05:02 UTC
Post by Ravishankar S
Thanks to all who replied. dd now shows 67 MB/sec!!. However, i have
beyond this point. I have a dlink gigabit router and gigabit cards on
both ends. client is an ubuntu laptop
What does ifconfig -a show? You can also test network speed with
a tool like ttcp; it's more useful for exploring network performance
than ftp (which mixes the performance of several different subsystems).

Note that not all gigabit network adapters or switches can sustain
link-rate traffic, but 25MB/s looks low anyway.
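
A minimal ttcp run, using the classic option set (the address is only a
placeholder), looks something like:

ttcp -r -s                 # on the receiving machine: discard received data
ttcp -t -s 192.168.1.10    # on the sending machine: transmit a test pattern

ttcp reports the achieved throughput when the transfer finishes.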
Post by Ravishankar S
Any other areas to explore to increase throughput? I see MTU set at
both ends at 1500 - any benefits to increasing it to 3000 - this will
enable jumbo frames automatically??
It should.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
Tony Bourke
2011-10-09 19:19:59 UTC
Changing MTU won't really help. It would also cause lots of other problems
(like fragmentation over your Internet connection) and every device on your
deity would need to be set for the same MTU. Plus you should be able to
push way more bandwidth without it. Higher MTUs used to be a way to increase
performance for things like iSCSI, but that's not really the case anymore,
even with 10 Gigabit Ethernet. So even 10-gigabit environments choose not
to do jumbo frames.

If your network card is PCI, your bandwidth will be limited to about 200-250 megabits per second (which is about 25-35 MBytes/s) because of the bus speed. PCIe will handle full duplex 1 gigabit no problem.

Tony




Sent from my iPhone
Post by Manuel Bouyer
Post by Ravishankar S
Thanks to all who replied. dd now shows 67 MB/sec!!. However, i have
beyond this point. I have a dlink gigabit router and gigabit cards on
both ends. client is an ubuntu laptop
What does ifconfig -a shows ? You can also test network speed with
a tool like ttcp, is't more usefull to explore network performances
than ftp (which mixes performances of different subsystems).
Note that not all gigabit network adapters or switches can support a
sustainted link-rate traffic, but 25MB/s looks low anyway
Post by Ravishankar S
Any other areas to explore to increase throughput? I see MTU set at
both ends at 1500 - any benefits to increasing it to 3000 - this will
enable jumbo frames automatically??
It should.
--
NetBSD: 26 ans d'experience feront toujours la difference
--
Thor Lancelot Simon
2011-10-09 21:12:02 UTC
Post by Tony Bourke
changing MTU won't help really.
Do you have data on which to base this conclusion?
Post by Tony Bourke
it would also cause lots of other problems (like fragmentation over your
Internet connection)
No. That is what path MTU discovery is for.
Post by Tony Bourke
and every device on your deity would need to be set for the same MTU.
I am not sure what deity is involved with this, except perhaps for the god
of unclear thinking (how many times have I tried to get him out of my
personal pantheon? it's not working)
Post by Tony Bourke
Plus you should be able to push way more bandwidth without it.
With a laptop disk drive as the data sink?
Post by Tony Bourke
higher MTUs used to be a way to increase performance for things like
iSCSI but that's not really the case anymore, even with 10 Gigabit
Ethernet. So even 10 gigabit environments choose not to do jumbo frames.
If you say so. Actually, network adapters got smart enough to make little
packets look like big packets to the OS (segmentation offload on send,
"Large Receive" on receive) but since we don't know what "Dlink Gigabit
adapter" is in use here, I have to question whether the former hardware
optimization is in use, and I can be sure the latter is not since NetBSD
does not presently support it with any network adapter.
Post by Tony Bourke
If your network card is PCI, your bandwidth will be limited to about
200-250 megabits per second (which is about 25-35 MBytes/s) because
of the bus speed.
The claim is false. Even giving a conservative 80MByte/sec estimate for
33MHz 32-bit PCI (the theoretical maximum is 132MByte/sec), and allowing
for some adapter overhead, that's well north of 500 megabits per second
-- and, in fact, 400-500 megabits per second can easily be achieved with
such adapters, under NetBSD. Clearly, either 64 bit or 66MHz PCI (never
mind 64 bit, 133MHz PCI-X) is sufficient to fill a gigabit link or come
very, very close (and indeed it is easy to reproduce this result with
real hardware under NetBSD as well).
Post by Tony Bourke
PCIe will handle full duplex 1 gigabit no problem.
The claim is true.

Thor
Tony Bourke
2011-10-09 22:32:26 UTC
Hi All,
Post by Thor Lancelot Simon
Post by Tony Bourke
changing MTU won't help really.
Do you have data on which to base this conclusioN?
It's long been assumed that to get better throughput, you should enable
jumbo frames (some switches support up to 12,000-byte frames, although
jumbo frames are normally limited to 9,000 bytes). While that may have been
true in the past (to tell the truth, I'd always parroted the assumption
myself, never having tested it or seen any evidence), it doesn't seem to
be true today.

Here are a couple of benchmarks to support that conclusion:


Jason Boche tests 9,000 byte frames with iSCSI, NFS and vMotion. The
results were mixed, but none of the results showed Jumbo frames having a
huge impact (best increase in performance was 7% for one metric, rest
were nearly even).
http://www.boche.net/blog/index.php/2011/01/24/jumbo-frames-comparison-testing-with-ip-storage-and-vmotion/

NetApp and VMware also do joint performance reports comparing various
protocols. They tested NFS and iSCSI with and without Jumbo frames
(jumbo frames actually hurt iSCSI performance a bit). Overall, not a
huge difference:

http://media.netapp.com/documents/tr-3808.pdf
Post by Thor Lancelot Simon
Post by Tony Bourke
it would also cause lots of other problems (like fragmentation over your
Internet connection)
No. That is what path MTU discovery is for.
Path MTU discovery doesn't work reliably. All it takes is one site or
admin along the path disabling ICMP.

http://en.wikipedia.org/wiki/Path_MTU_Discovery

Given that there's no real performance gain, and the problems that it
can potentially cause with MTU mismatches, more and more networking
administrators are choosing to leave it off these days.
Post by Thor Lancelot Simon
Post by Tony Bourke
and every device on your deity would need to be set for the same MTU.
I am not sure what deity is involved with this, except perhaps for the god
of unclear thinking (how many times have I tried to get him out of my
personal pantheon? it's not working)
Sorry, I typed that on my phone while in a cab. Should have said "device".
Post by Thor Lancelot Simon
Post by Tony Bourke
Plus you should be able to push way more bandwidth without it.
With a laptop disk drive as the data sink?
He'd been able to get higher throughput on Linux and Windows 7.
Post by Thor Lancelot Simon
Post by Tony Bourke
higher MTUs used to be a way to increase performance for things like
iSCSI but that's not really the case anymore, even with 10 Gigabit
Ethernet. So even 10 gigabit environments choose not to do jumbo frames.
If your network card is PCI, your bandwidth will be limited to about
200-250 megabits per second (which is about 25-35 MBytes/s) because
of the bus speed.
The claim is false. Even giving a conservative 80MByte/sec estimate for
33MHz 32-bit PCI (the theoretical maximum is 132MByte/sec), and allowing
for some adapter overhead, that's well north of 500 megabits per second
-- and, in fact, 400-500 megabits per second can easily be achieved with
such adapters, under NetBSD. Clearly, either 64 bit or 66MHz PCI (never
mind 64 bit, 133Mhz PCI-X) is sufficient to fill a gigabit link or come
very, very close (and indeed it is easy to reproduce this result with
real hardware under NetBSD as well).
You're right, my math was off. Still, PCI 32-bit can't max out a 1
Gigabit link, especially considering it's a shared bus.
Post by Thor Lancelot Simon
Post by Tony Bourke
PCIe will handle full duplex 1 gigabit no problem.
The claim is true.
Thor
Tony
Thor Lancelot Simon
2011-10-10 02:32:47 UTC
Post by Tony Bourke
Hi All,
Post by Thor Lancelot Simon
Post by Tony Bourke
changing MTU won't help really.
Do you have data on which to base this conclusioN?
It's been long assumed that to get better throughput, enable jumbo
Why assume? And why trust benchmarketing (are there really any industry
publications that don't primarily engage in precisely that, any more?)
It is very, very easy to measure for yourself and see. Note that 9K
MTU is strictly a lose unless you have a >9K hardware page size as the
SGI systems where the 9K MTU was originally used did.

I have -- many times, since network performance was my job for a long
time too -- measured, and many of those results can be found with the
NetBSD mail archive search engine or with simple Google. Particularly
when using a kernel that doesn't support any kind of "large receive"
acceleration (NetBSD does not), an appropriate MTU size that fits,
with headers, within one or two hardware pages but is larger than 1500
will give a measurable performance benefit end-to-end (though not a very
large one) and a more measurable reduction in receiver CPU utilization.
Try it and see -- but be sure you have first eliminated other obvious
bottlenecks such as tiny TCP windows by setting appropriate socket
buffer sizes, etc.

I really can't agree with you about path MTU discovery, either. With
proper blackhole detection (and if you don't have that, then you
should not be using path MTU discovery at all) it's plenty reliable;
and in any event, using a large local MTU won't cause a sudden magic
change of default to use the link layer MTU as the initial MTU for
remote peers anyway; only local ones.

Thor
Tony Bourke
2011-10-14 22:08:51 UTC
Post by Thor Lancelot Simon
Why assume? And why trust benchmarketing (are there really any industry
publications that don't primarily engage in precisely that, any more?)
One of the benchmarks I referenced wasn't from any company, just a curious individual in the virtualization field. The other was a test done by VMware and NetApp, and they weren't comparing themselves to competitors, just comparing protocols that were already supported. I don't see how that's benchmarketing.
Post by Thor Lancelot Simon
It is very, very easy to measure for yourself and see.
Note that 9K
MTU is strictly a lose unless you have a >9K hardware page size as the
SGI systems where the 9K MTU was originally used did.
Do you have any references to back that up? I'm genuinely curious.
Post by Thor Lancelot Simon
I have -- many times, since network performance was my job for a long
time too -- measured, and many of those results can be found with the
NetBSD mail archive search engine or with simple Google. Particularly
when using a kernel that doesn't support any kind of "large receive"
accelleration (NetBSD does not) an appropriate MTU size that fits,
with headers, within one or two hardware pages but is larger than 1500
will give a measurable performance benefit end-to-end (though not a very
large one) and a more measurable reduction in receiver CPU utilization.
Try it and see -- but be sure you have first eliminated other obvious
bottlenecks such as tiny TCP windows by setting appropriate socket
buffer sizes, etc.
I spent some time searching, and came up empty. If you could point to any posts/tests, I'd be grateful.
Post by Thor Lancelot Simon
I really can't agree with you about path MTU discovery, either. With
proper blackhole detection (and if you don't have that, then you
should not be using path MTU discovery at all) it's plenty reliable;
and in any event, using a large local MTU won't cause a sudden magic
change of default to use the link layer MTU as the inital MTU for
remote peers anyway; only local ones.
If you're talking about MSS clamping, I agree. But at the same time, is the original poster upping his MTU going to help? Unlikely, and it would likely complicate his network needlessly.

Speaking of the original poster, isn't it a shame how instead of being a group effort to help users and expand our understanding, the tone quickly devolves into an adversarial point/counterpoint battle royal? All we're missing is a phat beat, and we could go rap battle on this.

Tony
Post by Thor Lancelot Simon
Thor
David Laight
2011-10-15 08:27:46 UTC
Post by Tony Bourke
Post by Thor Lancelot Simon
I really can't agree with you about path MTU discovery, either. With
proper blackhole detection (and if you don't have that, then you
should not be using path MTU discovery at all) it's plenty reliable;
and in any event, using a large local MTU won't cause a sudden magic
change of default to use the link layer MTU as the inital MTU for
remote peers anyway; only local ones.
If you're talking about MSS clamping, I agree. But at the same time,
is the original poster upping his MTU going to help? Unlikely,
and it would likely complicate his network needlessly.
PMTU discovery relies on ICMP errors (fragmentation needed), which rely
on the router having actually received the packet.
If you send an overlong packet it will be discarded.
So 'Jumbo' frames can only be used if the devices they might be sent
to are expecting them.

I can't quite remember the exact length of the headers for TCP, but
it is about 50 bytes. So using very long frames gains you 50/1500,
or about 3%, if the network is 100% loaded. In software terms, the
receiving side can be (and in some cases has been) optimised for
receipt of in-sequence packets for the same connection.
(Optimising the sender is easier...)
So I suspect the actual speed improvement for jumbo frames can be
arranged to be minimal.
The costs of them are significant, especially if you start considering
the buffering requirements in Ethernet switches.
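
For concreteness, a back-of-the-envelope version of that calculation,
assuming IPv4 + TCP without options (40 bytes of L3/L4 headers) and about
38 bytes of Ethernet header, FCS, preamble and inter-frame gap per packet:

  1500-byte MTU: 1460 / 1538 = ~95% of line rate carries payload
  9000-byte MTU: 8960 / 9038 = ~99% of line rate carries payload

i.e. roughly a 3-4% difference at best on a fully loaded link, in line
with the figure above.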

OTOH I've NFI why they chose 1536 for the Ethernet frame limit.
Unless the idea was 1k of 'userdata' plus some headers??
When doing token-ring (& FDDI) the packet limit is based on the token
rotation time - and is defined as a time interval, not a byte count.
IIRC this effectively allows 4k userdata at 4Mbit/s and 16k userdata at
16Mbit/s, although all the protocol stack code is designed to limit the
physical MTU rather than the userdata.

9k packets would be rather magical if you could receive the last 8k
of the packet into a separate, page aligned, buffer! That would allow
page loading on the rx side!

David
--
David Laight: ***@l8s.co.uk
Thor Lancelot Simon
2011-10-15 14:05:37 UTC
Post by Tony Bourke
Post by Thor Lancelot Simon
Why assume? And why trust benchmarketing (are there really any industry
publications that don't primarily engage in precisely that, any more?)
One of the benchmarks I referenced wasn't from any company, just a curious individual in the virtualization field. The other was a test done by VMWare and NetApp, and they weren't comparing themselves to competitors, just comparing protocols that were already supported. I don't see how that's benchmarketing.
This is going to look very different on virtual "hardware" and real
hardware. The costs of manipulating the VM system are dramatically
different, and, often enough, the virtual system already has to (or
chooses to) move data around by page flipping.

VMware is going to recommend whatever looks best for their virtualized
environment -- which includes a virtual "switch", which will want more
resources (and possibly have to copy more data) to handle large frames.
Would they deliberately cook a benchmark to favor whatever looks good for
them, but indifferent or bad on real hardware? Probably not. But do they
care if they accidentally report such results and claim they're general?
Again, probably not.
Post by Tony Bourke
Post by Thor Lancelot Simon
It is very, very easy to measure for yourself and see.
You repeat this with no comment. I assume you didn't try the experiment,
however.
Post by Tony Bourke
Post by Thor Lancelot Simon
Note that 9K
MTU is strictly a lose unless you have a >9K hardware page size as the
SGI systems where the 9K MTU was originally used did.
Do you have any references to back that up? I'm geniuinly curious.
There was a beautiful PowerPoint presentation on this on one of the
FreeBSD developers' home pages for many years. I can't find it any more,
which is frustrating.

But, for historical perspective, here is where that weird 9K frame size
comes from, and why it's a poor choice of size for most newer systems:
In the era of SGI Challenge hardware (very large early R4xxx systems)
NFS was almost always used over UDP transport rather than TCP, and most
NFSv2 implementations couldn't use any RPC size other than 8192 bytes.

With Ethernet sized MTUs, the result was that every single NFS RPC for
a bulk data transfer could be guaranteed to generate 6 IP fragments. Not
good.

So SGI used a HIPPI interface (which could run with a very large MTU)
to evaluate the effect of different MTU settings on UDP NFS performance.
They found a sweet spot at 8K + RPC headers + UDP headers + IP headers +
HIPPI headers -- which works out to just about 9K -- and published this
result. At around the same time, Cray did similar testing and found 30K
to be optimal but didn't publicize this as widely (see, for example,
http://docs.cray.com/books/S-2366-12/html-S-2366-12/zfixedhv8xfxfn.html).

This was taken up by the people working on the various Internet2 projects,
and there was a resulting call to make the whole Internet backbone "9K
clean".

Unfortunately, it appears nobody did much testing on anything with a
page size smaller than 16K. Because that's the other interesting thing
about SGI's result that seemingly didn't occur to them at the time:

It turns out that 9K has _two_ important properties that made it appear
optimal in their benchmark: it is *large* enough to hold an encapsulated
8K NFS transaction but it is also *small* enough to fit in one hardware
page, thus minimizing memory allocation overhead (and presumably thrashing
of pages between CPUs in their large multiprocessor system, too). But
that is only true because they had a 16K page size.

Anyway, there were other large frame sizes "in play" at the time: 3Com
and others were pushing the 4K FDDI frame size for use with Ethernet,
for example, even at 100Mbit. And early Gigabit Ethernet equipment
supported a whole range of weird frame sizes all approximately 9K -- for
example at one point I evaluated switches for an HPC cluster and found
they had maximum MTUs of 8998, 9000, 9024, and 9216 bytes respectively.

Unsurprisingly consumers and system designers started to do their own
testing to see which of these frame sizes might be optimal. This would
be around 2001-2002. CPUs were slower and network adapters didn't do
coalescing optimizations like segmentation offload or large receive, so
it was a lot easier to see the effect of reducing packet count. I know
Jason Thorpe and I had a long discussion of this in public email -- I
thought it was on one of the NetBSD lists, but searching around, it
looks like it may not have been. Right around the same time we were
converging on an optimal MTU of 8k - headers, one of the FreeBSD developers
profiled their kernel while benchmarking with different frame sizes and
made this beautiful graph that explained why this size was a win: because
it minimized the allocation overhead for unit data transferred.

As I recall, the difference between (4k - headers) and (8k - headers) on
a system with 4K pages is not actually that large, but 8k wins slightly.
What is definitely a lose is 9K, where you allocate a whole extra page
but can't really put any useful amount of data in it.

Of course none of this benchmarking is even worth doing if you're not
careful to eliminate other bottlenecks -- particularly, small socket
buffer sizes, which artificially constrain the TCP window and are
probably the single most common cause of "poor throughput" complaints
on our mailing lists -- though of course that's only for a single-stream
test; for a multi-stream application, you often _want_ small socket buffer
sizes, for fairness. Measuring this stuff and then applying the results
to the real world is not simple.

But I firmly believe you have to start with solid measurements of the
basic numbers (such as single-stream TCP throughput) before you try to
draw general conclusions from benchmarks of specific very complex
applications where there are many confounding factors -- like, say,
iSCSI between virtual machines. And if it weren't 9AM on a Sunday and
my very cute 4 year old weren't clamoring for attention, I'd fire up
a couple of spare systems and generate some results for you -- but it
is, so you'll have to run the test yourself, or perhaps pester me to
do it later.

One thing to be _very_ aware of, which changes the balance point
between large and small maximum frame size considerably, though, is
logical frame coalescing -- on receive and on send. This lets the
kernel (and the adapter's descriptor mechanism) treat small frames
like large ones and eliminate most of the software overhead involved
with small frame sizes. This is "segmentation offload" or "large send"
on the transmit side, and "receive side coalescing" or "large receive"
on the receive side.

NetBSD supports segmentation offload but *not* receive side coalescing.
So results with NetBSD, particularly for receive side CPU consumption,
at small frame sizes, may be different (worse) than what you see with
some other operating systems -- and the beneficial effect of large
frames larger, as well.
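
On NetBSD the offload capabilities a driver advertises, and which of them
are enabled, are visible per interface; a sketch, assuming the nfe0
interface from the original post (whether nfe(4) actually offers TSO4
depends on the chip and driver version):

ifconfig nfe0          # look for TSO4 in the capabilities=/enabled= lines
ifconfig nfe0 tso4     # try to enable IPv4 TCP segmentation offload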

Also, don't forget there are a lot of cheap network adapters out there
with poor (or poorly documented, thus poorly supported) receive-side
interrupt moderation, and that's another case where reducing frame
count for the same amount of data transferred really helps.
Post by Tony Bourke
Post by Thor Lancelot Simon
I really can't agree with you about path MTU discovery, either. With
proper blackhole detection (and if you don't have that, then you
should not be using path MTU discovery at all) it's plenty reliable;
and in any event, using a large local MTU won't cause a sudden magic
change of default to use the link layer MTU as the inital MTU for
remote peers anyway; only local ones.
If you're talking about MSS clamping, I agree.
I don't know what MSS clamping has to do with this, so I can't comment.
My basic point is that running path MTU discovery without blackhole
discovery is insane (even on IPv6 networks) but that with it, it works
fine; also that using path MTU discovery does *not* cause large packets
to be sent to remote peers without probing, so even in networks with
broken routers that don't generate or pass needs-frag ICMP messages,
path MTU does work -- which means mismatched local and remote MTU size
across internets is harmless.
Post by Tony Bourke
But at the same time, is the original poster upping his MTU going to
help?
Quite possibly: he may have a stupid network card without large send
optimization or decent receive interrupt moderation, and he may have
his socket buffers set wrong (which makes TCP very sensitive to
latency even when there's plenty of bandwidth available). In these
cases, using a larger MTU might immediately show him better performance;
and it is very easy for him to test, rather than simply being persuaded
not to because of random references to magazine articles thrown out
onto a mailing list.
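
Testing that is a one-line change on each end, assuming the driver, the
client NIC and the switch all accept the larger size (interface names here
are only examples):

ifconfig nfe0 mtu 8000                  # on the NetBSD server
sudo ip link set dev eth0 mtu 8000      # on the Ubuntu client

If anything in between can't handle the larger frames they may simply be
dropped, so change it back if throughput drops rather than rises.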

The best suggestion in this thread, however, was David's, namely that
he ensure his socket buffer sizes are appropriately large.
Post by Tony Bourke
Unlikely, and it would likely complicate his network needlessly.
That's your opinion. Unfortunately, you backed it up with a lot of
vague details which were, as far as I can tell, just plain wrong (like
your claim that path MTU doesn't work, and your wrong math about PCI
bandwidth).

When misinformation like that flows out onto our lists and isn't
contradicted, it's a problem for everyone. So I apologize if you feel
I'm being unduly adversarial, but I think it really is important that
when a user asks for help with _X_, and someone responds with related
misinformation about _Y_, both topics receive equal attention and we
try to end up with the best possible answer to each.

Thor

Hisashi T Fujinaka
2011-10-09 22:31:13 UTC
Post by Thor Lancelot Simon
Post by Tony Bourke
PCIe will handle full duplex 1 gigabit no problem.
The claim is true.
My day job is LAN network performance. One lane of PCIe Gen 1 will
barely handle line rate, but it does handle line rate.
--
Hisashi T Fujinaka - ***@twofifty.com
BSEE(6/86) + BSChem(3/95) + BAEnglish(8/95) + MSCS(8/03) + $2.50 = latte
Tony Bourke
2011-10-09 22:36:41 UTC
Post by Hisashi T Fujinaka
Post by Thor Lancelot Simon
Post by Tony Bourke
PCIe will handle full duplex 1 gigabit no problem.
The claim is true.
My day job is LAN netwwork performance. One lane of PCIe Gen 1 will
barely handle line rate, but it does handle line rate.
Yeah, PCIe x1 is 2 Gbits, which is just enough for full duplex. Since
it's a serial bus (unlike PCI's shared/arbitrated bus) with dedicated
bandwidth, you can count on the 2 Gbits at all times too.

Tony
Hisashi T Fujinaka
2011-10-09 22:57:37 UTC
Post by Hisashi T Fujinaka
Post by Thor Lancelot Simon
Post by Tony Bourke
PCIe will handle full duplex 1 gigabit no problem.
The claim is true.
My day job is LAN netwwork performance. One lane of PCIe Gen 1 will
barely handle line rate, but it does handle line rate.
Yeah, PCIe x1 is 2 Gbits, which is just enough for full duplex. Since it's a
serial bus (unlike PCI's shared/arbitrated bus) with dedicated bandwidth, you
can count on the 2 Gbits at all times too.
The bandwidth is usually measured in Gigatransfers/sec, but remember
PCIe overhead. The maximum data transfer is probably 128 bytes and with
all the memory reads/writes/writebacks and 8b/10b encoding, you'll
likely hit max PCIe bus utilization at line rate.
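
A rough sketch of that arithmetic (the exact TLP overhead depends on the
header format and whether ECRC is used): a Gen 1 lane runs at 2.5 GT/s,
and 8b/10b encoding leaves 2.0 Gbit/s = 250 MB/s of raw data per direction.
With a 128-byte maximum payload and roughly 20-24 bytes of TLP header,
sequence number, LCRC and framing per packet, usable payload bandwidth is
on the order of 250 * 128/152 = ~210 MB/s per direction, before descriptor
fetches and completions are counted - while a saturated gigabit link
carries about 125 MB/s of frame data each way.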
--
Hisashi T Fujinaka - ***@twofifty.com
BSEE(6/86) + BSChem(3/95) + BAEnglish(8/95) + MSCS(8/03) + $2.50 = latte
Matthias Scheler
2011-10-11 07:01:02 UTC
Post by Tony Bourke
Yeah, PCIe x1 is 2 Gbits, which is just enough for full duplex.
A single PCIe 1.1 lane provides a throughput of 250MB/s (megabytes per
second) in each direction (full duplex). This is actually enough for two
gigabit Ethernet links at full duplex.

Kind regards
--
Matthias Scheler http://zhadum.org.uk/
David Young
2011-10-09 16:24:55 UTC
Post by Ravishankar S
Thanks to all who replied. dd now shows 67 MB/sec!!. However, i have
beyond this point. I have a dlink gigabit router and gigabit cards on
both ends. client is an ubuntu laptop
I use these sysctl settings for gigabit TCP performance:

kern.sbmax=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvspace=262144
net.inet.tcp.sendspace=262144
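
These can be set at run time with sysctl -w, and the same name=value lines
can go into /etc/sysctl.conf to persist across reboots, e.g.:

sysctl -w kern.sbmax=16777216
sysctl -w net.inet.tcp.recvbuf_max=16777216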

Dave
--
David Young OJC Technologies is now Pixo
***@pixotech.com Urbana, IL (217) 344-0444 x24
Thor Lancelot Simon
2011-10-09 19:09:31 UTC
Post by Ravishankar S
Thanks to all who replied. dd now shows 67 MB/sec!!. However, i have
beyond this point. I have a dlink gigabit router and gigabit cards on
both ends. client is an ubuntu laptop
Can your laptop really sustain more than 25MB/sec of writes to its disk?

Thor
matthew green
2011-10-10 02:17:43 UTC
Post by Thor Lancelot Simon
Post by Ravishankar S
Thanks to all who replied. dd now shows 67 MB/sec!!. However, i have
beyond this point. I have a dlink gigabit router and gigabit cards on
both ends. client is an ubuntu laptop
Can your laptop really sustain more than 25MB/sec of writes to its disk?
dd if=/dev/zero bs=1m count=8192 of=bigfile
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 124.189 secs (69168240 bytes/sec)

this is just a normal 500G laptop disk, not quite 2 years old.


.mrg.
Marcin M. Jessa
2011-10-11 21:13:10 UTC
Post by matthew green
Post by Thor Lancelot Simon
Post by Ravishankar S
Thanks to all who replied. dd now shows 67 MB/sec!!. However, i have
beyond this point. I have a dlink gigabit router and gigabit cards on
both ends. client is an ubuntu laptop
Can your laptop really sustain more than 25MB/sec of writes to its disk?
dd if=/dev/zero bs=1m count=8192 of=bigfile
8192+0 records in
8192+0 records out
8589934592 bytes transferred in 124.189 secs (69168240 bytes/sec)
this is just a normal 500G laptop disk, not quite 2 years old.
Yeah, I easily get 80MB/s scp'ing files from/to my 2-year-old laptop.
With your dd test I get 50MB/s.
--
Marcin M. Jessa
Matthias Scheler
2011-10-11 06:57:23 UTC
Post by Ravishankar S
Thanks to all who replied. dd now shows 67 MB/sec!!. However, i have
beyond this point.
FTP is not suitable as a benchmark for network throughput. Please try
"benchmarks/netio" or "benchmarks/ttcp" from pkgsrc instead.

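Assuming a pkgsrc tree checked out under /usr/pkgsrc, building either of
them is the usual:

cd /usr/pkgsrc/benchmarks/ttcp && make install clean
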
Kind regards
--
Matthias Scheler http://zhadum.org.uk/
David Laight
2011-10-11 10:15:53 UTC
Post by Matthias Scheler
Post by Ravishankar S
Thanks to all who replied. dd now shows 67 MB/sec!!. However, i have
beyond this point.
FTP is not suitable as benchmark for network throughput. Please try
"benchmark/netio" or "benchmark/ttcp" from "pkgsrc" instead.
I have sometimes measured the time to ftp large sparse files
to /dev/null; that does give a useful limit for ftp.

With some copy operations (I can't remember if ftp is one of them) it
can make a difference whether you are 'pulling' or 'pushing' the file.
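
A sketch of that trick: create a large, mostly-hole file on the server
(reading the holes back costs essentially no disk bandwidth), then fetch
it into /dev/null so neither end's disk is involved. The paths, sizes and
URL form (NetBSD's ftp(1)) here are only examples:

dd if=/dev/zero of=/store/sparse.img bs=1m count=1 seek=4095   # ~4GB, almost all holes
ftp -o /dev/null ftp://server/store/sparse.img                 # run on the client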

David
--
David Laight: ***@l8s.co.uk