Discussion:
Testing 7.0 Beta: FFS still very slow when creating files
Stephan
2014-08-22 15:59:37 UTC
Permalink
Hi,

I´m testing a current build of NetBSD 7.0 BETA and found that an old
FFS issue is still there. It was discussed here:

https://mail-index.netbsd.org/tech-kern/2012/10/25/msg014236.html

Creating many files on FFSv2 is very slow. One example from this box:

# time sh -c 'seq 1 30000|xargs touch '
114.48s real 0.23s user 113.89s system


load averages: 0.28, 0.10, 0.03; up 0+00:14:04 17:53:22
23 processes: 21 sleeping, 2 on CPU
CPU states: 0.0% user, 0.0% nice, 50.0% system, 0.0% interrupt, 50.0% idle
Memory: 44M Act, 7844K Exec, 26M File, 1827M Free
Swap: 2040M Total, 2040M Free

Remarkable is that the disk does almost nothing during this, but one
CPU is completely utilized in kernel mode. Reopening those files is
much faster (as fast as creating them should be):

# time sh -c 'seq 1 30000|xargs touch '
1.09s real 0.16s user 1.01s system

I see something similar happening on tmpfs:

# time sh -c 'seq 1 30000|xargs touch '
36.34s real 0.38s user 36.04s system

Creating files is faster, but still not as fast as it should be. The
only filesystem that (like before) performs fast is the experimental
ZFS port:

# time sh -c 'seq 1 30000|xargs touch '
3.73s real 0.19s user 3.63s system


Has anybody an idea on this or how to track this down? At the moment,
I can´t even enter ddb using Strg+Alt+Esc keys for some reason. I´ve
also seen people playing with dtrace but that doesn´t seem to be
included.


Regards,

Stephan


PS:

My filesystems:

# mount
/dev/wd0a on / type ffs (local)
/dev/wd0e on /a type ffs (log, local)
/dev/wd0g on /c type ffs (local)
kernfs on /kern type kernfs (local)
ptyfs on /dev/pts type ptyfs (local)
procfs on /proc type procfs (local)
tmpfs on /var/shm type tmpfs (local)
mypool on /mypool type zfs (local)
tmpfs on /A type tmpfs (local)

BTW: I told sysinst to make the root filesystem a FFSv2 one, but I
think the "log" option is missing in /ets/fstab. I tested on /a.


This is what lockstat says on FFS:

# lockstat time sh -c 'seq 1 30000|xargs touch '
114.56 real 0.23 user 113.95 sys
Elapsed time: 114.58 seconds.

-- Adaptive mutex spin

Total% Count Time/ms Lock Caller
------ ------- --------- ---------------------- ------------------------------
98.59 15 81.10 mntvnode_lock <all>
98.59 14 81.10 mntvnode_lock vfs_insmntque+2a
0.00 1 0.00 mntvnode_lock vfs_vnode_iterator_destroy+14
0.78 47 0.64 fffffe807f4f2140 <all>
0.64 43 0.52 fffffe807f4f2140 uvm_fault_internal+857
0.13 3 0.11 fffffe807f4f2140 uvm_unmap_remove+200
0.01 1 0.01 fffffe807f4f2140 genfs_lock+83
0.56 4 0.46 fffffe807f4f2c80 cache_lookup+62
0.02 1 0.01 fffffe807f4f2580 uvm_fault_internal+857
0.02 1 0.01 fffffe80779f0840 uvm_fault_internal+363
0.01 20 0.01 bufcache_lock <all>
0.01 10 0.01 bufcache_lock getblk+2f
0.00 6 0.00 bufcache_lock bdwrite+146
0.00 2 0.00 bufcache_lock brelse+19
0.00 2 0.00 bufcache_lock bwrite+e5
0.01 1 0.01 fffffe8077679440 uvm_fault_internal+363
0.01 1 0.01 fffffe807eb8cf80 uvm_fault_internal+857
0.00 15 0.00 uvm_pageqlock <all>
0.00 6 0.00 uvm_pageqlock uvm_fault_internal+94d
0.00 5 0.00 uvm_pageqlock uvm_fault_lower_enter+189
0.00 4 0.00 uvm_pageqlock uvm_fault_upper_enter.isra.5+11
0.00 1 0.00 fffffe807d4b63f8 wapbl_add_buf+3f

-- Adaptive mutex sleep

Total% Count Time/ms Lock Caller
------ ------- --------- ---------------------- ------------------------------
100.00 1 4.76 mntvnode_lock vfs_insmntque+2a

-- Spin mutex spin

Total% Count Time/ms Lock Caller
------ ------- --------- ---------------------- ------------------------------
52.32 1 0.00 fffffe807f4f2bc0 mi_switch+bd
47.68 3 0.00 uvm_fpageqlock <all>
30.33 2 0.00 uvm_fpageqlock uvm_pagealloc_strat+c3
17.35 1 0.00 uvm_fpageqlock uvm_pagefree+24c

-- RW lock sleep (writer)

Total% Count Time/ms Lock Caller
------ ------- --------- ---------------------- ------------------------------
100.00 9 0.28 fffffe807d4b6400 wapbl_flush+84

-- RW lock sleep (reader)

Total% Count Time/ms Lock Caller
------ ------- --------- ---------------------- ------------------------------
100.00 10 275.20 fffffe807d4b6400 wapbl_begin+6a

-- RW lock spin

Total% Count Time/ms Lock Caller
------ ------- --------- ---------------------- ------------------------------
97.81 2 0.09 fffffe807d4b6400 wapbl_begin+6a
2.19 1 0.00 fffffe807ed96360 genfs_lock+7b

-- Kernel lock spin

Total% Count Time/ms Lock Caller
------ ------- --------- ---------------------- ------------------------------
100.00 42 4.94 kernel_lock <all>
97.51 10 4.81 kernel_lock bdev_strategy+53
2.02 18 0.10 kernel_lock VOP_POLL+43
0.31 9 0.02 kernel_lock intr_biglock_wrapper+12
0.14 4 0.01 kernel_lock biodone2+6f
0.02 1 0.00 kernel_lock VOP_LOCK+43

And the box´s dmesg:

# dmesg
Copyright (c) 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005,
2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014
The NetBSD Foundation, Inc. All rights reserved.
Copyright (c) 1982, 1986, 1989, 1991, 1993
The Regents of the University of California. All rights reserved.

NetBSD 7.0_BETA (GENERIC.201408210820Z)
total memory = 2039 MB
avail memory = 1963 MB
kern.module.path=/stand/amd64/7.0/modules
timecounter: Timecounters tick every 10.000 msec
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
MICRO-STAR INTERNATIONAL CO., LTD MS-7418 ( )
mainbus0 (root)
ACPI: RSDP 0xf91a0 000024 (v02 ACPIAM)
ACPI: XSDT 0x7f7c0100 000054 (v01 MSI_NB NOTEBOOK 20081119 MSFT 00000097)
ACPI: FACP 0x7f7c0290 0000F4 (v03 111908 FACP1105 20081119 MSFT 00000097)
ACPI: DSDT 0x7f7c05c0 003E75 (v01 A7418 A7418170 00000170 INTL 20051117)
ACPI: FACS 0x7f7ce000 000040
ACPI: APIC 0x7f7c0390 00006C (v01 111908 APIC1105 20081119 MSFT 00000097)
ACPI: MCFG 0x7f7c0400 00003C (v01 111908 OEMMCFG 20081119 MSFT 00000097)
ACPI: SLIC 0x7f7c0440 000176 (v01 MSI_NB NOTEBOOK 20081119 MSFT 00000097)
ACPI: OEMB 0x7f7ce040 000061 (v01 111908 OEMB1105 20081119 MSFT 00000097)
ACPI: HPET 0x7f7c4440 000038 (v01 111908 OEMHPET 20081119 MSFT 00000097)
ACPI: All ACPI Tables successfully acquired
cpu0 at mainbus0 apid 0: Intel(R) Atom(TM) CPU 230 @ 1.60GHz, id 0x106c2
cpu1 at mainbus0 apid 1: Intel(R) Atom(TM) CPU 230 @ 1.60GHz, id 0x106c2
ioapic0 at mainbus0 apid 2: pa 0xfec00000, version 0x20, 24 pins
acpi0 at mainbus0: Intel ACPICA 20131218
acpi0: X/RSDT: OemId <MSI_NB,NOTEBOOK,20081119>, AslId <MSFT,00000097>
acpi0: SCI interrupting at int 9
timecounter: Timecounter "ACPI-Fast" frequency 3579545 Hz quality 1000
hpet0 at acpi0: high precision event timer (mem 0xfed00000-0xfed00400)
timecounter: Timecounter "hpet0" frequency 14318180 Hz quality 2000
MCH (PNP0C01) at acpi0 not configured
attimer1 at acpi0 (TMR, PNP0100): io 0x40-0x43 irq 0
pcppi1 at acpi0 (SPKR, PNP0800): io 0x61
midi0 at pcppi1: PC speaker
sysbeep0 at pcppi1
RMSC (PNP0C02) at acpi0 not configured
FWH (INT0800) at acpi0 not configured
FWHE (PNP0C02) at acpi0 not configured
OMSC (PNP0C02) at acpi0 not configured
PCIE (PNP0C02) at acpi0 not configured
RMEM (PNP0C01) at acpi0 not configured
acpibut0 at acpi0 (PWRB, PNP0C0C-170): ACPI Power Button
ACPI: Enabled 1 GPEs in block 00 to 1F
ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S2_]
(20131218/hwxface-646)
ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S3_]
(20131218/hwxface-646)
attimer1: attached to pcppi1
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
pchb0 at pci0 dev 0 function 0: vendor 0x8086 product 0x2770 (rev. 0x02)
agp0 at pchb0: i915-family chipset
agp0: detected 7932k stolen memory
agp0: aperture at 0xe0000000, size 0x10000000
vga0 at pci0 dev 2 function 0: vendor 0x8086 product 0x2772 (rev. 0x02)
wsdisplay0 at vga0 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
i915drm0 at vga0: Intel i945G
i915drm0: AGP at 0xe0000000 256MB
i915drm0: Initialized i915 1.6.0 20080730
hdaudio0 at pci0 dev 27 function 0: HD Audio Controller
hdaudio0: interrupting at ioapic0 pin 16
hdafg0 at hdaudio0: Realtek ALC888
hdafg0: DAC00 8ch: Speaker [Jack]
hdafg0: DAC01 2ch: HP Out [Jack]
hdafg0: ADC02 2ch: Line In [Jack], Mic In [Jack]
hdafg0: ADC03 2ch: Mic In [Jack]
hdafg0: 8ch/2ch 44100Hz 48000Hz 96000Hz 192000Hz PCM16 PCM20 PCM24 AC3
audio0 at hdafg0: full duplex, playback, capture, mmap, independent
ppb0 at pci0 dev 28 function 0: vendor 0x8086 product 0x27d0 (rev. 0x01)
ppb0: PCI Express 1.0 <Root Port of PCI-E Root Complex> x1 @ 2.5Gb/s
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled, rd/line, wr/inv ok
re0 at pci1 dev 0 function 0: RealTek 8168/8111 PCIe Gigabit Ethernet
(rev. 0x02)
re0: interrupting at ioapic0 pin 16
re0: Ethernet address 00:21:85:97:da:09
re0: using 256 tx descriptors
rgephy0 at re0 phy 7: RTL8169S/8110S/8211 1000BASE-T media interface, rev. 2
rgephy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-FDX, auto
ppb1 at pci0 dev 28 function 1: vendor 0x8086 product 0x27d2 (rev. 0x01)
ppb1: PCI Express 1.0 <Root Port of PCI-E Root Complex> x1 @ 2.5Gb/s
pci2 at ppb1 bus 2
pci2: i/o space, memory space enabled, rd/line, wr/inv ok
ppb2 at pci0 dev 28 function 2: vendor 0x8086 product 0x27d4 (rev. 0x01)
ppb2: PCI Express 1.0 <Root Port of PCI-E Root Complex> x1 @ 2.5Gb/s
pci3 at ppb2 bus 3
pci3: i/o space, memory space enabled, rd/line, wr/inv ok
ppb3 at pci0 dev 28 function 3: vendor 0x8086 product 0x27d6 (rev. 0x01)
ppb3: PCI Express 1.0 <Root Port of PCI-E Root Complex> x1 @ 2.5Gb/s
pci4 at ppb3 bus 4
pci4: i/o space, memory space enabled, rd/line, wr/inv ok
uhci0 at pci0 dev 29 function 0: vendor 0x8086 product 0x27c8 (rev. 0x01)
uhci0: interrupting at ioapic0 pin 23
usb0 at uhci0: USB revision 1.0
uhci1 at pci0 dev 29 function 1: vendor 0x8086 product 0x27c9 (rev. 0x01)
uhci1: interrupting at ioapic0 pin 19
usb1 at uhci1: USB revision 1.0
uhci2 at pci0 dev 29 function 2: vendor 0x8086 product 0x27ca (rev. 0x01)
uhci2: interrupting at ioapic0 pin 18
usb2 at uhci2: USB revision 1.0
uhci3 at pci0 dev 29 function 3: vendor 0x8086 product 0x27cb (rev. 0x01)
uhci3: interrupting at ioapic0 pin 16
usb3 at uhci3: USB revision 1.0
ehci0 at pci0 dev 29 function 7: vendor 0x8086 product 0x27cc (rev. 0x01)
ehci0: interrupting at ioapic0 pin 23
ehci0: EHCI version 1.0
ehci0: companion controllers, 2 ports each: uhci0 uhci1 uhci2 uhci3
usb4 at ehci0: USB revision 2.0
ppb4 at pci0 dev 30 function 0: vendor 0x8086 product 0x244e (rev. 0xe1)
pci5 at ppb4 bus 5
pci5: i/o space, memory space enabled
ichlpcib0 at pci0 dev 31 function 0: vendor 0x8086 product 0x27b8 (rev. 0x01)
timecounter: Timecounter "ichlpcib0" frequency 3579545 Hz quality 1000
ichlpcib0: 24-bit timer
ichlpcib0: TCO (watchdog) timer configured.
piixide0 at pci0 dev 31 function 1: Intel 82801GB/GR IDE Controller
(ICH7) (rev. 0x01)
piixide0: bus-master DMA support present
piixide0: primary channel configured to compatibility mode
piixide0: primary channel interrupting at ioapic0 pin 14
atabus0 at piixide0 channel 0
piixide0: secondary channel configured to compatibility mode
piixide0: secondary channel interrupting at ioapic0 pin 15
atabus1 at piixide0 channel 1
piixide1 at pci0 dev 31 function 2: Intel 82801GB/GR Serial ATA/Raid
Controller (ICH7) (rev. 0x01)
piixide1: bus-master DMA support present
piixide1: primary channel configured to native-PCI mode
piixide1: using ioapic0 pin 19 for native-PCI interrupt
atabus2 at piixide1 channel 0
piixide1: secondary channel configured to native-PCI mode
atabus3 at piixide1 channel 1
ichsmb0 at pci0 dev 31 function 3: vendor 0x8086 product 0x27da (rev. 0x01)
ichsmb0: interrupting at ioapic0 pin 19
iic0 at ichsmb0: I2C bus
isa0 at ichlpcib0
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
acpicpu0 at cpu0: ACPI CPU
acpicpu0: C1: HLT, lat 0 us, pow 0 mW
coretemp0 at cpu0: thermal sensor, 1 C resolution
acpicpu1 at cpu1: ACPI CPU
timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
uhub0 at usb1: vendor 0x8086 UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhub1 at usb3: vendor 0x8086 UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhub2 at usb0: vendor 0x8086 UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
uhub3 at usb2: vendor 0x8086 UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub3: 2 ports with 2 removable, self powered
uhub4 at usb4: vendor 0x8086 EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub4: 8 ports with 8 removable, self powered
IPsec: Initialized Security Association Processing.
umass0 at uhub4 port 6 configuration 1 interface 0
umass0: Generic USB2.0-CRW, rev 2.00/58.87, addr 2
umass0: using SCSI over Bulk-Only
scsibus0 at umass0: 2 targets, 1 lun per target
sd0 at scsibus0 target 0 lun 0: <Generic-, Multi-Card, 1.00> disk removable
sd0: drive offline
wd0 at atabus2 drive 0
wd0: <WDC WD3200AAJS-22B4A0>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 298 GB, 620181 cyl, 16 head, 63 sec, 512 bytes/sect x 625142448 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd0(piixide1:0:0): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)
atapibus0 at atabus3: 2 targets
cd0 at atapibus0 drive 0: <ATAPI DVD A DH20A6S, , 7P55> cdrom removable
cd0: 32-bit data port
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
cd0(piixide1:1:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
Kernelized RAIDframe activated
pad0: outputs: 44100Hz, 16-bit, stereo
audio1 at pad0: half duplex, playback, capture
boot device: wd0
root on wd0a dumps on wd0b
root file system type: ffs
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)
Edgar Fuß
2014-08-22 16:11:57 UTC
Permalink
I cannot contribute to your problem but the problem originally discussed in
the thread you are referring to at
Post by Stephan
https://mail-index.netbsd.org/tech-kern/2012/10/25/msg014236.html
has been tracked down (by me) and solved by ***@.
It was RAID specific and due to WAPBL not honoring block alignments
when flushing the log. It saturated the RAID components, not the CPU.
Taylor R Campbell
2014-08-22 16:29:00 UTC
Permalink
Date: Fri, 22 Aug 2014 17:59:37 +0200
From: Stephan <***@googlemail.com>

Has anybody an idea on this or how to track this down? At the moment,
I can't even enter ddb using Strg+Alt+Esc keys for some reason. I've
also seen people playing with dtrace but that doesn't seem to be
included.

Dtrace may be a good idea. You can use it by

(a) using a kernel built with `options KDTRACE_HOOKS',
(b) using a userland built with MKDTRACE=yes,
(c) modload /stand/ARCH/VERSION/solaris.kmod
modload /stand/ARCH/VERSION/dtrace.kmod
modload /stand/ARCH/VERSION/fbt.kmod
modload /stand/ARCH/VERSION/sdt.kmod
(d) mkdir /dev/dtrace && mknod /dev/dtrace/dtrace c dtrace

(Yes, this is too much work. Someone^TM should turn it all on by
default for netbsd-7...!)
From the lockstat output it looks like there's a lot of use of
mntvnode_lock, which suggests this may be related to hannken@'s vnode
cache changes. Might be worthwhile to sample stack traces of
vfs_insmntque, with something like

dtrace -n 'fbt::vfs_insmntqueue:entry { @[stack()]++ }'

or perhaps sample stack traces of the mutex_enters of mntvnode_lock.
Stephan
2014-08-24 13:26:46 UTC
Permalink
Okay maybe I should wait for dtrace to be enabled and working by default.

As I said, I could´t even enter ddb. Otherwise I could have placed a
breakpoint on mutex_enter() to get example stack traces. I wonder,
though, if this really can be related to locking as this test is
basically single-threaded.

Regards,

Stephan
Post by Taylor R Campbell
Date: Fri, 22 Aug 2014 17:59:37 +0200
Has anybody an idea on this or how to track this down? At the moment,
I can't even enter ddb using Strg+Alt+Esc keys for some reason. I've
also seen people playing with dtrace but that doesn't seem to be
included.
Dtrace may be a good idea. You can use it by
(a) using a kernel built with `options KDTRACE_HOOKS',
(b) using a userland built with MKDTRACE=yes,
(c) modload /stand/ARCH/VERSION/solaris.kmod
modload /stand/ARCH/VERSION/dtrace.kmod
modload /stand/ARCH/VERSION/fbt.kmod
modload /stand/ARCH/VERSION/sdt.kmod
(d) mkdir /dev/dtrace && mknod /dev/dtrace/dtrace c dtrace
(Yes, this is too much work. Someone^TM should turn it all on by
default for netbsd-7...!)
From the lockstat output it looks like there's a lot of use of
cache changes. Might be worthwhile to sample stack traces of
vfs_insmntque, with something like
or perhaps sample stack traces of the mutex_enters of mntvnode_lock.
Taylor R Campbell
2014-08-25 16:00:05 UTC
Permalink
Date: Sun, 24 Aug 2014 15:26:46 +0200
From: Stephan <***@googlemail.com>

Okay maybe I should wait for dtrace to be enabled and working by default.

Well, plugh. Every one of the instructions I rattled off from memory
was slightly wrong, and apparently the default amd64 kernel doesn't
have enough KVA reserved to even load the dtrace modules. Here are
the right instructions:

https://wiki.netbsd.org/tutorials/how_to_enable_and_run_dtrace/

And NKL2_KIMG_ENTRIES has to be bumped from 16 to 32 in
sys/arch/amd64/include/pmap.h for the kernel to have enough KVA to
load the modules -- something we definitely ought to fix pronto.

As I said, I could´t even enter ddb. Otherwise I could have placed a
breakpoint on mutex_enter() to get example stack traces. I wonder,
though, if this really can be related to locking as this test is
basically single-threaded.

Could be, but I seem to recall we already isolated another
vcache-related performance problem to excessive lock/unlock cycles in
a single-threaded test.
J. Hannken-Illjes
2014-08-24 16:57:08 UTC
Permalink
Post by Taylor R Campbell
Date: Fri, 22 Aug 2014 17:59:37 +0200
Has anybody an idea on this or how to track this down? At the moment,
I can't even enter ddb using Strg+Alt+Esc keys for some reason. I've
also seen people playing with dtrace but that doesn't seem to be
included.
Dtrace may be a good idea. You can use it by
(a) using a kernel built with `options KDTRACE_HOOKS',
(b) using a userland built with MKDTRACE=yes,
(c) modload /stand/ARCH/VERSION/solaris.kmod
modload /stand/ARCH/VERSION/dtrace.kmod
modload /stand/ARCH/VERSION/fbt.kmod
modload /stand/ARCH/VERSION/sdt.kmod
(d) mkdir /dev/dtrace && mknod /dev/dtrace/dtrace c dtrace
(Yes, this is too much work. Someone^TM should turn it all on by
default for netbsd-7...!)
From the lockstat output it looks like there's a lot of use of
cache changes. Might be worthwhile to sample stack traces of
vfs_insmntque, with something like
or perhaps sample stack traces of the mutex_enters of mntvnode_lock.
This was my first guess too ...

I tried to bisect and got an increase in time from ~15 secs to ~24 secs
between the time stamps '2012-09-18 06:00 UTC' '2012-09-18 09:00 UTC'.

Someone should redo this test as this interval is the import of the
compiler (GCC 4.5.3 -> 4.5.4) and I had to rebuild tools. I cant
believe this to be a compiler problem.

--
J. Hannken-Illjes - ***@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Btw.: my test script is:

#! /bin/sh

mdconfig /dev/md0d 2048000 &
P=${!}
newfs /dev/rmd0a
mount -t ffs -o log /dev/md0a /mnt

(cd /mnt && time sh -c 'seq 1 30000|xargs touch')

umount /mnt
kill ${P}
J. Hannken-Illjes
2014-08-25 13:55:53 UTC
Permalink
On 24 Aug 2014, at 18:57, J. Hannken-Illjes <***@eis.cs.tu-bs.de> wrote:

<snip>
Post by J. Hannken-Illjes
I tried to bisect and got an increase in time from ~15 secs to ~24 secs
between the time stamps '2012-09-18 06:00 UTC' '2012-09-18 09:00 UTC'.
Someone should redo this test as this interval is the import of the
compiler (GCC 4.5.3 -> 4.5.4) and I had to rebuild tools. I cant
believe this to be a compiler problem.
GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16.

Could this be the cause of this big loss in performance?

--
J. Hannken-Illjes - ***@eis.cs.tu-bs.de - TU Braunschweig (Germany)
Taylor R Campbell
2014-08-25 15:39:21 UTC
Permalink
Date: Mon, 25 Aug 2014 15:55:53 +0200
From: "J. Hannken-Illjes" <***@eis.cs.tu-bs.de>

GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16.

Could this be the cause of this big loss in performance?

Shouldn't be too hard to test this. Perhaps try dropping in the
following replacements for the vcache key comparison and running the
test for each one?
J. Hannken-Illjes
2014-08-25 16:00:26 UTC
Permalink
Post by Taylor R Campbell
Date: Mon, 25 Aug 2014 15:55:53 +0200
GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern.
See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16.
Could this be the cause of this big loss in performance?
Shouldn't be too hard to test this. Perhaps try dropping in the
following replacements for the vcache key comparison and running the
test for each one?
<memequal.c>
We are talking about a kernel from 2012/09 -- vcache came much later.

--
J. Hannken-Illjes - ***@eis.cs.tu-bs.de - TU Braunschweig (Germany)
J. Hannken-Illjes
2014-08-25 18:02:44 UTC
Permalink
Post by J. Hannken-Illjes
<snip>
Post by J. Hannken-Illjes
I tried to bisect and got an increase in time from ~15 secs to ~24 secs
between the time stamps '2012-09-18 06:00 UTC' '2012-09-18 09:00 UTC'.
Someone should redo this test as this interval is the import of the
compiler (GCC 4.5.3 -> 4.5.4) and I had to rebuild tools. I cant
believe this to be a compiler problem.
GCC 4.5.4 disabled builtin memcmp as x86 has no cmpmemsi pattern.
See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052, Comment 16.
Could this be the cause of this big loss in performance?
Short answer: it is -- reverting external/gpl3/gcc/dist/gcc/builtins.c
from Rev. 1.3 to 1.2 brings back the old times which are the same as
they were on NetBSD 6.

Given that this test has many calls to ufs_lookup/cache_lookup using
memcmp to check for equal filenames this is not a surprise.

A rather naive "implementation" of memcmp (see below) drops the running
time from ~15 sec to ~9 secs. We should consider improving our memcmp.

--
J. Hannken-Illjes - ***@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Index: libkern.h
===================================================================
RCS file: /cvsroot/src/sys/lib/libkern/libkern.h,v
retrieving revision 1.106
diff -p -u -2 -r1.106 libkern.h
--- libkern.h 30 Aug 2012 12:16:49 -0000 1.106
+++ libkern.h 25 Aug 2014 17:23:35 -0000
@@ -262,5 +262,18 @@ void *memset(void *, int, size_t);
#if __GNUC_PREREQ__(2, 95) && !defined(_STANDALONE)
#define memcpy(d, s, l) __builtin_memcpy(d, s, l)
-#define memcmp(a, b, l) __builtin_memcmp(a, b, l)
+static inline int __memcmp(const void *a, const void *b, size_t l)
+{
+ const unsigned char *pa = a, *pb = b;
+
+ if (l > 8)
+ return memcmp(a, b, l);
+ while (l-- > 0) {
+ if (__predict_false(*pa != *pb))
+ return *pa < *pb ? -1 : 1;
+ pa++; pb++;
+ }
+ return 0;
+}
+#define memcmp(a, b, l) __memcmp(a, b, l)
#endif
#if __GNUC_PREREQ__(2, 95) && !defined(_STANDALONE)
Taylor R Campbell
2014-08-25 21:09:24 UTC
Permalink
Date: Mon, 25 Aug 2014 20:02:44 +0200
From: "J. Hannken-Illjes" <***@eis.cs.tu-bs.de>

Short answer: it is -- reverting external/gpl3/gcc/dist/gcc/builtins.c
from Rev. 1.3 to 1.2 brings back the old times which are the same as
they were on NetBSD 6.

Given that this test has many calls to ufs_lookup/cache_lookup using
memcmp to check for equal filenames this is not a surprise.

A rather naive "implementation" of memcmp (see below) drops the running
time from ~15 sec to ~9 secs. We should consider improving our memcmp.

Sounds reasonable to me, although it looks like GCC's old builtin
memcmp expansion actually failed to implement our specification: it
returns -1, 0, or +1, like your patch, rather than the difference of
the first differing bytes or zero as our man page specifies. For most
uses it doesn't matter, of course, but we ought to make sure to follow
our own specification.
Joerg Sonnenberger
2014-08-25 21:37:36 UTC
Permalink
Post by Taylor R Campbell
Date: Mon, 25 Aug 2014 20:02:44 +0200
Short answer: it is -- reverting external/gpl3/gcc/dist/gcc/builtins.c
from Rev. 1.3 to 1.2 brings back the old times which are the same as
they were on NetBSD 6.
Given that this test has many calls to ufs_lookup/cache_lookup using
memcmp to check for equal filenames this is not a surprise.
A rather naive "implementation" of memcmp (see below) drops the running
time from ~15 sec to ~9 secs. We should consider improving our memcmp.
Sounds reasonable to me, although it looks like GCC's old builtin
memcmp expansion actually failed to implement our specification: it
returns -1, 0, or +1, like your patch, rather than the difference of
the first differing bytes or zero as our man page specifies. For most
uses it doesn't matter, of course, but we ought to make sure to follow
our own specification.
memcmp is only supposed to provide the correct sign, not the difference.

Joerg
matthew green
2014-08-25 22:04:11 UTC
Permalink
Post by Joerg Sonnenberger
Post by Taylor R Campbell
Sounds reasonable to me, although it looks like GCC's old builtin
memcmp expansion actually failed to implement our specification: it
returns -1, 0, or +1, like your patch, rather than the difference of
the first differing bytes or zero as our man page specifies. For most
uses it doesn't matter, of course, but we ought to make sure to follow
our own specification.
memcmp is only supposed to provide the correct sign, not the difference.
true, but that's not what memcmp(9) says. however, unless we have a
good reason, i think we should change our docs to fit.


.mrg.
Robert Elz
2014-08-26 00:15:03 UTC
Permalink
Date: Tue, 26 Aug 2014 08:04:11 +1000
From: matthew green <***@eterna.com.au>
Message-ID: <***@splode.eterna.com.au>

| Joerg Sonnenberger:
| > memcmp is only supposed to provide the correct sign, not the difference.
|
| true, but that's not what memcmp(9) says.

This is a "normal" problem with man pages - they're written to
document what the code actually does, then interpreted as a specification
of what the code is required to do. Man pages should be the former, the
latter is the job of standards docs.

If anything needs changing, it would be to make it more clear that the
man pages should not be interpreted as an interface specification, but
as a statement of what the implementations actually do - not to be
interpreted as a promise that they will always do that - for what can
be relied upon a reference should be made to the relevant standard
(which can be POSIX (or IEEE for C, or anyone else), or POSIX (etc) as
amended by NetBSD, or a NetBSD private standard for stuff that either
isn't documented by anyone else's standards doc, or where NetBSD's
version has simply decided to be different.

kre
Sverre Froyen
2014-08-26 02:08:03 UTC
Permalink
Post by Robert Elz
Date: Tue, 26 Aug 2014 08:04:11 +1000
| > memcmp is only supposed to provide the correct sign, not the difference.
|
| true, but that's not what memcmp(9) says.

Post by Robert Elz
If anything needs changing, it would be to make it more clear that the
man pages should not be interpreted as an interface specification, but
as a statement of what the implementations actually do - not to be
interpreted as a promise that they will always do that - for what can
be relied upon a reference should be made to the relevant standard
(which can be POSIX (or IEEE for C, or anyone else), or POSIX (etc) as
amended by NetBSD, or a NetBSD private standard for stuff that either
isn't documented by anyone else's standards doc, or where NetBSD's
version has simply decided to be different.
Does the NetBSD version originate with BSD? The man pages for FreeBSD,
OpenBSD and Mac OS X all have the same description.

Sverre
Robert Elz
2014-08-26 04:41:59 UTC
Permalink
Date: Mon, 25 Aug 2014 20:08:03 -0600
From: Sverre Froyen <***@viewmark.com>
Message-ID: <E88CBABB-1884-412B-A447-***@viewmark.com>

| Does the NetBSD version originate with BSD? The man pages for FreeBSD,
| OpenBSD and Mac OS X all have the same description.

Yes, and of course, as they all documented what the BSD implementation did
(that is just subtract bytes until a non-zero value resulted, or end of data).

But there never really was a requirement that it be implemented that way, an
architecture with a byte string compare instruction might leave the results
in the condition codes, which could be translated to -1, 0, 1 and meet the
implementation requirements - testing <0, ==0, or >0 are the only defined
actions on the result - any code assuming that the result tells the
magnitude of the difference between the first bytes that differ (which
would be a truly bizarre thing to care about, given there's no way to
discover from the result which bytes differed) would be broken.

kre
Alan Barrett
2014-08-26 08:27:09 UTC
Permalink
Post by Robert Elz
| > memcmp is only supposed to provide the correct sign, not
| > the difference.
| true, but that's not what memcmp(9) says.
This is a "normal" problem with man pages - they're written to
document what the code actually does, then interpreted as a
specification of what the code is required to do. Man pages
should be the former, the latter is the job of standards docs.
Often, there are no standards docs, and the man page has to serve
as both a specification of the parts of the interface that users
can depend on, and documentation of what the code actually does.
For example, it's possible to document "returns -ve, 0, or +ve"
in one part of the man page, as an interface specification, and
"returns the difference" in another part of th man page, as an
implementation note.
Post by Robert Elz
If anything needs changing, it would be to make it more clear
that the man pages should not be interpreted as an interface
specification, but as a statement of what the implementations
actually do - not to be interpreted as a promise that they will
always do that - for what can be relied upon a reference should
be made to the relevant standard (which can be POSIX (or IEEE
for C, or anyone else), or POSIX (etc) as amended by NetBSD, or
a NetBSD private standard for stuff that either isn't documented
by anyone else's standards doc, or where NetBSD's version has
simply decided to be different.
In cases where there really is a standard that can be referred to,
that might work, but I like to have all the information in one
place. If it's easy for the NetBSD man page to say both what's
promised, and what is actually done, then I would like it to do
so. I think that this helps both people using the interface and
people changing the implementation.

--apb (Alan Barrett)

Taylor R Campbell
2014-08-25 22:16:57 UTC
Permalink
Date: Mon, 25 Aug 2014 23:37:36 +0200
From: Joerg Sonnenberger <***@britannica.bec.de>

memcmp is only supposed to provide the correct sign, not the difference.

Our man page says difference, and has said that from the beginning.
Christos Zoulas
2014-08-26 07:38:40 UTC
Permalink
Post by Joerg Sonnenberger
Post by Taylor R Campbell
Date: Mon, 25 Aug 2014 20:02:44 +0200
Short answer: it is -- reverting external/gpl3/gcc/dist/gcc/builtins.c
from Rev. 1.3 to 1.2 brings back the old times which are the same as
they were on NetBSD 6.
Given that this test has many calls to ufs_lookup/cache_lookup using
memcmp to check for equal filenames this is not a surprise.
A rather naive "implementation" of memcmp (see below) drops the running
time from ~15 sec to ~9 secs. We should consider improving our memcmp.
Sounds reasonable to me, although it looks like GCC's old builtin
memcmp expansion actually failed to implement our specification: it
returns -1, 0, or +1, like your patch, rather than the difference of
the first differing bytes or zero as our man page specifies. For most
uses it doesn't matter, of course, but we ought to make sure to follow
our own specification.
memcmp is only supposed to provide the correct sign, not the difference.
Yes, according to TOG, not according to our documentation. Not that I advocate
to keep our documentation....

christos
Continue reading on narkive:
Loading...