Discussion:
RFC: IRQ affinity (aka interrupt routing)
Kengo NAKAHARA
2014-07-25 09:06:21 UTC
Permalink
Hi,

I implement IRQ affinity (aka interrupt routing) for amd64. Here is
the implementation.
https://github.com/knakahara/netbsd-src/compare/rfc/irq-affinity
But the UI is rough, so could you comment aboud the UI?

The implementation is consist of following three pathes:
(1) IRQ affinity implementation itself
The usage is "sysctl -w kern.cpu_affinity.irq=18:1" ("18" is
IRQ number, "1" is cpuid). This mean the IRQ 18 interrupts
route to cpuid 1 cpu core.
https://github.com/knakahara/netbsd-src/commit/2ccaf5152fe41175a9192164b939d00f957e42cb

(2) new kernfs file to show interrupt numbers per cpu
"vmstat -i" cannot show interrupt numbers per cpu. So I
implement new kernfs file to show per cpu. The usage is "cat
/kern/interrupts". For example, we can see below output
which is like linux's /proc/interrupts
====================
IRQ CPU#00 CPU#01
1 0* 0
3 0* 0
4 0* 0
6 0* 0
7 0* 0
9 0* 0
12 0* 0
14 0* 0
15 0* 0
16 108* 0
17 1193055* 0
18 680358 150*
====================
"*" mean the cpu is routed the IRQ number interrupts.
https://github.com/knakahara/netbsd-src/commit/dc7c17dd2f0cd78a03983849feb8fdd8cbe0d9c6

(3) add new API to show device name in /kern/interrupts
To use efficiently interrupt routing, device name is needed.
I implement new API borrow an idea from OpenBSD's
pci_intr_establish(). If device drivers use this new API,
/kern/interrupts can show device names as below
====================
IRQ CPU#00 CPU#01
1 0* 0 unknown
3 0* 0 unknown
4 0* 0 unknown
6 0* 0 unknown
7 0* 0 unknown
9 0* 0 unknown
12 0* 0 unknown
14 0* 0 unknown
15 0* 0 unknown
16 331* 0 unknown
17 1193149* 0 wm0
18 680358 436* vmx0
====================
Device driver can use such as "device_xname(sc->sc_dev)" as
"xname" parameter.
https://github.com/knakahara/netbsd-src/commit/97adb1f9b2eea4c95cb6b3e6688c6639c17a025d

Above implementation works well, but I think the UI is not so good.
So, I am going to implement new command intrctl(8). The usage is consist
of below two sub command.
(A) "intrctl list"
This sub command show interrupt numbers per cpu (like
/kern/interrupts).

(B) "intrctl -i 18 -c 1"
This sub command let interrupts route other cpu. "18" is IRQ
number, "1" is cpuid (like "sysctl -w kern.cpu_affinity.irq=18:1").

Could you comment which UI is good, current "sysctl and /kern/interrupts"
or new "intrctl"?

Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Martin Husemann
2014-07-25 09:15:35 UTC
Permalink
Thanks a lot for looking at it!

A few general comments:

- best UI is no UI - the kernel should distribute interrupts automatically
(if possible) as fair as possible over time doiing some statistics

- a UI to wire some device interrupts to a special CPU would be ok,
I prefer a new intrctrl for that

- vmstat -i could gain an additional column with the current target cpu
of the interrupt

- the device name is nice, but what about shared interrupts?

Martin
Kengo NAKAHARA
2014-07-25 10:15:02 UTC
Permalink
Hi martin,

Thank you very much for your comment.
Post by Martin Husemann
- best UI is no UI - the kernel should distribute interrupts automatically
(if possible) as fair as possible over time doiing some statistics
I agree the computer should distribute interrupts automatically, but
I think balancing interrupts is too complex for the kernel. So I think
the balancing should be done by the userland daemon which use the UI.
Implementing and tuning the userland daemon are future works.
Post by Martin Husemann
- a UI to wire some device interrupts to a special CPU would be ok,
I prefer a new intrctrl for that
- vmstat -i could gain an additional column with the current target cpu
of the interrupt
I am afraid of breaking backward compatibility, so I avoid to change
existing commands.
Post by Martin Husemann
- the device name is nice, but what about shared interrupts?
I forgot shared interrupts... In the current implement, the device name
is overwritten by the device established later. Of course this is ugly,
so I must fix it.


Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
David Young
2014-07-25 16:41:03 UTC
Permalink
Post by Kengo NAKAHARA
Hi martin,
Thank you very much for your comment.
Post by Martin Husemann
- best UI is no UI - the kernel should distribute interrupts automatically
(if possible) as fair as possible over time doiing some statistics
I agree the computer should distribute interrupts automatically, but
I think balancing interrupts is too complex for the kernel. So I think
the balancing should be done by the userland daemon which use the UI.
Implementing and tuning the userland daemon are future works.
What's the goal of balancing interrupts? Controlling latency? That's
important, but it seems like other considerations might apply. For
example, funneling all interrupts to one core might allow the other
cores to idle in a power-saving state. Also, it might help to avoid
cacheline motion for two interrupts involved in the same network flow to
fire on the same CPU.

Can you explain why you think that balancing interrupts is too complex
for the kernel? I would not necessarily disagree, but it seems like
the kernel has the most immediate access to the relevant interrupt
statistics *and* the responsibility for CPU scheduling, so it's in a
pretty good position to react to imbalances.

Dave
Post by Kengo NAKAHARA
Post by Martin Husemann
- a UI to wire some device interrupts to a special CPU would be ok,
I prefer a new intrctrl for that
- vmstat -i could gain an additional column with the current target cpu
of the interrupt
I am afraid of breaking backward compatibility, so I avoid to change
existing commands.
Post by Martin Husemann
- the device name is nice, but what about shared interrupts?
I forgot shared interrupts... In the current implement, the device name
is overwritten by the device established later. Of course this is ugly,
so I must fix it.
Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.
Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit
--
David Young
***@pobox.com Urbana, IL (217) 721-9981
Kengo NAKAHARA
2014-07-28 03:20:07 UTC
Permalink
Hi dyoung,
Post by David Young
Post by Kengo NAKAHARA
Hi martin,
Thank you very much for your comment.
Post by Martin Husemann
- best UI is no UI - the kernel should distribute interrupts automatically
(if possible) as fair as possible over time doiing some statistics
I agree the computer should distribute interrupts automatically, but
I think balancing interrupts is too complex for the kernel. So I think
the balancing should be done by the userland daemon which use the UI.
Implementing and tuning the userland daemon are future works.
What's the goal of balancing interrupts? Controlling latency? That's
important, but it seems like other considerations might apply. For
example, funneling all interrupts to one core might allow the other
cores to idle in a power-saving state. Also, it might help to avoid
cacheline motion for two interrupts involved in the same network flow to
fire on the same CPU.
To be honest, I don't decide exactly the final goal yet, but I decide
the first milestone is improving the interrupt response in SMP systems
to reduce dropped packets.
# additionally, one of my motivation is helping tests of MPSAFE L2
# networking. See http://mail-index.netbsd.org/tech-kern/2014/06/03/msg017190.html
As you point out, there are the cases which all interrupts should be
funneled to one core. Whereas there are the cases which interrups
should be distributed, such as heavy workload servers and routers.
So, the system shoud have options for the administrators to select
how to balance interrupts.
Post by David Young
Can you explain why you think that balancing interrupts is too complex
for the kernel? I would not necessarily disagree, but it seems like
the kernel has the most immediate access to the relevant interrupt
statistics *and* the responsibility for CPU scheduling, so it's in a
pretty good position to react to imbalances.
I think analyzing interrupt rates, deciding which IRQ's interrupts
shuld be moved and deciding which CPU is assigned the moved interrupts
are complex. As you point out, the kernel is good position to take
statistics. However, the tasks after taking statistics would take
long time. Those tasks are calculating which interrupts increase or
decrease in the previous seconds(, milliseconds, or minutes) and
deciding which cpu is the most "appropriate".
# The cores in a power-saving state may be not appropriate, and two
# interrupts involved in the same network are appropriate to route
# the same CPU, you know :)


Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Thor Lancelot Simon
2014-07-28 03:49:02 UTC
Permalink
Post by Kengo NAKAHARA
I think analyzing interrupt rates, deciding which IRQ's interrupts
shuld be moved and deciding which CPU is assigned the moved interrupts
are complex.
I think this may prove counterproductive. For one reason, it may cause
harmful cache behavior as the datastructures associated with the same
network flow move from one CPU to another as packets are received on
different interfaces for the transmit and receive directions.

I think you really don't want to do anything that could impede the use
of hardware classifiers to put the packets for the same flow on the
same CPUs' queues -- and balancing interrupts in userspace rather than in
the kernel seems to me likely to do just that.

Thor
Kengo NAKAHARA
2014-07-28 08:55:31 UTC
Permalink
Hi Thor,

Fist, I am *not* going to implement automatic balancing interrupts
in the near future. As you point out, it must require more discussion
about automatic balancing interrupts. However, manual routing
interrupts must be needed, such as separating HDD interrupts' CPU
from NIC interrupts' CPU, eliminating some CPUs from CPUs assigned
interrupts to reserve for user processes, and so on. In these cases,
it is no need to balance interrupts automatically, it is enogh to
set affinity by administrators at the boot time or the rare
maintenance time. So, I want to implement manually routing interrupts
at the first step for these usage.
Post by Thor Lancelot Simon
Post by Kengo NAKAHARA
I think analyzing interrupt rates, deciding which IRQ's interrupts
shuld be moved and deciding which CPU is assigned the moved interrupts
are complex.
I think this may prove counterproductive. For one reason, it may cause
harmful cache behavior as the datastructures associated with the same
network flow move from one CPU to another as packets are received on
different interfaces for the transmit and receive directions.
Yes, there are such cases which interrupts shold be not moved, so balancing
interrupts is too complex. I ought to have said full automatic balancing
interrupts daemon is far-off future work. In the immediate future, I want to
route interrupts occasionally by hand.
Post by Thor Lancelot Simon
I think you really don't want to do anything that could impede the use
of hardware classifiers to put the packets for the same flow on the
same CPUs' queues -- and balancing interrupts in userspace rather than in
the kernel seems to me likely to do just that.
If NetBSD support multiqueue (and MSI-X), it may be correct for the NIC
supporting multiqueue.


Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Justin Cormack
2014-07-25 10:22:59 UTC
Permalink
Post by Kengo NAKAHARA
But the UI is rough, so could you comment aboud the UI?
(1) IRQ affinity implementation itself
The usage is "sysctl -w kern.cpu_affinity.irq=18:1" ("18" is
IRQ number, "1" is cpuid). This mean the IRQ 18 interrupts
route to cpuid 1 cpu core.
I would think that
sysctl -w kern.cpu_affinity.irq.18=1
would be much more natural as an interface.

Justin
Mindaugas Rasiukevicius
2014-07-26 16:18:30 UTC
Permalink
Post by Kengo NAKAHARA
<...>
(2) new kernfs file to show interrupt numbers per cpu
"vmstat -i" cannot show interrupt numbers per cpu. So I
implement new kernfs file to show per cpu. The usage is "cat
/kern/interrupts". For example, we can see below output
which is like linux's /proc/interrupts
====================
IRQ CPU#00 CPU#01
1 0* 0
3 0* 0
4 0* 0
6 0* 0
7 0* 0
9 0* 0
12 0* 0
14 0* 0
15 0* 0
16 108* 0
17 1193055* 0
18 680358 150*
====================
"*" mean the cpu is routed the IRQ number interrupts.
https://github.com/knakahara/netbsd-src/commit/dc7c17dd2f0cd78a03983849feb8fdd8cbe0d9c6
Agree with Martin that vmstat -i could grow per-CPU counters, but it might
be a bit more work than you want to do. So, up to you. :)
Post by Kengo NAKAHARA
<...>
Above implementation works well, but I think the UI is not so good.
So, I am going to implement new command intrctl(8). The usage is consist
of below two sub command.
(A) "intrctl list"
This sub command show interrupt numbers per cpu (like
/kern/interrupts).
(B) "intrctl -i 18 -c 1"
This sub command let interrupts route other cpu. "18" is IRQ
number, "1" is cpuid (like "sysctl -w kern.cpu_affinity.irq=18:1").
Could you comment which UI is good, current "sysctl and /kern/interrupts"
or new "intrctl"?
I do not really like /kern/interrupts. intrctl(8) seems like a good
interface. Note that cpuctl(8) has intr/nointr commands which are not
enabled at this moment. They should move to intrctl(8).
--
Mindaugas
Kengo NAKAHARA
2014-07-28 04:25:30 UTC
Permalink
Hi rmind,
Post by Mindaugas Rasiukevicius
Post by Kengo NAKAHARA
<...>
(2) new kernfs file to show interrupt numbers per cpu
"vmstat -i" cannot show interrupt numbers per cpu. So I
implement new kernfs file to show per cpu. The usage is "cat
/kern/interrupts". For example, we can see below output
which is like linux's /proc/interrupts
====================
IRQ CPU#00 CPU#01
1 0* 0
3 0* 0
4 0* 0
6 0* 0
7 0* 0
9 0* 0
12 0* 0
14 0* 0
15 0* 0
16 108* 0
17 1193055* 0
18 680358 150*
====================
"*" mean the cpu is routed the IRQ number interrupts.
https://github.com/knakahara/netbsd-src/commit/dc7c17dd2f0cd78a03983849feb8fdd8cbe0d9c6
Agree with Martin that vmstat -i could grow per-CPU counters, but it might
be a bit more work than you want to do. So, up to you. :)
I consider vmstat -i grow per-CPU counters after implementing intrctl(8).
It is no problem to change the output format of vmstat -i, doesn't it?
Post by Mindaugas Rasiukevicius
Post by Kengo NAKAHARA
<...>
Above implementation works well, but I think the UI is not so good.
So, I am going to implement new command intrctl(8). The usage is consist
of below two sub command.
(A) "intrctl list"
This sub command show interrupt numbers per cpu (like
/kern/interrupts).
(B) "intrctl -i 18 -c 1"
This sub command let interrupts route other cpu. "18" is IRQ
number, "1" is cpuid (like "sysctl -w kern.cpu_affinity.irq=18:1").
Could you comment which UI is good, current "sysctl and /kern/interrupts"
or new "intrctl"?
I do not really like /kern/interrupts. intrctl(8) seems like a good
interface. Note that cpuctl(8) has intr/nointr commands which are not
enabled at this moment. They should move to intrctl(8).
intrctl(8) gain agreements with two developers, and It seems to gain more
agreements, maybe. So I approach to implement intrctl(8). Yes, you told
me the cpuctl(8) sub command at previous meeting :) I move cpuctl intr/
nointr sub comamnd to intrctl(8).


Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Kengo NAKAHARA
2014-08-20 09:06:21 UTC
Permalink
Hi,
Post by Kengo NAKAHARA
I implement IRQ affinity (aka interrupt routing) for amd64. Here is
the implementation.
https://github.com/knakahara/netbsd-src/compare/rfc/irq-affinity
But the UI is rough, so could you comment aboud the UI?
Above implementation works well, but I think the UI is not so good.
So, I am going to implement new command intrctl(8). The usage is consist
of below two sub command.
(A) "intrctl list"
This sub command show interrupt numbers per cpu (like
/kern/interrupts).
(B) "intrctl -i 18 -c 1"
This sub command let interrupts route other cpu. "18" is IRQ
number, "1" is cpuid (like "sysctl -w kern.cpu_affinity.irq=18:1").
I implement "intrctl" for amd64 and i386. Here is the implementation,
https://github.com/knakahara/netbsd-src/compare/rfc/irq-affinity2
and here is the patch
http://knakahara.github.io/patches/netbsd/irq-affinity-initctl.patch

"intrctl" has 4 subcommand,
- list
for each IRQ in the system, display interrupt counts per CPU.
- affinity -c "cpuid" -i "irq"
set affinity "irq" interrupt to "cpuid".
- intr -c "cpuid"
enable interrupts to set affinity to cpuid.
- nointr -c "cpuid"
disable interrupts to set affinity to cpuid.

For example, some results of my rangeley machine
====================
# intrctl list
IRQ CPU#00(+) CPU#02(+) CPU#04(+) CPU#06(+) CPU#08(+) CPU#10(+) CPU#12(+) CPU#14(+)
3 0* 0 0 0 0 0 0 0 unknown
4 0* 0 0 0 0 0 0 0 unknown
9 0* 0 0 0 0 0 0 0 unknown
18 0* 0 0 0 0 0 0 0 unknown
19 1765* 0 0 0 0 0 0 0 unknown, unknown
20 0* 0 0 0 0 0 0 0 unknown
21 0* 0 0 0 0 0 0 0 unknown
22 0* 0 0 0 0 0 0 0 unknown
23 1022* 0 0 0 0 0 0 0 unknown, unknown
====================
"(+)" mean the CPU accept interrupts. "*" mean the interrupts is set
affinity to the CPU.
"unknown" will be device name if device drivers support showing name.
"unknown, unknown" mean 2 devices share the irq.

====================
# intrctl nointr -c 2
# intrctl list
IRQ CPU#00(+) CPU#02(-) CPU#04(+) CPU#06(+) CPU#08(+) CPU#10(+) CPU#12(+) CPU#14(+)
3 0* 0 0 0 0 0 0 0 unknown
4 0* 0 0 0 0 0 0 0 unknown
9 0* 0 0 0 0 0 0 0 unknown
18 0* 0 0 0 0 0 0 0 unknown
19 1790* 0 0 0 0 0 0 0 unknown, unknown
20 0* 0 0 0 0 0 0 0 unknown
21 0* 0 0 0 0 0 0 0 unknown
22 0* 0 0 0 0 0 0 0 unknown
23 1449* 0 0 0 0 0 0 0 unknown, unknown
====================
"(-)" mean the CPU doesn't accept interrupts.

====================
# intrctl nointr -c 0
# intrctl list
IRQ CPU#00(-) CPU#02(-) CPU#04(+) CPU#06(+) CPU#08(+) CPU#10(+) CPU#12(+) CPU#14(+)
3 0 0 0* 0 0 0 0 0 unknown
4 0 0 0* 0 0 0 0 0 unknown
9 0 0 0* 0 0 0 0 0 unknown
18 0 0 0* 0 0 0 0 0 unknown
19 1804 0 5* 0 0 0 0 0 unknown, unknown
20 0 0 0* 0 0 0 0 0 unknown
21 0 0 0* 0 0 0 0 0 unknown
22 0 0 0* 0 0 0 0 0 unknown
23 1507 0 76* 0 0 0 0 0 unknown, unknown
====================
"CPU#00" doesn't accept interrupts, so all interrupt move other cpu which
accept interrupts. "CPU#00"'s IRQ 19 count "1804" mean number of interrupts
which have occured from boot time. The count doesn't increment unless set
affinity the interrupt again.

====================
# intrctl affinity -c 0 -i 23
intrctl: IOC_INTR_AFFINITY: Invalid argument

# intrctl affinity -c 8 -i 23
# intrctl list
IRQ CPU#00(-) CPU#02(-) CPU#04(+) CPU#06(+) CPU#08(+) CPU#10(+) CPU#12(+) CPU#14(+)
3 0 0 0* 0 0 0 0 0 unknown
4 0 0 0* 0 0 0 0 0 unknown
9 0 0 0* 0 0 0 0 0 unknown
18 0 0 0* 0 0 0 0 0 unknown
19 1804 0 21* 0 0 0 0 0 unknown, unknown
20 0 0 0* 0 0 0 0 0 unknown
21 0 0 0* 0 0 0 0 0 unknown
22 0 0 0* 0 0 0 0 0 unknown
23 1507 0 308 0 17* 0 0 0 unknown, unknown
====================
"CPU#00" doesn't accept interrupts, so "intrctl affinity -c 0 -i 23" failed.
Whereas "CPU#08" accept interrupts, so "intrctl affinity -c 8 -i 23" success
and the interrupt move to "CPU#08".

Could you comment the specification and implementation?

Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Kengo NAKAHARA
2014-08-20 09:20:19 UTC
Permalink
Sorry, I typo the patch URL.
Post by Kengo NAKAHARA
and here is the patch
http://knakahara.github.io/patches/netbsd/irq-affinity-initctl.patch
http://knakahara.github.io/patches/netbsd/irq-affinity-intrctl.patch
^

Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Mindaugas Rasiukevicius
2014-08-25 20:15:57 UTC
Permalink
Post by Kengo NAKAHARA
Sorry, I typo the patch URL.
Post by Kengo NAKAHARA
and here is the patch
http://knakahara.github.io/patches/netbsd/irq-affinity-initctl.patch
http://knakahara.github.io/patches/netbsd/irq-affinity-intrctl.patch
Have to admit that I did not read the patch carefully, but why
io_interrupt_sources_lock is __cpu_simple_lock? Why not to re-use
cpu_lock? The locking itself does not seem to be correct either.

How much of the IRQ affinity code (in x86/intr.c) is actually MD?
It seems that a lot of that can be made MI (think of kern/subr_intr.c).

Also, please do not forget to add the BSD license text for the newly
created files.
--
Mindaugas
Kengo NAKAHARA
2014-08-27 00:58:30 UTC
Permalink
Hi,

Thank you for reviewing.
Post by Mindaugas Rasiukevicius
Post by Kengo NAKAHARA
Sorry, I typo the patch URL.
Post by Kengo NAKAHARA
and here is the patch
http://knakahara.github.io/patches/netbsd/irq-affinity-initctl.patch
http://knakahara.github.io/patches/netbsd/irq-affinity-intrctl.patch
Have to admit that I did not read the patch carefully, but why
io_interrupt_sources_lock is __cpu_simple_lock? Why not to re-use
cpu_lock? The locking itself does not seem to be correct either.
Because I wanted to avoid lock contention between IRQ affinity
and process affinity (in paticular sys__sched_setaffinity() in
sys/kern/sys_sched.c), but now I find it is a wrong idea.
I should delete __cpu_simple_lock and modify to re-use cpu_lock.
Post by Mindaugas Rasiukevicius
How much of the IRQ affinity code (in x86/intr.c) is actually MD?
It seems that a lot of that can be made MI (think of kern/subr_intr.c).
I think MI part of IRQ affinity code is not so much, however MI code
surely exists. So, I divide MD part from MI code as much as possible,
and then I move MI code to kern/subr_intr.c.
Post by Mindaugas Rasiukevicius
Also, please do not forget to add the BSD license text for the newly
created files.
Yes, I add the BSD license text.

Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Matt Thomas
2014-08-27 01:09:00 UTC
Permalink
As I've been reading this discussion, it seems very x86 centric.

I've thinking about adding

void intr_distribute(void *ih, const kcpuset_t *newset, kcpuset_t *oldset)

for my ports that can do MP. This could be used to obtain the current
set of cpus setup to receive interrupt for <ih> or set a new sets of
cpus. To set an interrupt across all CPUs, you could use

intr_distribibute(ih, &kcpuset_running, NULL);

By default only the boot CPU would be setup to get interrupts when
the interrupt was established.
Kengo NAKAHARA
2014-08-27 06:16:35 UTC
Permalink
Hi,

Thank you for your idea.
Post by Matt Thomas
As I've been reading this discussion, it seems very x86 centric.
I've thinking about adding
void intr_distribute(void *ih, const kcpuset_t *newset, kcpuset_t *oldset)
for my ports that can do MP. This could be used to obtain the current
set of cpus setup to receive interrupt for <ih> or set a new sets of
cpus. To set an interrupt across all CPUs, you could use
intr_distribibute(ih, &kcpuset_running, NULL);
By default only the boot CPU would be setup to get interrupts when
the interrupt was established.
It seems good, except return value. IRQ affinity may fail (e.g. when
all cpus are set "nointr" flag), so return value should not be void.
I use the API in MI code like this,
====================
intrctl_ioctl(..., void *data, ...)
{
switch(cmd) {
case IOC_INTR_AFFINITY:
ih = intr_handler(data->intrid);
if (ih == NULL )
return EINVAL;

kcpuset_create(&intr_cpuset, true);
kcpuset_set(intr_cpuset, data->cpuid);
error = intr_distribute(ih, intr_cpuset, NULL);
break;
}

return error;
}
====================
Could you comment this design?


BTW, how do you think about MSI/MSI-X proposal?
- yours
http://mail-index.netbsd.org/tech-kern/2011/08/05/msg011130.html
- dyoung's
http://mail-index.netbsd.org/tech-kern/2014/06/06/msg017209.html
- mine
http://mail-index.netbsd.org/tech-kern/2014/07/10/msg017336.html

Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Matt Thomas
2014-08-27 06:50:05 UTC
Permalink
Post by Kengo NAKAHARA
It seems good, except return value. IRQ affinity may fail (e.g. when
all cpus are set "nointr" flag), so return value should not be void.
then we should have a kcpuset_interruptable which is kcpuset_running
minus those cpus which have nointr.

we also need a callback to the interrupt subsystem when intr changes
on a cpu.
Kengo NAKAHARA
2014-08-28 01:34:30 UTC
Permalink
Hi,
Post by Matt Thomas
Post by Kengo NAKAHARA
It seems good, except return value. IRQ affinity may fail (e.g. when
all cpus are set "nointr" flag), so return value should not be void.
then we should have a kcpuset_interruptable which is kcpuset_running
minus those cpus which have nointr.
we also need a callback to the interrupt subsystem when intr changes
on a cpu.
Abobe behavior is one of examples. Other examples are below
- the unsetting cpu has pending interrupts
- the peripheral (e.g. i8259) does not support moving interrupts
to other cpu
- the cpu which will be moved interrupts causes lack of resources
- it cannot get lock of the peripheral

Therefore, in the architectures other than x86, IRQ affinity may fail
for other reasons. So, I think IRQ affinity should fail simply and
return reasonable error code.

Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Mindaugas Rasiukevicius
2014-08-28 21:42:46 UTC
Permalink
Post by Matt Thomas
As I've been reading this discussion, it seems very x86 centric.
I've thinking about adding
void intr_distribute(void *ih, const kcpuset_t *newset, kcpuset_t *oldset)
Agree. And this would be MI interface. Just as Kengo-san already
pointed out, MD may want to return ENOTSUP or whatever else.
--
Mindaugas
Martin Husemann
2014-08-20 09:22:39 UTC
Permalink
Post by Kengo NAKAHARA
# intrctl list
IRQ CPU#00(+) CPU#02(+) CPU#04(+) CPU#06(+)
3 0* 0 0 0
What are you using to identify an interrupt? I.e. is the -i argument
value a generic string (with machine dependend content)?

Martin
Kengo NAKAHARA
2014-08-20 09:38:35 UTC
Permalink
Hi,
Post by Martin Husemann
Post by Kengo NAKAHARA
# intrctl list
IRQ CPU#00(+) CPU#02(+) CPU#04(+) CPU#06(+)
3 0* 0 0 0
What are you using to identify an interrupt? I.e. is the -i argument
value a generic string (with machine dependend content)?
I use IRQ number to idetify interrupts. The number is shown "IRQ" column.
Actually, the value equals to struct intrsource.is_pin.

# BTW, I want to use IRQ numbers of more than 256 for MSI/MSI-X, like FreeBSD.
# It can set affinity MSI/MSI-X.

Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Martin Husemann
2014-08-20 09:46:44 UTC
Permalink
Post by Kengo NAKAHARA
I use IRQ number to idetify interrupts. The number is shown "IRQ" column.
Actually, the value equals to struct intrsource.is_pin.
Isn't a string (like in vmstat -i output) better?

I am not sure we have a globally unique number available on all architectures
to describe any interrupt.

Martin
Kengo NAKAHARA
2014-08-21 01:50:53 UTC
Permalink
Hi,
Post by Martin Husemann
Post by Kengo NAKAHARA
I use IRQ number to idetify interrupts. The number is shown "IRQ" column.
Actually, the value equals to struct intrsource.is_pin.
Isn't a string (like in vmstat -i output) better?
I think IRQ number is better. Because it is short :)
Therefore, vmstat -iv output is differ little from IRQ number. For example,
here is the output of my machine
====================
interrupt total rate
TLB shootdown 732 0
cpu0 timer 151842 100
ioapic0 pin 9 0 0
ioapic0 pin 20 0 0
ioapic0 pin 21 0 0
ioapic0 pin 22 0 0
ioapic0 pin 23 1816 1
ioapic0 pin 19 1807 1
ioapic0 pin 18 0 0
ioapic0 pin 4 0 0
ioapic0 pin 3 0 0
Total 156197 103
====================
Currently, intrctl list does not show "TLB shootdown" and "cpu0 timer".
Furthermore, "ioapic0 pin 9" of vmstat -iv equals to "9" of intrctl list,
"ioapic0 pin 20" of vmstat -iv equals to "20" of intrctl list, and so on.

I think the problem of both vmstat -i output and IRQ number is that we cannot
immediately understand what device relates the line. So, intrctl list support
showing device name. For example, output of my machine with this patch is here.
http://knakahara.github.io/patches/netbsd/support-rangeley-devices.patch
====================
IRQ CPU#00(+) CPU#02(+) CPU#04(+) CPU#06(+) CPU#08(+) CPU#10(+) CPU#12(+) CPU#14(+)
3 0* 0 0 0 0 0 0 0 unknown
4 0* 0 0 0 0 0 0 0 unknown
9 0* 0 0 0 0 0 0 0 unknown
18 0* 0 0 0 0 0 0 0 ichsmb0
19 1777* 0 0 0 0 0 0 0 ahcisata0, ahcisata1
20 0* 0 0 0 0 0 0 0 wm0
21 0* 0 0 0 0 0 0 0 wm1
22 0* 0 0 0 0 0 0 0 wm2
23 1525* 0 0 0 0 0 0 0 wm3, ehci0
====================
There is "unknown" devices yet, but it can show the devices name if the device
drivers use pci_intr_establish_xname() instead of pci_intr_establish().

I think it is easy to set affinity by IRQ number with above output.
Post by Martin Husemann
I am not sure we have a globally unique number available on all architectures
to describe any interrupt.
I am not sure too. If the architecture does not have a unique number to
describe interrupts, it may be difficult for the architecture to support
thisfeature. I do not think about the architectures other than x86 yet.
# I suppose SPARC can support this feature probably, but I don't know
# interrupts mechanism of other achitectures...

Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Martin Husemann
2014-08-21 06:30:01 UTC
Permalink
Post by Kengo NAKAHARA
I think it is easy to set affinity by IRQ number with above output.
Yes, and would be even better if the device/interrupt name would be
allowed as well (intrctl could internally translate it)
Post by Kengo NAKAHARA
Post by Martin Husemann
I am not sure we have a globally unique number available on all architectures
to describe any interrupt.
I am not sure too. If the architecture does not have a unique number to
describe interrupts, it may be difficult for the architecture to support
thisfeature. I do not think about the architectures other than x86 yet.
I suppose everyone has numbers at some level, but it may be a tree, or at
least multiple PICs at the root. If we need numbers, we can assign them
uniquely via a dfs traversal or similar (or by assigning an incremental
one whenever the relevant *_intr_establish() function is called).

However, I am not sure this is the best aproach. We have made everything
else properly machine independend, see for example pci_intr_establish, the
MD definition of pci_intr_handle_t and pci_intr_string. I think we should
use similar techniques here.

Martin
Kengo NAKAHARA
2014-08-21 10:00:15 UTC
Permalink
Hi,
Post by Martin Husemann
Post by Kengo NAKAHARA
I think it is easy to set affinity by IRQ number with above output.
Yes, and would be even better if the device/interrupt name would be
allowed as well (intrctl could internally translate it)
I see, it is good. I will support it future. :)
Post by Martin Husemann
Post by Kengo NAKAHARA
Post by Martin Husemann
I am not sure we have a globally unique number available on all architectures
to describe any interrupt.
I am not sure too. If the architecture does not have a unique number to
describe interrupts, it may be difficult for the architecture to support
thisfeature. I do not think about the architectures other than x86 yet.
I suppose everyone has numbers at some level, but it may be a tree, or at
least multiple PICs at the root. If we need numbers, we can assign them
uniquely via a dfs traversal or similar (or by assigning an incremental
one whenever the relevant *_intr_establish() function is called).
However, I am not sure this is the best aproach. We have made everything
else properly machine independend, see for example pci_intr_establish, the
MD definition of pci_intr_handle_t and pci_intr_string. I think we should
use similar techniques here.
Thank you for information and idea. When it support multi architecture,
it must use abstruct identifiers of interrupt like pci_intr_handle_t.
It may be good to use the device name as the identifier. It it not good
to implement different userland option for each architectures.

At any rate, I need to consider a way to make the interface MD. I cannot
implement IRQ affinity for the architectures other than x86 immediately,
but I try to design the interface to be MD.

Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Manuel Bouyer
2014-08-21 08:05:47 UTC
Permalink
Post by Kengo NAKAHARA
Hi,
Post by Martin Husemann
Post by Kengo NAKAHARA
I use IRQ number to idetify interrupts. The number is shown "IRQ" column.
Actually, the value equals to struct intrsource.is_pin.
Isn't a string (like in vmstat -i output) better?
I think IRQ number is better. Because it is short :)
Therefore, vmstat -iv output is differ little from IRQ number. For example,
here is the output of my machine
====================
interrupt total rate
TLB shootdown 732 0
cpu0 timer 151842 100
ioapic0 pin 9 0 0
ioapic0 pin 20 0 0
ioapic0 pin 21 0 0
ioapic0 pin 22 0 0
ioapic0 pin 23 1816 1
ioapic0 pin 19 1807 1
ioapic0 pin 18 0 0
ioapic0 pin 4 0 0
ioapic0 pin 3 0 0
Total 156197 103
====================
Currently, intrctl list does not show "TLB shootdown" and "cpu0 timer".
Furthermore, "ioapic0 pin 9" of vmstat -iv equals to "9" of intrctl list,
"ioapic0 pin 20" of vmstat -iv equals to "20" of intrctl list, and so on.
On a x86 server here:
interrupt total rate
TLB shootdown 788900019 295
cpu0 timer 265870470 99
ioapic0 pin 9 0 0
ioapic0 pin 16 1491906220 557
ioapic1 pin 14 153885893 57
ioapic0 pin 17 2731 0
ioapic3 pin 0 13 0
ioapic3 pin 1 14 0
ioapic0 pin 21 789 0
ioapic0 pin 20 15285596 5
ioapic0 pin 14 1 0
ioapic0 pin 4 0 0
ioapic0 pin 3 353 0
Total 2715852099 1015

Notice we have both 'ioapic0 pin 14' and 'ioapic1 pin 14',
what number would be used for intrctl in this case ?
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
Kengo NAKAHARA
2014-08-21 10:35:21 UTC
Permalink
Hi,
Post by Kengo NAKAHARA
interrupt total rate
TLB shootdown 788900019 295
cpu0 timer 265870470 99
ioapic0 pin 9 0 0
ioapic0 pin 16 1491906220 557
ioapic1 pin 14 153885893 57
ioapic0 pin 17 2731 0
ioapic3 pin 0 13 0
ioapic3 pin 1 14 0
ioapic0 pin 21 789 0
ioapic0 pin 20 15285596 5
ioapic0 pin 14 1 0
ioapic0 pin 4 0 0
ioapic0 pin 3 353 0
Total 2715852099 1015
Notice we have both 'ioapic0 pin 14' and 'ioapic1 pin 14',
what number would be used for intrctl in this case ?
Currently intrctl use "14" for both 'ioapic0' and 'ioapic1', this is
a bug...
I fix it as intrctl use "14" for 'ioapic0' on the other hand "38" for
'ioapic1'. "38" for 'ioapic1' is defined following
14 (pin number) + 24 * 1 (offset for secondary ioapic)
# in the case of 'ioapic3 pin 1', 1 + 24 * 3 = 73 is used by intrctl
An ioapic has 24 pins, so I think adding 24 can avoid duplication. However
it may be not so good. I consider a little more about this implementation.

By the way, I haven't seen ioapic*3*. Could you tell me your machine's
model number?

Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Manuel Bouyer
2014-08-21 11:07:11 UTC
Permalink
Post by Kengo NAKAHARA
Hi,
Post by Kengo NAKAHARA
interrupt total rate
TLB shootdown 788900019 295
cpu0 timer 265870470 99
ioapic0 pin 9 0 0
ioapic0 pin 16 1491906220 557
ioapic1 pin 14 153885893 57
ioapic0 pin 17 2731 0
ioapic3 pin 0 13 0
ioapic3 pin 1 14 0
ioapic0 pin 21 789 0
ioapic0 pin 20 15285596 5
ioapic0 pin 14 1 0
ioapic0 pin 4 0 0
ioapic0 pin 3 353 0
Total 2715852099 1015
Notice we have both 'ioapic0 pin 14' and 'ioapic1 pin 14',
what number would be used for intrctl in this case ?
Currently intrctl use "14" for both 'ioapic0' and 'ioapic1', this is
a bug...
I fix it as intrctl use "14" for 'ioapic0' on the other hand "38" for
'ioapic1'. "38" for 'ioapic1' is defined following
14 (pin number) + 24 * 1 (offset for secondary ioapic)
# in the case of 'ioapic3 pin 1', 1 + 24 * 3 = 73 is used by intrctl
An ioapic has 24 pins, so I think adding 24 can avoid duplication. However
it may be not so good. I consider a little more about this implementation.
yes, please. Using this IRQ number is not straitforward and error-prone.
Using the same values as vmstat -i would make it much easier.
Post by Kengo NAKAHARA
By the way, I haven't seen ioapic*3*. Could you tell me your machine's
model number?
It's a DELL poweredge 2950:
NetBSD 6.0_STABLE (SWING) #1: Thu Jan 10 13:23:17 MET 2013
***@disco.soc.lip6.fr:/home/bouyer/tmp/i386/obj/home/bouyer/src-6/src/sys/arch/i386/compile/SWING
total memory = 3327 MB
avail memory = 3265 MB
timecounter: Timecounters tick every 10.000 msec
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
Dell Inc. PowerEdge 2950
mainbus0 (root)
cpu0 at mainbus0 apid 0: Intel(R) Xeon(R) CPU 5140 @ 2.33GHz, id 0x6f6
cpu1 at mainbus0 apid 6: Intel(R) Xeon(R) CPU 5140 @ 2.33GHz, id 0x6f6
cpu2 at mainbus0 apid 1: Intel(R) Xeon(R) CPU 5140 @ 2.33GHz, id 0x6f6
cpu3 at mainbus0 apid 7: Intel(R) Xeon(R) CPU 5140 @ 2.33GHz, id 0x6f6
ioapic0 at mainbus0 apid 8: pa 0xfec00000, version 20, 24 pins
ioapic1 at mainbus0 apid 9: pa 0xfec81000, version 20, 24 pins
ioapic2 at mainbus0 apid 10: pa 0xfec84000, version 20, 24 pins
ioapic3 at mainbus0 apid 11: pa 0xfec84800, version 20, 24 pins
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
Kengo NAKAHARA
2014-08-22 02:53:16 UTC
Permalink
Hi,
Post by Manuel Bouyer
Post by Kengo NAKAHARA
Currently intrctl use "14" for both 'ioapic0' and 'ioapic1', this is
a bug...
I fix it as intrctl use "14" for 'ioapic0' on the other hand "38" for
'ioapic1'. "38" for 'ioapic1' is defined following
14 (pin number) + 24 * 1 (offset for secondary ioapic)
# in the case of 'ioapic3 pin 1', 1 + 24 * 3 = 73 is used by intrctl
An ioapic has 24 pins, so I think adding 24 can avoid duplication. However
it may be not so good. I consider a little more about this implementation.
yes, please. Using this IRQ number is not straitforward and error-prone.
Using the same values as vmstat -i would make it much easier.
To reconsider this matter, I think intrctl should use interrupt names like
"ioapic1 pin 14". So, I change the usage of "intrctl affinity" to following
intrctl affinity -i "interrupt_name" -c "cpuid"
e.g. # intrctl affinity -c 0 -i "ioapic1 pin 14"

In addition, "intrctl list" show like following
====================
intr name CPU#00(+) CPU#02(+)
ioapic0 pin 19 1777* 0 ahcisata0
ioapic0 pin 20 0* 0 wm0
ioapic1 pin 14 0* 0 wm1
====================

Could you comment this specification?
As you indicated, it is better to use interrupt names :)
I think using string let intrctl interface be MI. Of course, each
architectures must implement parsing interrupt name, but I think
the implementation is not so hard.

Therefore, I also consider about using device names. I think device names
is more MI than interrupt names. I plan to support using device names.
Post by Manuel Bouyer
Post by Kengo NAKAHARA
By the way, I haven't seen ioapic*3*. Could you tell me your machine's
model number?
NetBSD 6.0_STABLE (SWING) #1: Thu Jan 10 13:23:17 MET 2013
Thank you for information. I try to get or borrow it.


Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Manuel Bouyer
2014-08-22 07:29:56 UTC
Permalink
Post by Kengo NAKAHARA
[...]
To reconsider this matter, I think intrctl should use interrupt names like
"ioapic1 pin 14". So, I change the usage of "intrctl affinity" to following
intrctl affinity -i "interrupt_name" -c "cpuid"
e.g. # intrctl affinity -c 0 -i "ioapic1 pin 14"
In addition, "intrctl list" show like following
====================
intr name CPU#00(+) CPU#02(+)
ioapic0 pin 19 1777* 0 ahcisata0
ioapic0 pin 20 0* 0 wm0
ioapic1 pin 14 0* 0 wm1
====================
Could you comment this specification?
this looks better, thanks !
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
Thor Lancelot Simon
2014-08-21 14:36:34 UTC
Permalink
Post by Kengo NAKAHARA
Currently intrctl use "14" for both 'ioapic0' and 'ioapic1', this is
a bug...
I fix it as intrctl use "14" for 'ioapic0' on the other hand "38" for
'ioapic1'. "38" for 'ioapic1' is defined following
I think this is extremely counterintuitive. I think strings should be
used in this interface.

Thor
Kengo NAKAHARA
2014-08-22 02:59:57 UTC
Permalink
Hi,
Post by Thor Lancelot Simon
Post by Kengo NAKAHARA
Currently intrctl use "14" for both 'ioapic0' and 'ioapic1', this is
a bug...
I fix it as intrctl use "14" for 'ioapic0' on the other hand "38" for
'ioapic1'. "38" for 'ioapic1' is defined following
I think this is extremely counterintuitive. I think strings should be
used in this interface.
I reconsider this matter, I also think intrctl should use strings.
Please see
http://mail-index.netbsd.org/tech-kern/2014/08/22/msg017535.html

Could you comment this specification?

Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Thor Lancelot Simon
2014-08-22 03:14:57 UTC
Permalink
Post by Kengo NAKAHARA
I reconsider this matter, I also think intrctl should use strings.
Please see
http://mail-index.netbsd.org/tech-kern/2014/08/22/msg017535.html
Could you comment this specification?
I think it's better.

Thor
Kengo NAKAHARA
2014-09-12 02:47:03 UTC
Permalink
Hi,
Post by Kengo NAKAHARA
I implement "intrctl" for amd64 and i386. Here is the implementation,
https://github.com/knakahara/netbsd-src/compare/rfc/irq-affinity2
and here is the patch
http://knakahara.github.io/patches/netbsd/irq-affinity-initctl.patch
I implement MI version intrctl. Here is the implementation,
https://github.com/knakahara/netbsd-src/compare/rfc/irq-affinity3
and here is the patch
http://knakahara.github.io/patches/netbsd/irq-affinity-mi.patch

I move MI code to sys/kern/subr_intr.c, of course the code use
intr_distribute() and other MI interfaces.

This version show interrupt list by interrupt name and MI cpu index.
==========
# intrctl list
interrupt name CPU#00(+) CPU#01(-) CPU#02(-) CPU#03(-) CPU#04(+) CPU#05(+) CPU#06(+) CPU#07(+)
ioapic0 pin 3 0* 0 0 0 0 0 0 0 unknown
ioapic0 pin 4 0* 0 0 0 0 0 0 0 unknown
ioapic0 pin 18 0* 0 0 0 0 0 0 0 unknown
ioapic0 pin 19 1648* 0 0 34 0 0 0 0 unknown, unknown
ioapic0 pin 23 375 0 0 270 273* 0 0 0 unknown, unknown
ioapic0 pin 22 0* 0 0 0 0 0 0 0 unknown
ioapic0 pin 21 0* 0 0 0 0 0 0 0 unknown
ioapic0 pin 20 0* 0 0 0 0 0 0 0 unknown
ioapic0 pin 9 0* 0 0 0 0 0 0 0 unknown
==========

And affinity syntax is here.
intrctl affinity -c 7 -i 'ioapic0 pin 23'
7 : MI cpu index (not cpuid)
'ioapic0 pin 23' : interrupt name same as vmstat -i

Therefore, I test on FUJITSU PRIMERGY RX200 S6 which has two ioapics.
It works well to set affinity 'ioapic1 pin 0' and 'ioapic1 pin 4'.

Could you comment the implementation?

Thanks,
--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Core Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>
Loading...