EINVAL from copyin/out

Discussion:

EINVAL from copyin/out - how?

Robert Elz

2014-08-12 15:25:30 UTC

I am trying to debug a problem with cd9660 fs reading in NetBSD current
(and I presume in 7 BETA) - some files generate EINVAL when read (on
older NetBSDs the same disc reads just fine).

I have traced the source of the EINVAL to copyout (or copyin when using
a utility that reads files using mmap()/write()), but I cannot find any way for
copyout (or copyin) to ever return anything other than success or EFAULT,
so I am currently lost explaining the EINVAL...

Can anyone suggest where the EINVAL may be coming from, and what potential
causes might be? copyout(9) only mentions EFAULT as a possible error (or
does in the older man pages I have available to read at the minute.)

This is on amd64 - I am actually debugging on 6.99.47 kernel sources (a
version I had lying around convenient) - but the same error occurs on 6.99.49
and going back quite a few versions (I think I may have first seen it in
6.99.30 or there about - but I don't run anything even close to all
versions, so this is no evidence as to when the problem started - I am
guessing at the changes to the cd9660 fs code several months ago. The disc
(a DVD, but I have also see it on BD) reads fine on a 6.99.17 i386 system.

Most files read OK, only a few have "issues" (error is always EINVAL).
Directory reads all look fine, only file reds give problems.

Tracing goes from cd9660_read() (error detected from call to ubc_uiomove())
down to copyout_vmspace() where the error comes from the copyout() call
that is done when the current process is the one whose vmspace is wanted.

That is ...

if (__predict_true(vm == curproc->p_vmspace)) {
return copyout(kaddr, uaddr, len);
}

those few lines end up doing return(EINVAL).

How?

copyout is from amd64/copy.S and while my x86 assembler skill is close to
0, even I can grep for EINVAL, 22, and 16, and upon finding none of those
(anywhere in the file) conclude that it would be difficult for copyout to
actually return EINVAL.

Any clues? If I can understand this, I may be able to track down what
changed between 6.99.17 and current that is causing these problems.

kre

ps: the disc (DVD or BD) in question was generated on NetBSD, using
mkisofs (from pkgsrc) - more or less using the cddist script from the
pkgtools/cdpack package (the mkisofs call is essentially identical,
I get the files into the directories to put on the DVD in a slightly
different manner than the script does. It is also not just one disc
that might have some weird issue - many discs I have made using the same
procedure have similar problems (EINVAL always) - even my 6.99.47 install
DVD (on which I include source sets) gave me EINVAL when I was trying to
extract the sources to start this debug ... I used the older kernel to
extract them.

matthew green

2014-08-12 18:38:19 UTC

Permalink

no idea on the real problem, but to answer why you might get
strange returns from copyin/copyout().

these can trigger page faults, which in turn can trigger paging
(in this case, most likely from swap.) the result of that is
eventually handed back to copyout(), so you kind of have to read
much of uvm and the platform pmap, plus IO code, to see all the
potential places an error might be generated.

that said, i've never seen EINVAL before.

.mrg.

Robert Elz

2014-08-13 00:28:52 UTC

Permalink

Date: Wed, 13 Aug 2014 04:38:19 +1000
From: matthew green <***@eterna.com.au>
Message-ID: <***@splode.eterna.com.au>

| these can trigger page faults, which in turn can trigger paging
| (in this case, most likely from swap.)

Unlikely in this case, and I can easily arrange for no swap space to
validate that - the test is to boot, mount the cd (dvd), cd to it, and
tar cf - . >/dev/null

Page faults are entirely possible, but they should be zfod pages being
created, rather than anything read from swap (there's plenty of ram
available, and fortunately, on this particular dvd, the bad files are
found very early - if after that, I let tar continue, it will happily
read lots more from the dvd and happily write it to /dev/null).

| the result of that is eventually handed back to copyout(),

It would be useful to have copyout(9) at least mention whatever error
codes are possible ...

I am going to keep hunting - when I first started looking, I was expecting
to find the EINVAL being generated by something looking at the structure of
the filesystem on the DVD, and then perhaps to initiate a "discussion" on
whether it was mkisofs or cd9660fs interpreting the spec in an unexpected
way ... but that isn't how it turned out.

kre

Robert Elz

2014-08-15 07:59:38 UTC

Permalink

Date: Wed, 13 Aug 2014 07:28:52 +0700
From: Robert Elz <***@munnari.OZ.AU>
Message-ID: <***@perseus.noi.kre.to>

| I am going to keep hunting [...]

I have added port-amd64 to the To list, since it might help to get
amd64 expertise to make some suggestions... (you can look for the
few earlier messages on tech-kern - or a very brief summary is that
copyout (for read sys calls) or copyin (write of mmap'd file) generating
EINVAL when attempting to read (some) files from a cd9660 filesystem on a DVD).

I have now tried this on an i386 (relatively) current kernel, and
get no problems reading the DVD there - amd64 is the only other
architecture I have available that I can reasonably test on...

I have also found that the problem is non-deterministic, sometimes it
all works just fine. I found a DVD image (just 2.1GiB, so a fairly
small one...) lying around, that I also had burned to a DVD. That one
(the DVD) also gave errors (and it seems as if all the errors - the EINVAL
on copyout) happen on files in 3rd level directories (don't know about
deeper, there were none of those - but it looked as if all the files in
the 3rd level directories (that is, root of the dvd, 2nd level subdir made
in root, 3rd level subdir made in the 2nd level subdir - files in the 3rd
level subdir) caused EINVAL when tar attempts to read them (or cp attempts to
write files from the dvd that have been mmap'd). The EINVAL comes from the
read (r write) of the first block.

I copied the image file to the test system, vnconfig, mount, and no problems
reading the files from that - that was the image that was written to the DVD
(not recreated or anything like that).

At this point it looks to be some kind of setup problem on amd64, when it
reads nested directories, building some data struct that resuts in EINVAL
from the copyout ... (I still have no idea where the EIVAL originates.)

However, just for "fun" I thought I'd do a diff -r of the mount points
of the real DVD (/cdrom) and the mounted vnd0a (/mnt) - that completed
without error.

What's more, after that, tar had no problem reading the DVD either!
(I had not tried tar on this image since the test system was rebooted).

I'm still looking for suggestions as to likely (or even possible) sources
of EINVAL from copyout - copyout itself never generates that, so some kind
of fault must be setting it - but what? If anyone has any suggestiions for
possible sources, I'll happily mangle my kernel and have it printf whenever
EINVAL is generated (in a plausible place - I don't have the patience to
go adding printfs to every place EIVAL occurs anywhere in the kernel... just
the places that might make sense.)

kre

ps: I can make the 2.1GiB .iso image available, privately, to anyone who
wants to take a look .. it is just pkgsrc distfiles - I have not checked
though to verify that none of them have "no redistribution" licences, so I
am not going to put it anywhere public. E-mail me privately if you want
a URL from which you can ftp the image (and be prepared for it to take quite
a while to copy). I attempted to test this with a very small test DVD
(juft a few hundred KB in all) but that wouldn't fail for me ... Of course,
the non-deterinism makes it hard to be certain of anything - all I know for
sure is that I have seen the DVD made from the image in question generate
EINVAL when read with tar on a 4.99.49 amd64 kernel.

Thomas Schmitt

2014-08-15 08:56:07 UTC

Permalink

Hi,

i'm not a NetBSD expert, but had reason to look into its cd9660
code for developing two PRs (48808, 48959).

Post by Robert Elz
At this point it looks to be some kind of setup problem on amd64, when it
reads nested directories, building some data struct that resuts in EINVAL
from the copyout

I did not see any architecture specific code in cd9660.

Given the fact that you located the origin of EINVAL outside
of cd9660 (resp. underneath), i expect that ISO 9660 structure
aspects are only indirectly involved.
Especially the difference between i386 and amd64 can hardly
be explained by ISO 9660 aspects.

The connection between ISO 9660 and data file content blocks is
made by cd9660_bmap() as implementation of VOP_BMAP(9).

Post by Robert Elz
diff -r of the mount points of the real DVD (/cdrom) and the
mounted vnd0a (/mnt) - that completed without error.
What's more, after that, tar had no problem reading the DVD either!

It is unlikely that the DVD delivers varying data blocks
from the ISO 9660 directory files, which would lure cd9660_bmap()
into requesting invalid data block addresses for file content.

My favority suspect would now be the code underneath bread(9)
and especially the mechanism which looks up the buffer and
decides whether a physical read operation on the DVD is needed.

Post by Robert Elz
If anyone has any suggestiions for
possible sources, I'll happily mangle my kernel

How about having a wrapper around bread(9), which checks for
error replies of bread(9) and eventually prints some message
which tells the failed block number.
Then use that wrapper instead of all the bread(9) calls in cd9660.
(Should be possible with a few vi commands on the few cd9660*.c
files.)
This would make clear whether the problem is indeed underneath
bread(9) and whether it gets an implausible block number from
cd9660.

If the block number of a failure is in the size range of the ISO
image, the its content could be looked up by help of dd and be made
human readable by od or alike. Directory records show some
characteristic redundancy. So there is hope we can tell whether
it is metadata or data file content.
If the content is inconclusive, then i'd need the whole ISO
image in order to tell what the affected block shall mean.

Alternatively to an inspection of the ISO image, or if the block
number is implausible, you could equip the wrapper and its calls
with a string parameter which identifies the wrapper's caller.
So we could make a connection to particular cd9660 operations
and see whether it is always the same cd9660 gesture failing.

Post by Robert Elz
ps: I can make the 2.1GiB .iso image available,

If you have an image which never shows problems and one that
reliably shows problems (e.g. on the first tar) then we could
look at the block addresses of the data files and directory
files which are involved.

As said, it would be interesting to look up the blocks of
failing bread(9) calls.

Have a nice day :)

Thomas

Robert Elz

2014-08-15 09:51:24 UTC

Permalink

Date: Fri, 15 Aug 2014 10:56:07 +0200
From: "Thomas Schmitt" <***@gmx.net>
Message-ID: <***@scdbackup.webframe.org>

| Given the fact that you located the origin of EINVAL outside
| of cd9660 (resp. underneath), i expect that ISO 9660 structure
| aspects are only indirectly involved.

I agree, particularly as it works sometimes...

| Especially the difference between i386 and amd64 can hardly
| be explained by ISO 9660 aspects.

Unless there was some blatant unportable architecture dependant code
(assuming sizeof(char *) or something) - I agree, and if there were
I'd expect it to never work, or never work for some particular files,
or something like that.

| My favority suspect would now be the code underneath bread(9)
| and especially the mechanism which looks up the buffer and
| decides whether a physical read operation on the DVD is needed.

Could be ... but (and this may be just the time since I really understood
how kernels worked has been so long that my assumptions no matter match
reality) the reason I'm concentrating on the fults when using tar is that
the tar file read loop is basically

while ((n = read(fd, buf, bufzize)) > 0)
write(ofd, buf, n);

along with the stuff that writes the header before each file, and makes
sure then that the output file contains exactly the number of bytes that
the header said it would contain (no more, no less, even if the file
changes size while the loop is happening). All that is working fine
(ie: tar does exactly what it should be doing given what the kernel is doing).

The read() returns -1, with errno == EINVAL (on the first read of one of
the affected fies). Here is where I may not be understanding the kernel
operation as it currently is ... for a read (after validating the fd, etc)
the kernell (as I remember it) goes through the filesystem, using the vnode,
gets the appropriate data (from the cache, or by accessing the device driver
for the underlying device upon whilch the filesystem is mounted). During
this there can be EIO etc if things fail - but assuming it all succeeds
(and nly if it succeeds) the data (or perhaps just some of it if it was a
large read request - not relevent here, tar reads in reasonable buffer
sizes..) is iiving somewhere in kernel memory - the last step before returning
is to copy that out to user space - that's copyoyt()'s function (if the
user address is bogus, EFAULT will happen - tar isn't passing bogus addresses
and there are no EFAULTs). There's no guarantee that the user's buffer
is all in mapped memory (in fact, for some of my tests, I'd be surprised if
any of it was, unless the same buffer had been used to construct and write the
file header) so zfod page faults might happen during the copyut (but those
should not in any way be affected by the source of the copy - the cd9660 fs
stuff).

That's where I don't see (currently) where the EINVAL can be coming from.

| How about having a wrapper around bread(9), which checks for
| error replies of bread(9) and eventually prints some message
| which tells the failed block number.

Sure, I can do that - but if bread() failed, I'd expect to see EIO, not
EINVAL (normally anyway) - and unless in these modern kernels the bread()
is somehow being deferred until the cata is needed for copyout() I can't
see how it could possibly be relevant. But it is certainly an easy thing
to try, so I'll do that.

| Alternatively to an inspection of the ISO image, or if the block
| number is implausible, you could equip the wrapper and its calls
| with a string parameter which identifies the wrapper's caller.

I'll do that for sure anyway - stray output without knowing the source
just causes problems and confusion later.

Thanks,

kre

Thomas Schmitt

2014-08-15 11:38:09 UTC

Permalink

Hi,

Post by Robert Elz
the tar file read loop is basically
while ((n = read(fd, buf, bufzize)) > 0)
write(ofd, buf, n);

That would be regular file content and thus should touch
cd9660_bmap(), which would translate the file-internal block
number to the partition/device block number.

man 9 VOP_BMAP says:
"In case the logical block is not allocated, -1 is used."
But cd9660_bmap() does never set *ap->a_bnp to this value.
It happily computes the device block number from file start
block and desired file-internal block number.
On the other hand, a sane ISO 9660 filesystem should not
cause any block to be read which lies outside its hosting
device.

I assume read(2) or its helpers call VOP_BMAP(9) with the
vnode of the data file and then bread(9) with the device
vnode and the converted block number.

There is also cd9660_read()/VOP_READ(9) which i do not really
understand in the (vp->v_type == VREG) case. Especially i do
not see where the necessary file-to-device mapping is done.

In any case, i'd deem some printfs in cd9660_bmap() worthwhile
as long as there are no better ideas.
They should tell about the eventual emergency exit:
if (ap->a_bnp == NULL)
return (0);
resp. about the sucessfully mapped parameters:
af->a_bn , *af->a_bnp, *af->a_runp (if af->a_runp != NULL)

One should also find out, whether cd9660_read() is called
in the course of read(2) on a regular file.
if (vp->v_type == VREG) {
const int advice = IO_ADV_DECODE(ap->a_ioflag);

printf("cd9660_read() called\n");

...
}
(In that case one would have to learn how this function actually
works by help of ubc_uiomove(9) resp. how ubc_uiomove(9) learns
about the data block numbers of the regular file.)

Post by Robert Elz
The read() returns -1, with errno == EINVAL

Well, cd9660 has a few EINVALs. But except one in cd9660_read()
if (uio->uio_offset < 0)
return (EINVAL);
i cannot spot any other which would be suspicious to happen
underneath read(2).
(The others have to do with mounting, readdir, readlink, ...)

And overall we still lack an architecture specific trigger
for any error condition in cd9660.

Have a nice day :)

Thomas

Robert Elz

2014-08-15 15:19:34 UTC

Permalink

Date: Fri, 15 Aug 2014 13:38:09 +0200
From: "Thomas Schmitt" <***@gmx.net>
Message-ID: <***@scdbackup.webframe.org>

| One should also find out, whether cd9660_read() is called
| in the course of read(2) on a regular file.

It is, that's where the calls to the functions that eventually call
copyout() occurs - it's that that returns the EINVAL to user space.

| (In that case one would have to learn how this function actually
| works by help of ubc_uiomove(9) resp. how ubc_uiomove(9) learns
| about the data block numbers of the regular file.)

In theory if it works the way I understand it should) the data should
already be present in buffers somewhere by the time it reaches there,
that's also where the error occurs... (cs9660_read() calls ubc_uimove()
which calls .. which calls .. to copyout).

| Well, cd9660 has a few EINVALs. But except one in cd9660_read()
| if (uio->uio_offset < 0)
| return (EINVAL);

No, it isn't that one - it is quite certain that it is happening inside
copyoyt (as the code - does "return copyout(k, ua, len);", and that
return is returning EINVAL (I changed it to
error = copyout(...);
if (error == EINVAL) printf();
return error;
and the printf fires).

Since copyout itself cannot return EINVAL (that is the .S code that implements
it), it must be happening (so it seems) via some trap, that returns with EINVAL
in ra - causing copyout to return EINVAL. The question is whether it is
some kind of trap related to reading the kernel address data (the address
itself looks OK, the printf prints the args to copyout), or a trap related
the write into user space I have no idea at the minute (it could even be some
unrelated interrupt not restoring registers properly - but given the way the
error occurs, when it occurs, that's incredibly unlikely.)

I did wrap all the bread() (and breadn()) calls in the cd9660
filesystem code (only the ones in that directory - occurrences in 4 files)
and tried it on one of my discs that previously always gave errors on
files in the sub-dirs... this time it read the whole thing perfectly
(all 4.3GB of it). No errors reported anywhere. So, I went back to
an unmodified kernel (before any of my trace hackery) which used to fail
every time, and that one also read the whole disc without errors!

I really hate non-deterministic bugs! Must be something related to the
phase of the moon!

Thanks anyway - assuming I can get it to start failing again, I'll keep
digging.

kre

Robert Swindells

2014-08-15 17:36:16 UTC

Permalink

Post by Robert Elz
Since copyout itself cannot return EINVAL (that is the .S code that implements
it), it must be happening (so it seems) via some trap, that returns with EINVAL
in ra - causing copyout to return EINVAL. The question is whether it is
some kind of trap related to reading the kernel address data (the address
itself looks OK, the printf prints the args to copyout), or a trap related
the write into user space I have no idea at the minute (it could even be some
unrelated interrupt not restoring registers properly - but given the way the
error occurs, when it occurs, that's incredibly unlikely.)

My understanding of the way the VM system works is that a pmap is in two
parts, a machine independent part and a part that contains whatever data
structures are required by the MMU.

The MI part is filled in when you make a call to map an area of memory,
the MD part is left unchanged at this time.

When you call copy{in,out}(), for the first time on a particular page
the MMU will cause the CPU to trap as that page will be missing from
the MD table, the trap handler will fix up the MD table for this page
using the info in the MI part then return back to the copy function
which will then retry what it was trying to do.