[PATCH] GOP_ALLOC and fallocate for PUFFS

Discussion:

Emmanuel Dreyfus

2014-09-30 13:44:59 UTC

Hello

When a PUFFS filesystem uses the page cache, data enters the
cache with no guarantee it will be flushed. If it cannot be flushed
(because PUFFS write requests get EDQUOT or ENOSPC), then the
kernel will loop forever trying to flush data from the cache,
and the filesystem cannot be unmounted without -f (and data loss).

In the attached patch, I add in PUFFS:
- support for the fallocate operation
- a puffs_gop_alloc() function that uses fallocate
- when writing through the page cache, we first call GOP_ALLOC to make
sure backend storage is allocated for the data we cache. Debug printfs
show sane behavior, with GOP_ALLOC calling puffs_gop_alloc only when required.

If the filesystem does not implement fallocate, we keep the current
behavior of filling the page cache with data we are not sure we can flush.
Perhaps we can improve further: missing fallocate can be emulated by
writing zeroed chunks. I have implemented that in libperfuse, but
we may want to have this in libpuffs, enabled by a mount option. Input
welcome.
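
A minimal userland sketch of the "zeroed chunks" emulation, assuming plain
POSIX calls rather than the PUFFS protocol. The function name, the 64 KB
chunk size and the choice to write only past EOF (so existing data is never
overwritten) are assumptions for illustration, not taken from libperfuse.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define ZERO_CHUNK ((size_t)65536)  /* assumed chunk size */

/*
 * Hypothetical helper: back [off, off + len) with storage by writing
 * zeroes, but only past the current end of file. Returns 0 or an errno
 * value such as ENOSPC or EDQUOT.
 */
static int
emulate_fallocate_zero(int fd, off_t off, off_t len)
{
    struct stat st;
    char *zeroes;
    off_t pos, end;
    int error = 0;

    assert(off >= 0 && len > 0);
    if (fstat(fd, &st) == -1)
        return errno;

    end = off + len;
    /* Bytes before EOF may hold real data, so never start before it. */
    pos = st.st_size > off ? st.st_size : off;
    if (pos >= end)
        return 0;       /* range lies entirely before EOF */

    zeroes = calloc(1, ZERO_CHUNK);     /* zero-filled buffer */
    if (zeroes == NULL)
        return ENOMEM;

    while (pos < end) {
        size_t want = (end - pos) < (off_t)ZERO_CHUNK ?
            (size_t)(end - pos) : ZERO_CHUNK;
        ssize_t n = pwrite(fd, zeroes, want, pos);

        if (n == -1 && errno == EINTR)
            continue;
        if (n <= 0) {
            error = (n == -1) ? errno : EIO;
            break;
        }
        pos += n;
    }
    free(zeroes);
    return error;
}
```

A range that lies before EOF is left alone here, so holes inside the file
are not filled by this sketch.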

--
Emmanuel Dreyfus
***@netbsd.org

Antti Kantee

2014-09-30 15:30:05 UTC

Post by Emmanuel Dreyfus
Hello
When a PUFFS filesystem uses the page cache, data enters the
cache with no guarantee it will be flushed. If it cannot be flushed
(because PUFFS write requests get EDQUOT or ENOSPC), then the
kernel will loop forever trying to flush data from the cache,
and the filesystem cannot be unmounted without -f (and data loss).
In the attached patch, I add in PUFFS:
- support for the fallocate operation
- a puffs_gop_alloc() function that uses fallocate
- when writing through the page cache, we first call GOP_ALLOC to make
sure backend storage is allocated for the data we cache. Debug printfs
show sane behavior, with GOP_ALLOC calling puffs_gop_alloc only when required.
If the filesystem does not implement fallocate, we keep the current
behavior of filling the page cache with data we are not sure we can flush.
Perhaps we can improve further: missing fallocate can be emulated by
writing zeroed chunks. I have implemented that in libperfuse, but
we may want to have this in libpuffs, enabled by a mount option. Input
welcome.

Is it really better to sync fallocate, put stuff in the page cache and
flush the page cache some day instead of just having a write-through (or
write-first) page cache on the write() path? You also get rid of the
fallocate-not-implemented problem that way.

That still leaves the mmap path ... but mmap always causes annoying
problems and should just die ;)

Writing zeroes might be a bad emulation for distributed file systems,
though I guess you're the expert in that field and can evaluate the
risks better than me.
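
To make the write-first idea concrete, here is a toy userland model. It is
not PUFFS code: struct backend, struct cache and write_first() are invented
for illustration, and the "backend" only counts free space. It shows the
ordering only: the data reaches the cache only after the backend has
accepted it, so the cache never holds bytes that cannot be flushed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SIZE 4096

/* Toy stand-in for the file server; it only tracks how much space is left. */
struct backend {
    size_t space_left;
};

/* Toy stand-in for the cached pages of one file. */
struct cache {
    char data[CACHE_SIZE];
    size_t valid;       /* bytes of data[] known to match the backend */
};

static int
backend_write(struct backend *be, size_t len)
{
    if (len > be->space_left)
        return ENOSPC;
    be->space_left -= len;
    return 0;
}

/*
 * Write-first: send the data to the backend, and copy it into the cache
 * only if the backend accepted it.
 */
static int
write_first(struct backend *be, struct cache *c, size_t off,
    const void *buf, size_t len)
{
    int error;

    if (off > CACHE_SIZE || len > CACHE_SIZE - off)
        return EFBIG;
    error = backend_write(be, len);
    if (error != 0)
        return error;   /* cache untouched: nothing is left to flush */
    memcpy(c->data + off, buf, len);
    if (off + len > c->valid)
        c->valid = off + len;
    return 0;
}
```

With a backend that has 8 bytes left, an 8-byte write succeeds and a
following 1-byte write returns ENOSPC to the caller while the cache stays
unchanged.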

Emmanuel Dreyfus

2014-09-30 16:24:14 UTC

GOP_ALLOC calls puffs_gop_alloc for chunks bigger than pages
(I observed 1 MB for now). If we have fallocate implemented in the
filesystem, this is really efficient, since fallocate saves us from
sending any data to write. Hence IMO fallocate should be the preferred
way if available.
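
The "fallocate first, otherwise something else" policy can be sketched in
userland with posix_fallocate(), which is only an analogue of the PUFFS
fallocate operation. reserve_storage() and enum reserve_how are hypothetical
names, and treating EOPNOTSUPP, EINVAL and ENOSYS as "not supported" is an
assumption about how a given system reports that.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

enum reserve_how { RESERVED_BY_FALLOCATE, NEEDS_WRITE_FIRST };

/*
 * Try to reserve [off, off + len) without sending any data. If the
 * filesystem cannot do that, tell the caller to fall back to write-first.
 * Real failures such as ENOSPC are returned as-is.
 */
static int
reserve_storage(int fd, off_t off, off_t len, enum reserve_how *how)
{
    int error;

    assert(off >= 0 && len > 0);
    /* posix_fallocate() returns an errno value directly, not -1. */
    error = posix_fallocate(fd, off, len);
    if (error == EOPNOTSUPP || error == EINVAL || error == ENOSYS) {
        *how = NEEDS_WRITE_FIRST;
        return 0;
    }
    *how = RESERVED_BY_FALLOCATE;
    return error;
}
```

The point of the split is that the cheap path sends no data at all, while
the caller still learns about ENOSPC or EDQUOT before anything is cached.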

But if it is not there, indeed, doing a write on the first attempt should
do the trick.

Post by Antti Kantee
Writing zeroes might be a bad emulation for distributed file systems, though
I guess you're the expert in that field and can evaluate the risks better
than me.

I understand that fallocate'd areas should return zeroes, so it should be
fine. The real problem is performance. I am not sure which approach is best.

I first thought about a puffs_gop_alloc like the one below, but that will not
work, as VOP_PUTPAGES goes to genfs_putpages, which calls
GOP_WRITE (genfs_gop_write), which calls VOP_STRATEGY without checking
for failure. Should I directly call VOP_STRATEGY?

int
puffs_gop_alloc(struct vnode *vp, off_t off, off_t len,
    int flags, kauth_cred_t cred)
{
    struct puffs_mount *pmp = MPTOPUFFSMP(vp->v_mount);

    /* Prefer fallocate when the file server implements it. */
    if (EXISTSOP(pmp, FALLOCATE))
        return _puffs_vnop_fallocate(vp, off, len);
    else
        return VOP_PUTPAGES(vp, off, off + len,
            PGO_CLEANIT|PGO_SYNCIO);
}

--
Emmanuel Dreyfus
***@netbsd.org

J. Hannken-Illjes

2014-09-30 16:38:56 UTC

Post by Emmanuel Dreyfus

GOP_ALLOC calls puffs_gop_alloc for chunks bigger than pages
(I observed 1 MB for now). If we have fallocate implemented in the
filesystem, this is really efficient, since fallocate saves us from
sending any data to write. Hence IMO fallocate should be the preferred
way if available.
But if it is not there, indeed, doing a write on the first attempt should
do the trick.

Writing zeroes might be a bad emulation for distributed file systems, though
I guess you're the expert in that field and can evaluate the risks better
than me.

genfs_gop_write calls genfs_do_io which does "error = biowait(mbp);"
near the end. This will catch errors from VOP_STRATEGY.

But why do you need GOP_ALLOC? Is there a consumer besides genfs_getpages
filling holes? Puffs doesn't return holes, as its VOP_BMAP always
returns valid ( != -1 ) block addresses.

--
J. Hannken-Illjes - ***@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Emmanuel Dreyfus

2014-09-30 18:50:49 UTC

Post by J. Hannken-Illjes
But why do you need GOP_ALLOC? Is there a consumer besides genfs_getpages
filling holes? Puffs doesn't return holes, as its VOP_BMAP always
returns valid ( != -1 ) block addresses.

Well, I cannot use VOP_PUTPAGES if there are no pages mapped for the
file, can I?

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
***@netbsd.org

Emmanuel Dreyfus

2014-10-02 04:45:12 UTC

Post by J. Hannken-Illjes
genfs_gop_write calls genfs_do_io which does "error = biowait(mbp);"
near the end. This will catch errors from VOP_STRATEGY.

I ran the code below, but VOP_PUTPAGES never returns anything other than 0.

int
puffs_gop_alloc(struct vnode *vp, off_t off, off_t len,
    int flags, kauth_cred_t cred)
{
    struct puffs_mount *pmp = MPTOPUFFSMP(vp->v_mount);
    off_t start, end;
    int pgo_flags = PGO_CLEANIT|PGO_SYNCIO|PGO_PASTEOF;
    int u_flags = PUFFS_UPDATESIZE|PUFFS_UPDATEMTIME|PUFFS_UPDATECTIME;
    int error;

    if (EXISTSOP(pmp, FALLOCATE)) {
        error = _puffs_vnop_fallocate(vp, off, len);
        goto out;
    }

    /* Flush the whole pages that cover [off, off + len). */
    start = trunc_page(off);
    end = round_page(off + len);

    if (off + len > vp->v_size)
        uvm_vnp_setwritesize(vp, off + len);

    /* VOP_PUTPAGES is entered with the interlock held and releases it. */
    mutex_enter(vp->v_interlock);
    error = VOP_PUTPAGES(vp, start, end, pgo_flags);

    if (off + len > vp->v_size) {
        if (error == 0) {
            uvm_vnp_setsize(vp, off + len);
            puffs_updatenode(VPTOPP(vp), u_flags, vp->v_size);
        } else {
            /* Undo the tentative write size on failure. */
            uvm_vnp_setwritesize(vp, vp->v_size);
        }
    }
out:
    return error;
}
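
For reference, the page rounding done by trunc_page() and round_page() can
be reproduced in userland as follows. This is a sketch with an assumed
4096-byte page size (the real value depends on the port), and the sk_ names
are made up to avoid clashing with the kernel macros.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SIZE ((uint64_t)4096)   /* assumed; the real value is per-port */
#define SK_PAGE_MASK (SK_PAGE_SIZE - 1)

/* Round down to the start of the page containing x. */
static uint64_t
sk_trunc_page(uint64_t x)
{
    return x & ~SK_PAGE_MASK;
}

/* Round up to the next page boundary (x itself if already aligned). */
static uint64_t
sk_round_page(uint64_t x)
{
    return (x + SK_PAGE_MASK) & ~SK_PAGE_MASK;
}

/* Number of whole pages covering the byte range [off, off + len). */
static uint64_t
sk_pages_covering(uint64_t off, uint64_t len)
{
    assert(len > 0);
    return (sk_round_page(off + len) - sk_trunc_page(off)) / SK_PAGE_SIZE;
}
```

A two-byte range starting at offset 4095 straddles a page boundary, so it
covers two pages even though it is far smaller than one.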

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
***@netbsd.org

J. Hannken-Illjes

2014-10-02 10:02:39 UTC

Post by Emmanuel Dreyfus

Post by J. Hannken-Illjes
genfs_gop_write calls genfs_do_io which does "error = biowait(mbp);"
near the end. This will catch errors from VOP_STRATEGY.

I ran the code below, but VOP_PUTPAGES never returns anything other than 0.

Sure -- why should it?

GOP_ALLOC() gets called from VOP_GETPAGES() for missing pages. Here you
run VOP_PUTPAGES() on a range known to be unmapped so it becomes a NOP.

GOP_ALLOC() aka puffs_gop_alloc() has to run on the client to make
sure the pages in question are allocated and may be faulted in.

The client has to fill holes or extend files.
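
The NOP behavior can be pictured with a toy model. This is not genfs code:
the toy_ structures and toy_putpages() are invented for illustration, and
the model only captures that a flush looks at resident dirty pages and
nothing else.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

struct toy_page {
    bool resident;      /* page is in the cache */
    bool dirty;         /* page has data not yet written back */
};

struct toy_vnode {
    struct toy_page pages[NPAGES];
    int backend_error;  /* what the backend would answer to a write */
    int writes_sent;    /* number of write requests issued */
};

/* Flush pages [first, last): only resident dirty pages cause a write. */
static int
toy_putpages(struct toy_vnode *vn, size_t first, size_t last)
{
    assert(first <= last && last <= NPAGES);
    for (size_t i = first; i < last; i++) {
        struct toy_page *pg = &vn->pages[i];

        if (!pg->resident || !pg->dirty)
            continue;   /* nothing cached here, nothing to send */
        vn->writes_sent++;
        if (vn->backend_error != 0)
            return vn->backend_error;
        pg->dirty = false;
    }
    return 0;
}
```

In this model, even with a backend that would answer ENOSPC to every
write, flushing a range with no resident pages sends nothing and returns 0.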

--
J. Hannken-Illjes - ***@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Emmanuel Dreyfus

2014-10-02 15:23:51 UTC

Post by J. Hannken-Illjes
GOP_ALLOC() gets called from VOP_GETPAGES() for missing pages. Here you
run VOP_PUTPAGES() on a range known to be unmapped so it becomes a NOP.
GOP_ALLOC() aka puffs_gop_alloc() has to run on the client to make
sure the pages in question are allocated and may be faulted in.

You mean I have to run VOP_GETPAGES at the beginning of a cached
write operation?

--
Emmanuel Dreyfus
***@netbsd.org

J. Hannken-Illjes

2014-10-02 15:46:50 UTC

Post by Emmanuel Dreyfus

You mean I have to run VOP_GETPAGES at the beginning of a cached
write operation?

Please describe "cached write operation" in terms of vnode operations.

Which vnode operation finally calls GOP_ALLOC()?

--
J. Hannken-Illjes - ***@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Emmanuel Dreyfus

2014-10-02 17:09:56 UTC

Post by J. Hannken-Illjes
Please describe "cached write operation" in terms of vnode operations.

A write on a mount that uses page cache, without direct I/O.

Post by J. Hannken-Illjes
Which vnode operation finally calls GOP_ALLOC()?

From genfs, only VOP_GETPAGES, but I understand we should call it on our
own. For instance, UFS calls it through ufs_balloc_range() in VOP_WRITE.

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
***@netbsd.org

J. Hannken-Illjes

2014-10-02 17:40:49 UTC

Post by Emmanuel Dreyfus

Post by J. Hannken-Illjes
Please describe "cached write operation" in terms of vnode operations.

A write on a mount that uses page cache, without direct I/O.

Post by J. Hannken-Illjes
Which vnode operation finally calls GOP_ALLOC()?

From genfs, only VOP_GETPAGES, but I understand we should call it on our
own. For instance, UFS calls it through ufs_balloc_range() in VOP_WRITE.

Ok -- if you want to use it, you have to implement it on the client side
as puffs has no idea how to allocate blocks -- right?

--
J. Hannken-Illjes - ***@eis.cs.tu-bs.de - TU Braunschweig (Germany)

Emmanuel Dreyfus

2014-10-02 18:54:16 UTC

Post by J. Hannken-Illjes
Ok -- if you want to use it, you have to implement it on the client side
as puffs has no idea how to allocate blocks -- right?

Here is my GOP_ALLOC so far. It uses fallocate if available, and otherwise
writes zeroes. puffs_vnop_write() was split into puffs_vnop_write_cache()
and puffs_vnop_write_fs().

int
puffs_gop_alloc(struct vnode *vp, off_t off, off_t len,
    int flags, kauth_cred_t cred)
{
    struct puffs_mount *pmp = MPTOPUFFSMP(vp->v_mount);
    int uflags = 0;
    void *zbuf;
    struct iovec iov;
    struct uio uio;
    int error;

    if (EXISTSOP(pmp, FALLOCATE)) {
        error = _puffs_vnop_fallocate(vp, off, len);
        goto out;
    }

    /* kmem_zalloc() returns zero-filled memory; kmem_alloc() does not. */
    zbuf = kmem_zalloc(len, KM_SLEEP);

    iov.iov_base = zbuf;
    iov.iov_len = len;

    UIO_SETUP_SYSSPACE(&uio);
    uio.uio_iov = &iov;
    uio.uio_iovcnt = 1;
    uio.uio_offset = off;
    uio.uio_resid = len;
    uio.uio_rw = UIO_WRITE;

    /* Write the zeroes synchronously so that ENOSPC/EDQUOT show up here. */
    error = _puffs_vnop_write_fs(vp, &uio, IO_SYNC, cred, &uflags);
    if (error == 0)
        puffs_updatenode(VPTOPP(vp), uflags, vp->v_size);

    kmem_free(zbuf, len);

out:
    return error;
}

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
***@netbsd.org

Emmanuel Dreyfus

2014-10-03 00:09:33 UTC

Here is my latest patch:
http://ftp.espci.fr/shadow/manu/puffs-falloc.patch

The whole thing is not very satisfying. Here are a few issues:

- if we do not have fallocate and go the emulation way, writing zeroes,
we must first call VOP_PUTPAGES for the whole vnode so that no data
remains in the cache while we try adding new blocks. This means that even
writing a single byte will cause the whole file to be flushed,
completely defeating the write cache.

- I allocate file storage for any cached write, without actually checking
whether the written areas are already mapped to allocated storage. How
could I check that?

Generally speaking, I am starting to wonder whether it is the kernel's job
to maintain the PUFFS write cache.

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
***@netbsd.org

Emmanuel Dreyfus

2014-10-04 04:52:07 UTC

Post by Emmanuel Dreyfus
When a PUFFS filesystem uses the page cache, data enters the
cache with no guarantee it will be flushed. If it cannot be flushed
(because PUFFS write requests get EDQUOT or ENOSPC), then the
kernel will loop forever trying to flush data from the cache,
and the filesystem cannot be unmounted without -f (and data loss).

After some thought about the problem, I am not sure how it can be
solved.

For a given write, PUFFS does not know if the backend storage is
allocated or not. Of course there is the obvious case of writing beyond
EOF, but in the general case we do not know if we write to a hole or to
an already written area.

If we want to ensure backend allocation is done, we must do a
write-first to the filesystem before filling the cache. But when writing
before EOF, we must first read data from file (possibly getting zeroes
from unallocated area, or real data), then write. And since we have no
way of tracking where we already did a write, we are going to do it for
any write, even for a single byte.
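
The read-then-write step for a range before EOF can be sketched in userland
as follows, assuming plain POSIX pread()/pwrite() instead of PUFFS requests.
rewrite_in_place() and its 64 KB chunk size are hypothetical, and whether
writing the same bytes back really forces allocation depends on the backend.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define RW_CHUNK ((size_t)65536)    /* assumed chunk size */

/*
 * Hypothetical write-first for a range that lies before EOF: read what
 * the file currently holds (zeroes where there is a hole) and write the
 * same bytes back, so the backend has to allocate storage or fail.
 * Returns 0 or an errno value; stops quietly at EOF.
 */
static int
rewrite_in_place(int fd, off_t off, off_t len)
{
    char *buf;
    off_t pos = off, end = off + len;
    int error = 0;

    assert(off >= 0 && len > 0);
    buf = malloc(RW_CHUNK);
    if (buf == NULL)
        return ENOMEM;

    while (pos < end && error == 0) {
        size_t want = (end - pos) < (off_t)RW_CHUNK ?
            (size_t)(end - pos) : RW_CHUNK;
        ssize_t got = pread(fd, buf, want, pos);
        ssize_t done = 0;

        if (got == -1) {
            error = errno;
            break;
        }
        if (got == 0)
            break;      /* reached EOF */
        while (done < got) {
            ssize_t n = pwrite(fd, buf + done, (size_t)(got - done),
                pos + done);
            if (n <= 0) {
                error = (n == -1) ? errno : EIO;
                break;
            }
            done += n;
        }
        pos += done;
    }
    free(buf);
    return error;
}
```

Every byte of the range crosses the wire twice here, which is the cost the
paragraph above describes.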

This seems to completely defeat the purpose of the page cache.

Of course things are different if we have fallocate, but that does not seem
achievable in the near future for FFS.

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
***@netbsd.org