Chris Mason [Sat, 26 Apr 2014 12:02:03 +0000 (05:02 -0700)]
Btrfs: limit the path size in send to PATH_MAX
fs_path_ensure_buf is used to make sure our path buffers for
send are big enough for the path names as we construct them.
The buffer size is limited to 32K by the length field in
the struct.
But bugs in the path construction can end up trying to build
a huge buffer, and we'll do invalid memmmoves when the
buffer length field wraps.
This patch is step one, preventing the overflows.
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Thu, 24 Apr 2014 14:15:29 +0000 (15:15 +0100)]
Btrfs: correctly set profile flags on seqlock retry
If we had to retry on the profiles seqlock (due to a concurrent write), we
would set bits on the input flags that corresponded both to the current
profile and to previous values of the profile.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Thu, 24 Apr 2014 14:15:28 +0000 (15:15 +0100)]
Btrfs: use correct key when repeating search for extent item
If skinny metadata is enabled and our first tree search fails to find a
skinny extent item, we may repeat a tree search for a "fat" extent item
(if the previous item in the leaf is not the "fat" extent we're looking
for). However we were not setting the new key's objectid to the right
value, as we previously used the same key variable to peek at the previous
item in the leaf, which has a different objectid. So just set the right
objectid to avoid modifying/deleting a wrong item if we repeat the tree
search.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Miao Xie [Wed, 23 Apr 2014 11:33:36 +0000 (19:33 +0800)]
Btrfs: fix inode caching vs tree log
Currently, with inode cache enabled, we will reuse its inode id immediately
after unlinking file, we may hit something like following:
|->iput inode
|->return inode id into inode cache
|->create dir,fsync
|->power off
An easy way to reproduce this problem is:
mkfs.btrfs -f /dev/sdb
mount /dev/sdb /mnt -o inode_cache,commit=100
dd if=/dev/zero of=/mnt/data bs=1M count=10 oflag=sync
inode_id=`ls -i /mnt/data | awk '{print $1}'`
rm -f /mnt/data
i=1
while [ 1 ]
do
mkdir /mnt/dir_$i
test1=`stat /mnt/dir_$i | grep Inode: | awk '{print $4}'`
if [ $test1 -eq $inode_id ]
then
dd if=/dev/zero of=/mnt/dir_$i/data bs=1M count=1 oflag=sync
echo b > /proc/sysrq-trigger
fi
sleep 1
i=$(($i+1))
done
mount /dev/sdb /mnt
umount /dev/sdb
btrfs check /dev/sdb
We fix this problem by adding unlinked inode's id into pinned tree,
and we can not reuse them until committing transaction.
Cc: stable@vger.kernel.org
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Wed, 23 Apr 2014 11:33:35 +0000 (19:33 +0800)]
Btrfs: fix possible memory leaks in open_ctree()
Fix possible memory leaks in the following error handling paths:
read_tree_block()
btrfs_recover_log_trees
btrfs_commit_super()
btrfs_find_orphan_roots()
btrfs_cleanup_fs_roots()
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Wed, 23 Apr 2014 11:33:34 +0000 (19:33 +0800)]
Btrfs: avoid triggering bug_on() when we fail to start inode caching task
When running stress test(including snapshots,balance,fstress), we trigger
the following BUG_ON() which is because we fail to start inode caching task.
[ 181.131945] kernel BUG at fs/btrfs/inode-map.c:179!
[ 181.137963] invalid opcode: 0000 [#1] SMP
[ 181.217096] CPU: 11 PID: 2532 Comm: btrfs Not tainted 3.14.0 #1
[ 181.240521] task:
ffff88013b621b30 ti:
ffff8800b6ada000 task.ti:
ffff8800b6ada000
[ 181.367506] Call Trace:
[ 181.371107] [<
ffffffffa036c1be>] btrfs_return_ino+0x9e/0x110 [btrfs]
[ 181.379191] [<
ffffffffa038082b>] btrfs_evict_inode+0x46b/0x4c0 [btrfs]
[ 181.387464] [<
ffffffff810b5a70>] ? autoremove_wake_function+0x40/0x40
[ 181.395642] [<
ffffffff811dc5fe>] evict+0x9e/0x190
[ 181.401882] [<
ffffffff811dcde3>] iput+0xf3/0x180
[ 181.408025] [<
ffffffffa03812de>] btrfs_orphan_cleanup+0x1ee/0x430 [btrfs]
[ 181.416614] [<
ffffffffa03a6abd>] btrfs_mksubvol.isra.29+0x3bd/0x450 [btrfs]
[ 181.425399] [<
ffffffffa03a6cd6>] btrfs_ioctl_snap_create_transid+0x186/0x190 [btrfs]
[ 181.435059] [<
ffffffffa03a6e3b>] btrfs_ioctl_snap_create_v2+0xeb/0x130 [btrfs]
[ 181.444148] [<
ffffffffa03a9656>] btrfs_ioctl+0xf76/0x2b90 [btrfs]
[ 181.451971] [<
ffffffff8117e565>] ? handle_mm_fault+0x475/0xe80
[ 181.459509] [<
ffffffff8167ba0c>] ? __do_page_fault+0x1ec/0x520
[ 181.467046] [<
ffffffff81185b35>] ? do_mmap_pgoff+0x2f5/0x3c0
[ 181.474393] [<
ffffffff811d4da8>] do_vfs_ioctl+0x2d8/0x4b0
[ 181.481450] [<
ffffffff811d5001>] SyS_ioctl+0x81/0xa0
[ 181.488021] [<
ffffffff81680b69>] system_call_fastpath+0x16/0x1b
We should avoid triggering BUG_ON() here, instead, we output warning messages
and clear inode_cache option.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Wed, 23 Apr 2014 11:33:33 +0000 (19:33 +0800)]
Btrfs: move btrfs_{set,clear}_and_info() to ctree.h
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
David Sterba [Tue, 15 Apr 2014 16:50:17 +0000 (18:50 +0200)]
btrfs: replace error code from btrfs_drop_extents
There's a case which clone does not handle and used to BUG_ON instead,
(testcase xfstests/btrfs/035), now returns EINVAL. This error code is
confusing to the ioctl caller, as it normally signifies errorneous
arguments.
Change it to ENOPNOTSUPP which allows a fall back to copy instead of
clone. This does not affect the common reflink operation.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Qu Wenruo [Tue, 15 Apr 2014 02:41:00 +0000 (10:41 +0800)]
btrfs: Change the hole range to a more accurate value.
Commit
3ac0d7b96a268a98bd474cab8bce3a9f125aaccf fixed the btrfs expanding
write problem but the hole punched is sometimes too large for some
iovec, which has unmapped data ranges.
This patch will change to hole range to a more accurate value using the
counts checked by the write check routines.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Christoph Jaeger [Sat, 12 Apr 2014 11:33:13 +0000 (13:33 +0200)]
btrfs: fix use-after-free in mount_subvol()
Pointer 'newargs' is used after the memory that it points to has already
been freed.
Picked up by Coverity - CID
1201425.
Fixes: 0723a0473f ("btrfs: allow mounting btrfs subvolumes with
different ro/rw options")
Signed-off-by: Christoph Jaeger <christophjaeger@linux.com>
Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Fri, 11 Apr 2014 10:32:25 +0000 (18:32 +0800)]
Btrfs: fix compile warnings on on avr32 platform
fs/btrfs/scrub.c: In function 'get_raid56_logic_offset':
fs/btrfs/scrub.c:2269: warning: comparison of distinct pointer types lacks a cast
fs/btrfs/scrub.c:2269: warning: right shift count >= width of type
fs/btrfs/scrub.c:2269: warning: passing argument 1 of '__div64_32' from incompatible pointer type
Since @rot is an int type, we should not use do_div(), fix it.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Harald Hoyer [Tue, 19 Nov 2013 10:36:05 +0000 (11:36 +0100)]
btrfs: allow mounting btrfs subvolumes with different ro/rw options
Given the following /etc/fstab entries:
/dev/sda3 /mnt/foo btrfs subvol=foo,ro 0 0
/dev/sda3 /mnt/bar btrfs subvol=bar,rw 0 0
you can't issue:
$ mount /mnt/foo
$ mount /mnt/bar
You would have to do:
$ mount /mnt/foo
$ mount -o remount,rw /mnt/foo
$ mount --bind -o remount,ro /mnt/foo
$ mount /mnt/bar
or
$ mount /mnt/bar
$ mount --rw /mnt/foo
$ mount --bind -o remount,ro /mnt/foo
With this patch you can do
$ mount /mnt/foo
$ mount /mnt/bar
$ cat /proc/self/mountinfo
49 33 0:41 /foo /mnt/foo ro,relatime shared:36 - btrfs /dev/sda3 rw,ssd,space_cache
87 33 0:41 /bar /mnt/bar rw,relatime shared:74 - btrfs /dev/sda3 rw,ssd,space_cache
Signed-off-by: Chris Mason <clm@fb.com>
David Sterba [Fri, 7 Feb 2014 13:34:12 +0000 (14:34 +0100)]
btrfs: export global block reserve size as space_info
Introduce a block group type bit for a global reserve and fill the space
info for SPACE_INFO ioctl. This should replace the newly added ioctl
(
01e219e8069516cdb98594d417b8bb8d906ed30d) to get just the 'size' part
of the global reserve, while the actual usage can be now visible in the
'btrfs fi df' output during ENOSPC stress.
The unpatched userspace tools will show the blockgroup as 'unknown'.
CC: Jeff Mahoney <jeffm@suse.com>
CC: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Sergei Trofimovich [Mon, 7 Apr 2014 07:55:46 +0000 (10:55 +0300)]
btrfs: fix crash in remount(thread_pool=) case
Reproducer:
mount /dev/ubda /mnt
mount -oremount,thread_pool=42 /mnt
Gives a crash:
? btrfs_workqueue_set_max+0x0/0x70
btrfs_resize_thread_pool+0xe3/0xf0
? sync_filesystem+0x0/0xc0
? btrfs_resize_thread_pool+0x0/0xf0
btrfs_remount+0x1d2/0x570
? kern_path+0x0/0x80
do_remount_sb+0xd9/0x1c0
do_mount+0x26a/0xbf0
? kfree+0x0/0x1b0
SyS_mount+0xc4/0x110
It's a call
btrfs_workqueue_set_max(fs_info->scrub_wr_completion_workers, new_pool_size);
with
fs_info->scrub_wr_completion_workers = NULL;
as scrub wqs get created only on user's demand.
Patch skips not-created-yet workqueues.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
CC: Qu Wenruo <quwenruo@cn.fujitsu.com>
CC: Chris Mason <clm@fb.com>
CC: Josef Bacik <jbacik@fb.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Fri, 14 Mar 2014 20:36:53 +0000 (16:36 -0400)]
Btrfs: abort the transaction when we don't find our extent ref
I'm not sure why we weren't aborting here in the first place, it is obviously a
bad time from the fact that we print the leaf and yell loudly about it. Fix
this up, otherwise we panic because our path could be pointing into oblivion.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Mon, 7 Apr 2014 14:10:40 +0000 (07:10 -0700)]
Btrfs: fix EINVAL checks in btrfs_clone
btrfs_drop_extents can now return -EINVAL, but only one caller
in btrfs_clone was checking for it. This adds it to the
caller for inline extents, which is where we really need it.
Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Wed, 2 Apr 2014 11:53:32 +0000 (19:53 +0800)]
Btrfs: fix unlock in __start_delalloc_inodes()
This patch fix a regression caused by the following patch:
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
break while loop will make us call @spin_unlock() without
calling @spin_lock() before, fix it.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Tue, 1 Apr 2014 10:01:43 +0000 (18:01 +0800)]
Btrfs: scrub raid56 stripes in the right way
Steps to reproduce:
# mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5
# mount /dev/sda8 /mnt
# btrfs scrub start -BR /mnt
# echo $? <--unverified errors make return value be 3
This is because we don't setup right mapping between physical
and logical address for raid56, which makes checksum mismatch.
But we will find everthing is fine later when rechecking using
btrfs_map_block().
This patch fixed the problem by settuping right mappings and
we only verify data stripes' checksums.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Tue, 1 Apr 2014 10:01:42 +0000 (18:01 +0800)]
Btrfs: don't compress for a small write
To compress a small file range(<=blocksize) that is not
an inline extent can not save disk space at all. skip it can
save us some cpu time.
This patch can also fix wrong setting nocompression flag for
inode, say a case when @total_in is 4096, and then we get
@total_compressed 52,because we do aligment to page cache size
firstly, and then we get into conclusion @total_in=@total_compressed
thus we will clear this inode's compression flag.
An exception comes from inserting inline extent failure but we
still have @total_compressed < @total_in,so we will still reset
inode's flag, this is ok, because we don't have good compression
effect.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Mon, 31 Mar 2014 13:53:25 +0000 (14:53 +0100)]
Btrfs: more efficient io tree navigation on wait_extent_bit
If we don't reschedule use rb_next to find the next extent state
instead of a full tree search, which is more efficient and safe
since we didn't release the io tree's lock.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Mon, 31 Mar 2014 13:52:14 +0000 (14:52 +0100)]
Btrfs: send, build path string only once in send_hole
There's no point building the path string in each iteration of the
send_hole loop, as it produces always the same string.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Gui Hecheng [Mon, 31 Mar 2014 10:03:25 +0000 (18:03 +0800)]
btrfs: filter invalid arg for btrfs resize
Originally following cmds will work:
# btrfs fi resize -10A <mnt>
# btrfs fi resize -10Gaha <mnt>
Filter the arg by checking the return pointer of memparse.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Sun, 30 Mar 2014 22:02:53 +0000 (23:02 +0100)]
Btrfs: send, fix data corruption due to incorrect hole detection
During an incremental send, when we finish processing an inode (corresponding to
a regular file) we would assume the gap between the end of the last processed file
extent and the file's size corresponded to a file hole, and therefore incorrectly
send a bunch of zero bytes to overwrite that region in the file.
This affects only kernel 3.14.
Reproducer:
mkfs.btrfs -f /dev/sdc
mount /dev/sdc /mnt
xfs_io -f -c "falloc -k 0
268435456" /mnt/foo
btrfs subvolume snapshot -r /mnt /mnt/mysnap0
xfs_io -c "pwrite -S 0x01 -b 9216
16190218 9216" /mnt/foo
xfs_io -c "pwrite -S 0x02 -b 1121
198720104 1121" /mnt/foo
xfs_io -c "pwrite -S 0x05 -b 9216
107887439 9216" /mnt/foo
xfs_io -c "pwrite -S 0x06 -b 9216
225520207 9216" /mnt/foo
xfs_io -c "pwrite -S 0x07 -b 67584
102138300 67584" /mnt/foo
xfs_io -c "pwrite -S 0x08 -b 7000
94897484 7000" /mnt/foo
xfs_io -c "pwrite -S 0x09 -b 113664
245083212 113664" /mnt/foo
xfs_io -c "pwrite -S 0x10 -b 123
17937788 123" /mnt/foo
xfs_io -c "pwrite -S 0x11 -b 39936
229573311 39936" /mnt/foo
xfs_io -c "pwrite -S 0x12 -b 67584
174792222 67584" /mnt/foo
xfs_io -c "pwrite -S 0x13 -b 9216
249253213 9216" /mnt/foo
xfs_io -c "pwrite -S 0x16 -b 67584
150046083 67584" /mnt/foo
xfs_io -c "pwrite -S 0x17 -b 39936
118246040 39936" /mnt/foo
xfs_io -c "pwrite -S 0x18 -b 67584
215965442 67584" /mnt/foo
xfs_io -c "pwrite -S 0x19 -b 33792
97096725 33792" /mnt/foo
xfs_io -c "pwrite -S 0x20 -b 125952
166300596 125952" /mnt/foo
xfs_io -c "pwrite -S 0x21 -b 123
1078957 123" /mnt/foo
xfs_io -c "pwrite -S 0x25 -b 9216
212044492 9216" /mnt/foo
xfs_io -c "pwrite -S 0x26 -b 7000
265037146 7000" /mnt/foo
xfs_io -c "pwrite -S 0x27 -b 42757
215922685 42757" /mnt/foo
xfs_io -c "pwrite -S 0x28 -b 7000
69865411 7000" /mnt/foo
xfs_io -c "pwrite -S 0x29 -b 67584
67948958 67584" /mnt/foo
xfs_io -c "pwrite -S 0x30 -b 39936
266967019 39936" /mnt/foo
xfs_io -c "pwrite -S 0x31 -b 1121
19582453 1121" /mnt/foo
xfs_io -c "pwrite -S 0x32 -b 17408
257710255 17408" /mnt/foo
xfs_io -c "pwrite -S 0x33 -b 39936
3895518 39936" /mnt/foo
xfs_io -c "pwrite -S 0x34 -b 125952
12045847 125952" /mnt/foo
xfs_io -c "pwrite -S 0x35 -b 17408
19156379 17408" /mnt/foo
xfs_io -c "pwrite -S 0x36 -b 39936
50160066 39936" /mnt/foo
xfs_io -c "pwrite -S 0x37 -b 113664
9549793 113664" /mnt/foo
xfs_io -c "pwrite -S 0x38 -b 105472
94391506 105472" /mnt/foo
xfs_io -c "pwrite -S 0x39 -b 23552
143632863 23552" /mnt/foo
xfs_io -c "pwrite -S 0x40 -b 39936
241283845 39936" /mnt/foo
xfs_io -c "pwrite -S 0x41 -b 113664
199937606 113664" /mnt/foo
xfs_io -c "pwrite -S 0x42 -b 67584
67380093 67584" /mnt/foo
xfs_io -c "pwrite -S 0x43 -b 67584
26793129 67584" /mnt/foo
xfs_io -c "pwrite -S 0x44 -b 39936
14421913 39936" /mnt/foo
xfs_io -c "pwrite -S 0x45 -b 123
253097405 123" /mnt/foo
xfs_io -c "pwrite -S 0x46 -b 1121
128233424 1121" /mnt/foo
xfs_io -c "pwrite -S 0x47 -b 105472
91577959 105472" /mnt/foo
xfs_io -c "pwrite -S 0x48 -b 1121
7245381 1121" /mnt/foo
xfs_io -c "pwrite -S 0x49 -b 113664
182414694 113664" /mnt/foo
xfs_io -c "pwrite -S 0x50 -b 9216
32750608 9216" /mnt/foo
xfs_io -c "pwrite -S 0x51 -b 67584
266546049 67584" /mnt/foo
xfs_io -c "pwrite -S 0x52 -b 67584
87969398 67584" /mnt/foo
xfs_io -c "pwrite -S 0x53 -b 9216
260848797 9216" /mnt/foo
xfs_io -c "pwrite -S 0x54 -b 39936
119461243 39936" /mnt/foo
xfs_io -c "pwrite -S 0x55 -b 7000
200178693 7000" /mnt/foo
xfs_io -c "pwrite -S 0x56 -b 9216
243316029 9216" /mnt/foo
xfs_io -c "pwrite -S 0x57 -b 7000
209658229 7000" /mnt/foo
xfs_io -c "pwrite -S 0x58 -b 101376
179745192 101376" /mnt/foo
xfs_io -c "pwrite -S 0x59 -b 9216
64012300 9216" /mnt/foo
xfs_io -c "pwrite -S 0x60 -b 125952
181705139 125952" /mnt/foo
xfs_io -c "pwrite -S 0x61 -b 23552
235737348 23552" /mnt/foo
xfs_io -c "pwrite -S 0x62 -b 113664
106021355 113664" /mnt/foo
xfs_io -c "pwrite -S 0x63 -b 67584
135753552 67584" /mnt/foo
xfs_io -c "pwrite -S 0x64 -b 23552
95730888 23552" /mnt/foo
xfs_io -c "pwrite -S 0x65 -b 11
17311415 11" /mnt/foo
xfs_io -c "pwrite -S 0x66 -b 33792
120695553 33792" /mnt/foo
xfs_io -c "pwrite -S 0x67 -b 9216
17164631 9216" /mnt/foo
xfs_io -c "pwrite -S 0x68 -b 9216
136065853 9216" /mnt/foo
xfs_io -c "pwrite -S 0x69 -b 67584
37752198 67584" /mnt/foo
xfs_io -c "pwrite -S 0x70 -b 101376
189717473 101376" /mnt/foo
xfs_io -c "pwrite -S 0x71 -b 7000
227463698 7000" /mnt/foo
xfs_io -c "pwrite -S 0x72 -b 9216
12655137 9216" /mnt/foo
xfs_io -c "pwrite -S 0x73 -b 7000
7488866 7000" /mnt/foo
xfs_io -c "pwrite -S 0x74 -b 113664
87813649 113664" /mnt/foo
xfs_io -c "pwrite -S 0x75 -b 33792
25802183 33792" /mnt/foo
xfs_io -c "pwrite -S 0x76 -b 39936
93524024 39936" /mnt/foo
xfs_io -c "pwrite -S 0x77 -b 33792
113336388 33792" /mnt/foo
xfs_io -c "pwrite -S 0x78 -b 105472
184955320 105472" /mnt/foo
xfs_io -c "pwrite -S 0x79 -b 101376
225691598 101376" /mnt/foo
xfs_io -c "pwrite -S 0x80 -b 23552
77023155 23552" /mnt/foo
xfs_io -c "pwrite -S 0x81 -b 11
201888192 11" /mnt/foo
xfs_io -c "pwrite -S 0x82 -b 11
115332492 11" /mnt/foo
xfs_io -c "pwrite -S 0x83 -b 67584
230278015 67584" /mnt/foo
xfs_io -c "pwrite -S 0x84 -b 11
120589073 11" /mnt/foo
xfs_io -c "pwrite -S 0x85 -b 125952
202207819 125952" /mnt/foo
xfs_io -c "pwrite -S 0x86 -b 113664
86672080 113664" /mnt/foo
xfs_io -c "pwrite -S 0x87 -b 17408
208459603 17408" /mnt/foo
xfs_io -c "pwrite -S 0x88 -b 7000
73372211 7000" /mnt/foo
xfs_io -c "pwrite -S 0x89 -b 7000
42252122 7000" /mnt/foo
xfs_io -c "pwrite -S 0x90 -b 23552
46784881 23552" /mnt/foo
xfs_io -c "pwrite -S 0x91 -b 101376
63172351 101376" /mnt/foo
xfs_io -c "pwrite -S 0x92 -b 23552
59341931 23552" /mnt/foo
xfs_io -c "pwrite -S 0x93 -b 39936
239599283 39936" /mnt/foo
xfs_io -c "pwrite -S 0x94 -b 67584
175643105 67584" /mnt/foo
xfs_io -c "pwrite -S 0x97 -b 23552
105534880 23552" /mnt/foo
xfs_io -c "pwrite -S 0x98 -b 113664
8236844 113664" /mnt/foo
xfs_io -c "pwrite -S 0x99 -b 125952
144489686 125952" /mnt/foo
xfs_io -c "pwrite -S 0xa0 -b 7000
73273112 7000" /mnt/foo
xfs_io -c "pwrite -S 0xa1 -b 125952
194580243 125952" /mnt/foo
xfs_io -c "pwrite -S 0xa2 -b 123
56296779 123" /mnt/foo
xfs_io -c "pwrite -S 0xa3 -b 11
233066845 11" /mnt/foo
xfs_io -c "pwrite -S 0xa4 -b 39936
197727090 39936" /mnt/foo
xfs_io -c "pwrite -S 0xa5 -b 101376
53579812 101376" /mnt/foo
xfs_io -c "pwrite -S 0xa6 -b 9216
85669738 9216" /mnt/foo
xfs_io -c "pwrite -S 0xa7 -b 125952
21266322 125952" /mnt/foo
xfs_io -c "pwrite -S 0xa8 -b 23552
125726568 23552" /mnt/foo
xfs_io -c "pwrite -S 0xa9 -b 9216
18423680 9216" /mnt/foo
xfs_io -c "pwrite -S 0xb0 -b 1121
165901483 1121" /mnt/foo
btrfs subvolume snapshot -r /mnt /mnt/mysnap1
xfs_io -c "pwrite -S 0xff -b 10
16190218 10" /mnt/foo
btrfs subvolume snapshot -r /mnt /mnt/mysnap2
md5sum /mnt/foo # returns
79e53f1466bfc09fd82b450689e6119e
md5sum /mnt/mysnap2/foo # returns
79e53f1466bfc09fd82b450689e6119e too
btrfs send /mnt/mysnap1 -f /tmp/1.snap
btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap
mkfs.btrfs -f /dev/sdc
mount /dev/sdc /mnt
btrfs receive /mnt -f /tmp/1.snap
btrfs receive /mnt -f /tmp/2.snap
md5sum /mnt/mysnap2/foo # returns
2bb414c5155767cedccd7063e51beabd !!
A testcase for xfstests follows soon too.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Dan Carpenter [Fri, 28 Mar 2014 08:06:00 +0000 (11:06 +0300)]
Btrfs: kmalloc() doesn't return an ERR_PTR
The error handling was copy and pasted from memdup_user(). It should be
checking for NULL obviously.
Fixes: abccd00f8af2 ('btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Thu, 27 Mar 2014 03:12:25 +0000 (11:12 +0800)]
Btrfs: fix snapshot vs nocow writting
While running fsstress and snapshots concurrently, we will hit something
like followings:
Thread 1 Thread 2
|->fallocate
|->write pages
|->join transaction
|->add ordered extent
|->end transaction
|->flushing data
|->creating pending snapshots
|->write data into src root's
fallocated space
After above work flows finished, we will get a state that source and
snapshot root share same space, but source root have written data into
fallocated space, this will make fsck fail to verify checksums for
snapshot root's preallocating file extent data.Nocow writting also
has this same problem.
Fix this problem by syncing snapshots with nocow writting:
1.for nocow writting,if there are pending snapshots, we will
fall into COW way.
2.if there are pending nocow writes, snapshots for this root
will be blocked until nocow writting finish.
Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Qu Wenruo [Thu, 27 Mar 2014 02:51:58 +0000 (02:51 +0000)]
btrfs: Change the expanding write sequence to fix snapshot related bug.
When testing fsstress with snapshot making background, some snapshot
following problem.
Snapshot 270:
inode 323: size 0
Snapshot 271:
inode 323: size 349145
|-------Hole---|---------Empty gap-------|-------Hole-----|
0 122880 172032 349145
Snapshot 272:
inode 323: size 349145
|-------Hole---|------------Data---------|-------Hole-----|
0 122880 172032 349145
The fsstress operation on inode 323 is the following:
write: offset 126832 len 43124
truncate: size 349145
Since the write with offset is consist of 2 operations:
1. punch hole
2. write data
Hole punching is faster than data write, so hole punching in write
and truncate is done first and then buffered write, so the snapshot 271 got
empty gap, which will not pass btrfsck.
To fix the bug, this patch will change the write sequence which will
first punch a hole covering the write end if a hole is needed.
Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
David Sterba [Wed, 26 Mar 2014 17:26:36 +0000 (18:26 +0100)]
btrfs: make device scan less noisy
Print the message only when the device is seen for the first time.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Jeff Mahoney [Wed, 26 Mar 2014 18:11:26 +0000 (14:11 -0400)]
btrfs: fix lockdep warning with reclaim lock inversion
When encountering memory pressure, testers have run into the following
lockdep warning. It was caused by __link_block_group calling kobject_add
with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL,
which gets us into reclaim context. The kobject doesn't actually need
to be added under the lock -- it just needs to ensure that it's only
added for the first block group to be linked.
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.14.0-rc8-default #1 Not tainted
---------------------------------------------------------
kswapd0/169 just changed the state of lock:
(&delayed_node->mutex){+.+.-.}, at: [<
ffffffffa018baea>] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]
but this lock took another, RECLAIM_FS-unsafe lock in the past:
(&found->groups_sem){+++++.}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&found->groups_sem);
local_irq_disable();
lock(&delayed_node->mutex);
lock(&found->groups_sem);
<Interrupt>
lock(&delayed_node->mutex);
*** DEADLOCK ***
2 locks held by kswapd0/169:
#0: (shrinker_rwsem){++++..}, at: [<
ffffffff81159e8a>] shrink_slab+0x3a/0x160
#1: (&type->s_umount_key#27){++++..}, at: [<
ffffffff811bac6f>] grab_super_passive+0x3f/0x90
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Fri, 28 Mar 2014 21:16:01 +0000 (17:16 -0400)]
Btrfs: hold the commit_root_sem when getting the commit root during send
We currently rely too heavily on roots being read-only to save us from just
accessing root->commit_root. We can easily balance blocks out from underneath a
read only root, so to save us from getting screwed make sure we only access
root->commit_root under the commit root sem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Thu, 13 Mar 2014 19:42:13 +0000 (15:42 -0400)]
Btrfs: remove transaction from send
Lets try this again. We can deadlock the box if we send on a box and try to
write onto the same fs with the app that is trying to listen to the send pipe.
This is because the writer could get stuck waiting for a transaction commit
which is being blocked by the send. So fix this by making sure looking at the
commit roots is always going to be consistent. We do this by keeping track of
which roots need to have their commit roots swapped during commit, and then
taking the commit_root_sem and swapping them all at once. Then make sure we
take a read lock on the commit_root_sem in cases where we search the commit root
to make sure we're always looking at a consistent view of the commit roots.
Previously we had problems with this because we would swap a fs tree commit root
and then swap the extent tree commit root independently which would cause the
backref walking code to screw up sometimes. With this patch we no longer
deadlock and pass all the weird send/receive corner cases. Thanks,
Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Fri, 28 Mar 2014 21:07:27 +0000 (17:07 -0400)]
Btrfs: don't clear uptodate if the eb is under IO
So I have an awful exercise script that will run snapshot, balance and
send/receive in parallel. This sometimes would crash spectacularly and when it
came back up the fs would be completely hosed. Turns out this is because of a
bad interaction of balance and send/receive. Send will hold onto its entire
path for the whole send, but its blocks could get relocated out from underneath
it, and because it doesn't old tree locks theres nothing to keep this from
happening. So it will go to read in a slot with an old transid, and we could
have re-allocated this block for something else and it could have a completely
different transid. But because we think it is invalid we clear uptodate and
re-read in the block. If we do this before we actually write out the new block
we could write back stale data to the fs, and boom we're screwed.
Now we definitely need to fix this disconnect between send and balance, but we
really really need to not allow ourselves to accidently read in stale data over
new data. So make sure we check if the extent buffer is not under io before
clearing uptodate, this will kick back EIO to the caller instead of reading in
stale data and keep us from corrupting the fs. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Thu, 27 Mar 2014 23:41:34 +0000 (19:41 -0400)]
Btrfs: check for an extent_op on the locked ref
We could have possibly added an extent_op to the locked_ref while we dropped
locked_ref->lock, so check for this case as well and loop around. Otherwise we
could lose flag updates which would lead to extent tree corruption. Thanks,
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Thu, 27 Mar 2014 18:56:51 +0000 (14:56 -0400)]
Btrfs: do not reset last_snapshot after relocation
This was done to allow NO_COW to continue to be NO_COW after relocation but it
is not right. When relocating we will convert blocks to FULL_BACKREF that we
relocate. We can leave some of these full backref blocks behind if they are not
cow'ed out during the relocation, like if we fail the relocation with ENOSPC and
then just drop the reloc tree. Then when we go to cow the block again we won't
lookup the extent flags because we won't think there has been a snapshot
recently which means we will do our normal ref drop thing instead of adding back
a tree ref and dropping the shared ref. This will cause btrfs_free_extent to
blow up because it can't find the ref we are trying to free. This was found
with my ref verifying tool. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Liu Bo [Mon, 10 Mar 2014 10:56:07 +0000 (18:56 +0800)]
Btrfs: fix a crash of clone with inline extents's split
xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
of inline extents in __btrfs_drop_extents().
For inline extents, we cannot duplicate another EXTENT_DATA item, because
it breaks the rule of inline extents, that is, 'start offset' needs to be 0.
We have set limitations for the source inode's compressed inline extents,
because it needs to decompress and recompress. Now the destination inode's
inline extents also need similar limitations.
With this, xfstests btrfs/035 doesn't run into panic.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Fri, 21 Mar 2014 22:30:44 +0000 (15:30 -0700)]
btrfs: fix uninit variable warning
fs/btrfs/send.c:2926: warning: ‘entry’ may be used uninitialized in this
function
Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Wed, 19 Mar 2014 17:35:14 +0000 (13:35 -0400)]
Btrfs: take into account total references when doing backref lookup
I added an optimization for large files where we would stop searching for
backrefs once we had looked at the number of references we currently had for
this extent. This works great most of the time, but for snapshots that point to
this extent and has changes in the original root this assumption falls on it
face. So keep track of any delayed ref mods made and add in the actual ref
count as reported by the extent item and use that to limit how far down an inode
we'll search for extents. Thanks,
Reportedy-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reported-by: Hugo Mills <hugo@carfax.org.uk>
Tested-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Wed, 19 Mar 2014 14:20:54 +0000 (14:20 +0000)]
Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
For an incremental send, fix the process of determining whether the directory
inode we're currently processing needs to have its move/rename operation delayed.
We were ignoring the fact that if the inode's new immediate ancestor has a higher
inode number than ours but wasn't renamed/moved, we might still need to delay our
move/rename, because some other ancestor directory higher in the hierarchy might
have an inode number higher than ours *and* was renamed/moved too - in this case
we have to wait for rename/move of that ancestor to happen before our current
directory's rename/move operation.
Simple steps to reproduce this issue:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/x1/x2
$ mkdir /mnt/a/Z
$ mkdir -p /mnt/a/x1/x2/x3/x4/x5
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/a/x1/x2/x3 /mnt/a/Z/X33
$ mv /mnt/a/x1/x2 /mnt/a/Z/X33/x4/x5/X22
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
The incremental send caused the kernel code to enter an infinite loop when
building the path string for directory Z after its references are processed.
A more complex scenario:
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b/c/d
$ mkdir /mnt/a/b/c/d/e
$ mkdir /mnt/a/b/c/d/f
$ mv /mnt/a/b/c/d/e /mnt/a/b/c/d/f/E2
$ mkdir /mmt/a/b/c/g
$ mv /mnt/a/b/c/d /mnt/a/b/D2
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mkdir /mnt/a/o
$ mv /mnt/a/b/c/g /mnt/a/b/D2/f/G2
$ mv /mnt/a/b/D2 /mnt/a/b/dd
$ mv /mnt/a/b/c /mnt/a/C2
$ mv /mnt/a/b/dd/f /mnt/a/o/FF
$ mv /mnt/a/b /mnt/a/o/FF/E2/BB
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Sun, 16 Mar 2014 20:37:26 +0000 (20:37 +0000)]
Btrfs: fix incremental send's decision to delay a dir move/rename
It's possible to change the parent/child relationship between directories
in such a way that if a child directory has a higher inode number than
its parent, it doesn't necessarily means the child rename/move operation
can be performed immediately. The parent migth have its own rename/move
operation delayed, therefore in this case the child needs to have its
rename/move operation delayed too, and be performed after its new parent's
rename/move.
Steps to reproduce the issue:
$ umount /mnt
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir /mnt/A
$ mkdir /mnt/B
$ mkdir /mnt/C
$ mv /mnt/C /mnt/A
$ mv /mnt/B /mnt/A/C
$ mkdir /mnt/A/C/D
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/A/C/D /mnt/A/D2
$ mv /mnt/A/C/B /mnt/A/D2/B2
$ mv /mnt/A/C /mnt/A/D2/B2/C2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send
The incremental send caused the kernel code to enter an infinite loop when
building the path string for directory C after its references are processed.
The necessary conditions here are that C has an inode number higher than both
A and B, and B as an higher inode number higher than A, and D has the highest
inode number, that is:
inode_number(A) < inode_number(B) < inode_number(C) < inode_number(D)
The same issue could happen if after the first snapshot there's any number
of intermediary parent directories between A2 and B2, and between B2 and C2.
A test case for xfstests follows, covering this simple case and more advanced
ones, with files and hard links created inside the directories.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Tue, 18 Mar 2014 17:56:06 +0000 (17:56 +0000)]
Btrfs: remove unnecessary inode generation lookup in send
No need to search in the send tree for the generation number of the inode,
we already have it in the recorded_ref structure passed to us.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Fri, 14 Mar 2014 20:55:01 +0000 (20:55 +0000)]
Btrfs: fix race when updating existing ref head
While we update an existing ref head's extent_op, we're not holding
its spinlock, so while we're updating its extent_op contents (key,
flags) we can have a task running __btrfs_run_delayed_refs() that
holds the ref head's lock and sets its extent_op to NULL right after
the task updating the ref head just checked its extent_op was not NULL.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Qu Wenruo [Wed, 12 Mar 2014 08:05:33 +0000 (08:05 +0000)]
btrfs: Add trace for btrfs_workqueue alloc/destroy
Since most of the btrfs_workqueue is printed as pointer address,
for easier analysis, add trace for btrfs_workqueue alloc/destroy.
So it is possible to determine the workqueue that a given work belongs
to(by comparing the wq pointer address with alloc trace event).
Signed-off-by: Qu Wenruo <quenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Wed, 12 Mar 2014 01:28:24 +0000 (01:28 +0000)]
Btrfs: less fs tree lock contention when using autodefrag
When finding new extents during an autodefrag, don't do so many fs tree
lookups to find an extent with a size smaller then the target treshold.
Instead, after each fs tree forward search immediately unlock upper
levels and process the entire leaf while holding a read lock on the leaf,
since our leaf processing is very fast.
This reduces lock contention, allowing for higher concurrency when other
tasks want to write/update items related to other inodes in the fs tree,
as we're not holding read locks on upper tree levels while processing the
leaf and we do less tree searches.
Test:
sysbench --test=fileio --file-num=512 --file-total-size=16G \
--file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
--file-rw-ratio=3 --file-io-mode=sync --max-time=1800 \
--max-requests=
10000000000 [prepare|run]
(fileystem mounted with -o autodefrag, averages of 5 runs)
Before this change: 58.852Mb/sec throughtput, read 77.589Gb, written 25.863Gb
After this change: 63.034Mb/sec throughtput, read 83.102Gb, written 27.701Gb
Test machine: quad core intel i5-3570K, 32Gb of RAM, SSD.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Guangyu Sun [Tue, 11 Mar 2014 18:24:18 +0000 (11:24 -0700)]
Btrfs: return EPERM when deleting a default subvolume
The error message is confusing:
# btrfs sub delete /mnt/mysub/
Delete subvolume '/mnt/mysub'
ERROR: cannot delete '/mnt/mysub' - Directory not empty
The error message does not make sense to me: It's not about deleting a
directory but it's a subvolume, and it doesn't matter if the subvolume is
empty or not.
Maybe EPERM or is more appropriate in this case, combined with an explanatory
kernel log message. (e.g. "subvolume with ID 123 cannot be deleted because
it is configured as default subvolume.")
Reported-by: Koen De Wit <koen.de.wit@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Tue, 11 Mar 2014 14:31:44 +0000 (14:31 +0000)]
Btrfs: add missing kfree in btrfs_destroy_workqueue
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Filipe Manana [Tue, 11 Mar 2014 13:56:15 +0000 (13:56 +0000)]
Btrfs: cache extent states in defrag code path
When locking file ranges in the inode's io_tree, cache the first
extent state that belongs to the target range, so that when unlocking
the range we don't need to search in the io_tree again, reducing cpu
time and making and therefore holding the io_tree's lock for a shorter
period.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Fri, 7 Mar 2014 00:01:07 +0000 (19:01 -0500)]
Btrfs: fix deadlock with nested trans handles
Zach found this deadlock that would happen like this
btrfs_end_transaction <- reduce trans->use_count to 0
btrfs_run_delayed_refs
btrfs_cow_block
find_free_extent
btrfs_start_transaction <- increase trans->use_count to 1
allocate chunk
btrfs_end_transaction <- decrease trans->use_count to 0
btrfs_run_delayed_refs
lock tree block we are cowing above ^^
We need to only decrease trans->use_count if it is above 1, otherwise leave it
alone. This will make nested trans be the only ones who decrease their added
ref, and will let us get rid of the trans->use_count++ hack if we have to commit
the transaction. Thanks,
cc: stable@vger.kernel.org
Reported-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
Miao Xie [Thu, 6 Mar 2014 05:55:03 +0000 (13:55 +0800)]
Btrfs: fix possible empty list access when flushing the delalloc inodes
We didn't have a lock to protect the access to the delalloc inodes list, that is
we might access a empty delalloc inodes list if someone start flushing delalloc
inodes because the delalloc inodes were moved into a other list temporarily.
Fix it by wrapping the access with a lock.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Thu, 6 Mar 2014 05:55:02 +0000 (13:55 +0800)]
Btrfs: split the global ordered extents mutex
When we create a snapshot, we just need wait the ordered extents in
the source fs/file root, but because we use the global mutex to protect
this ordered extents list of the source fs/file root to avoid accessing
a empty list, if someone got the mutex to access the ordered extents list
of the other fs/file root, we had to wait.
This patch splits the above global mutex, now every fs/file root has
its own mutex to protect its own list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Thu, 6 Mar 2014 05:55:01 +0000 (13:55 +0800)]
Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
We needn't flush all delalloc inodes when we doesn't get s_umount lock,
or we would make the tasks wait for a long time.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Thu, 6 Mar 2014 05:55:00 +0000 (13:55 +0800)]
Btrfs: reclaim delalloc metadata more aggressively
generic/074 in xfstests failed sometimes because of the enospc error,
the reason of this problem is that we just reclaimed the space we need
from the reserved space for delalloc, and then tried to reserve the space,
but if some task did no-flush reservation between the above reclamation
and reservation,
Task1 Task2
shrink_delalloc()
reclaim 1 block
(The space that can
be reserved now is 1
block)
do no-flush reservation
reserve 1 block
(The space that can
be reserved now is 0
block)
reserving 1 block failed
the reservation of Task1 failed, but in fact, there was enough space to
reserve if we could reclaim more space before.
Fix this problem by the aggressive reclamation of the reserved delalloc
metadata space.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Thu, 6 Mar 2014 05:54:59 +0000 (13:54 +0800)]
Btrfs: remove unnecessary lock in may_commit_transaction()
The reason is:
- The per-cpu counter has its own lock to protect itself.
- Here we needn't get a exact value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Thu, 6 Mar 2014 05:54:58 +0000 (13:54 +0800)]
Btrfs: remove the unnecessary flush when preparing the pages
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Thu, 6 Mar 2014 05:54:57 +0000 (13:54 +0800)]
Btrfs: just do dirty page flush for the inode with compression before direct IO
As the comment in the btrfs_direct_IO says, only the compressed pages need be
flush again to make sure they are on the disk, but the common pages needn't,
so we add a if statement to check if the inode has compressed pages or not,
if no, skip the flush.
And in order to prevent the write ranges from intersecting, we need wait for
the running ordered extents. But the current code waits for them twice, one
is done before the direct IO starts (in btrfs_wait_ordered_range()), the other
is before we get the blocks, it is unnecessary. because we can do the direct
IO without holding i_mutex, it means that the intersected ordered extents may
happen during the direct IO, the first wait can not avoid this problem. So we
use filemap_fdatawrite_range() instead of btrfs_wait_ordered_range() to remove
the first wait.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Thu, 6 Mar 2014 05:54:56 +0000 (13:54 +0800)]
Btrfs: wake up the tasks that wait for the io earlier
The tasks that wait for the IO_DONE flag just care about the io of the dirty
pages, so it is better to wake up them immediately after all the pages are
written, not the whole process of the io completes.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Thu, 6 Mar 2014 05:54:55 +0000 (13:54 +0800)]
Btrfs: fix early enospc due to the race of the two ordered extent wait
btrfs_wait_ordered_roots() moves all the list entries to a new list,
and then deals with them one by one. But if the other task invokes this
function at that time, it would get a empty list. It makes the enospc
error happens more early. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Thu, 6 Mar 2014 05:38:19 +0000 (13:38 +0800)]
Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume
If the snapshot creation happened after the nocow write but before the dirty
data flush, we would fail to flush the dirty data because of no space.
So we must keep track of when those nocow write operations start and when they
end, if there are nocow writers, the snapshot creators must wait. In order
to implement this function, I introduce btrfs_{start, end}_nocow_write(),
which is similar to mnt_{want,drop}_write().
These two functions are only used for nocow file write operations.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Thu, 6 Mar 2014 04:19:50 +0000 (04:19 +0000)]
btrfs: Add ftrace for btrfs_workqueue
Add ftrace for btrfs_workqueue for further workqueue tunning.
This patch needs to applied after the workqueue replace patchset.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Thu, 6 Mar 2014 04:19:50 +0000 (04:19 +0000)]
btrfs: Cleanup the btrfs_workqueue related function type
The new btrfs_workqueue still use open-coded function defition,
this patch will change them into btrfs_func_t type which is much the
same as kernel workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Liu Bo [Wed, 5 Mar 2014 02:07:35 +0000 (10:07 +0800)]
Btrfs: add readahead for send_write
Btrfs send reads data from disk and then writes to a stream via pipe or
a file via flush.
Currently we're going to read each page a time, so every page results
in a disk read, which is not friendly to disks, esp. HDD. Given that,
the performance can be gained by adding readahead for those pages.
Here is a quick test:
$ btrfs subvolume create send
$ xfs_io -f -c "pwrite 0 1G" send/foobar
$ btrfs subvolume snap -r send ro
$ time "btrfs send ro -f /dev/null"
w/o w
real 1m37.527s 0m9.097s
user 0m0.122s 0m0.086s
sys 0m53.191s 0m12.857s
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Liu Bo [Mon, 3 Mar 2014 13:31:03 +0000 (21:31 +0800)]
Btrfs: share the same code for __record_{new,deleted}_ref
This has no functional change, only picks out the same part of two functions,
and makes it shared.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Mon, 3 Mar 2014 12:28:40 +0000 (12:28 +0000)]
Btrfs: avoid unnecessary utimes update in incremental send
When we're finishing processing of an inode, if we're dealing with a
directory inode that has a pending move/rename operation, we don't
need to send a utimes update instruction to the send stream, as we'll
do it later after doing the move/rename operation. Therefore we save
some time here building paths and doing btree lookups.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Sat, 1 Mar 2014 10:57:03 +0000 (10:57 +0000)]
Btrfs: make defrag not fragment files when using prealloc extents
When using prealloc extents, a file defragment operation may actually
fragment the file and increase the amount of data space used by the file.
This change fixes that behaviour.
Example:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt
$ cd /mnt
$ xfs_io -f -c 'falloc 0
1048576' foobar && sync
$ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar
$ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar
$ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync
Before defragmenting the file:
$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.25MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00
$ btrfs-debug-tree /dev/sdb3
(...)
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
prealloc data disk byte
12845056 nr
1048576
prealloc data offset 0 nr 4096
item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
extent data disk byte
12845056 nr
1048576
extent data offset 4096 nr 102400 ram
1048576
extent compression 0
item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
prealloc data disk byte
12845056 nr
1048576
prealloc data offset 106496 nr 90112
item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
extent data disk byte
12845056 nr
1048576
extent data offset 196608 nr 106496 ram
1048576
extent compression 0
item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
prealloc data disk byte
12845056 nr
1048576
prealloc data offset 303104 nr 593920
item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
extent data disk byte
12845056 nr
1048576
extent data offset 897024 nr 106496 ram
1048576
extent compression 0
item 12 key (257 EXTENT_DATA
1003520) itemoff 15492 itemsize 53
prealloc data disk byte
12845056 nr
1048576
prealloc data offset
1003520 nr 45056
(...)
Now defragmenting the file results in more data space used than before:
$ btrfs filesystem defragment -f foobar && sync
$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.55MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00
And the corresponding file extent items are now no longer perfectly sequential
as before, and we're now needlessly using more space from data block groups:
$ btrfs-debug-tree /dev/sdb3
(...)
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
extent data disk byte
12845056 nr
1048576
extent data offset 0 nr 4096 ram
1048576
extent compression 0
item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
extent data disk byte
13893632 nr 102400
extent data offset 0 nr 102400 ram 102400
extent compression 0
item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
extent data disk byte
12845056 nr
1048576
extent data offset 106496 nr 90112 ram
1048576
extent compression 0
item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
extent data disk byte
13996032 nr 106496
extent data offset 0 nr 106496 ram 106496
extent compression 0
item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
prealloc data disk byte
12845056 nr
1048576
prealloc data offset 303104 nr 593920
item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
extent data disk byte
14102528 nr 106496
extent data offset 0 nr 106496 ram 106496
extent compression 0
item 12 key (257 EXTENT_DATA
1003520) itemoff 15492 itemsize 53
extent data disk byte
12845056 nr
1048576
extent data offset
1003520 nr 45056 ram
1048576
extent compression 0
(...)
With this change, the above example will no longer cause allocation of new data
space nor change the sequentiality of the file extents, that is, defragment will
be effectless, leaving all extent items pointing to the extent starting at disk
byte
12845056.
In a 20Gb filesystem I had, mounted with the autodefrag option and 20 files of
400Mb each, initially consisting of a single prealloc extent of 400Mb, having
random writes happening at a low rate, lead to a total of over ~17Gb of data
space used, not far from eventually reaching an ENOSPC state.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Sat, 1 Mar 2014 10:55:54 +0000 (10:55 +0000)]
Btrfs: correctly flush data on defrag when compression is enabled
When the defrag flag BTRFS_DEFRAG_RANGE_START_IO is set and compression
enabled, we weren't flushing completely, as writing compressed extents
is a 2 steps process, one to compress the data and another one to write
the compressed data to disk.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:19 +0000 (10:46 +0800)]
btrfs: Cleanup the "_struct" suffix in btrfs_workequeue
Since the "_struct" suffix is mainly used for distinguish the differnt
btrfs_work between the original and the newly created one,
there is no need using the suffix since all btrfs_workers are changed
into btrfs_workqueue.
Also this patch fixed some codes whose code style is changed due to the
too long "_struct" suffix.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:18 +0000 (10:46 +0800)]
btrfs: Cleanup the old btrfs_worker.
Since all the btrfs_worker is replaced with the newly created
btrfs_workqueue, the old codes can be easily remove.
Signed-off-by: Quwenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:17 +0000 (10:46 +0800)]
btrfs: Replace fs_info->scrub_* workqueue with btrfs_workqueue.
Replace the fs_info->scrub_* with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:16 +0000 (10:46 +0800)]
btrfs: Replace fs_info->qgroup_rescan_worker workqueue with btrfs_workqueue.
Replace the fs_info->qgroup_rescan_worker with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:15 +0000 (10:46 +0800)]
btrfs: Replace fs_info->delayed_workers workqueue with btrfs_workqueue.
Replace the fs_info->delayed_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:14 +0000 (10:46 +0800)]
btrfs: Replace fs_info->fixup_workers workqueue with btrfs_workqueue.
Replace the fs_info->fixup_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:13 +0000 (10:46 +0800)]
btrfs: Replace fs_info->readahead_workers workqueue with btrfs_workqueue.
Replace the fs_info->readahead_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:12 +0000 (10:46 +0800)]
btrfs: Replace fs_info->cache_workers workqueue with btrfs_workqueue.
Replace the fs_info->cache_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:11 +0000 (10:46 +0800)]
btrfs: Replace fs_info->rmw_workers workqueue with btrfs_workqueue.
Replace the fs_info->rmw_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:10 +0000 (10:46 +0800)]
btrfs: Replace fs_info->endio_* workqueue with btrfs_workqueue.
Replace the fs_info->endio_* workqueues with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:09 +0000 (10:46 +0800)]
btrfs: Replace fs_info->flush_workers with btrfs_workqueue.
Replace the fs_info->submit_workers with the newly created
btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:08 +0000 (10:46 +0800)]
btrfs: Replace fs_info->submit_workers with btrfs_workqueue.
Much like the fs_info->workers, replace the fs_info->submit_workers
use the same btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:07 +0000 (10:46 +0800)]
btrfs: Replace fs_info->delalloc_workers with btrfs_workqueue
Much like the fs_info->workers, replace the fs_info->delalloc_workers
use the same btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:06 +0000 (10:46 +0800)]
btrfs: Replace fs_info->workers with btrfs_workqueue.
Use the newly created btrfs_workqueue_struct to replace the original
fs_info->workers
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:05 +0000 (10:46 +0800)]
btrfs: Add threshold workqueue based on kernel workqueue
The original btrfs_workers has thresholding functions to dynamically
create or destroy kthreads.
Though there is no such function in kernel workqueue because the worker
is not created manually, we can still use the workqueue_set_max_active
to simulated the behavior, mainly to achieve a better HDD performance by
setting a high threshold on submit_workers.
(Sadly, no resource can be saved)
So in this patch, extra workqueue pending counters are introduced to
dynamically change the max active of each btrfs_workqueue_struct, hoping
to restore the behavior of the original thresholding function.
Also, workqueue_set_max_active use a mutex to protect workqueue_struct,
which is not meant to be called too frequently, so a new interval
mechanism is applied, that will only call workqueue_set_max_active after
a count of work is queued. Hoping to balance both the random and
sequence performance on HDD.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:04 +0000 (10:46 +0800)]
btrfs: Add high priority workqueue support for btrfs_workqueue_struct
Add high priority function to btrfs_workqueue.
This is implemented by embedding a btrfs_workqueue into a
btrfs_workqueue and use some helper functions to differ the normal
priority wq and high priority wq.
So the high priority wq is completely independent from the normal
workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:03 +0000 (10:46 +0800)]
btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue
Use kernel workqueue to implement a new btrfs_workqueue_struct, which
has the ordering execution feature like the btrfs_worker.
The func is executed in a concurrency way, and the
ordred_func/ordered_free is executed in the sequence them are queued
after the corresponding func is done.
The new btrfs_workqueue works much like the original one, one workqueue
for normal work and a list for ordered work.
When a work is queued, ordered work will be added to the list and helper
function will be queued into the workqueue.
The helper function will execute a normal work and then check and execute as many
ordered work as possible in the sequence they were queued.
At this patch, high priority work queue or thresholding is not added yet.
The high priority feature and thresholding will be added in the following patches.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Qu Wenruo [Fri, 28 Feb 2014 02:46:02 +0000 (10:46 +0800)]
btrfs: Cleanup the unused struct async_sched.
The struct async_sched is not used by any codes and can be removed.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fusionio.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Liu Bo [Thu, 27 Feb 2014 09:29:01 +0000 (17:29 +0800)]
Btrfs: skip search tree for REG files
It is really unnecessary to search tree again for @gen, @mode and @rdev
in the case of REG inodes' creation, as we've got btrfs_inode_item in sctx,
and @gen, @mode and @rdev can easily be fetched.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Thu, 27 Feb 2014 05:58:05 +0000 (13:58 +0800)]
Btrfs: fix preallocate vs double nocow write
We can not release the reserved metadata space for the first write if we
find the write position is pre-allocated. Because the kernel might write
the data on the disk before we do the second write but after the can-nocow
check, if we release the space for the first write, we might fail to update
the metadata because of no space.
Fix this problem by end nocow write if there is dirty data in the range whose
space is pre-allocated.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Thu, 27 Feb 2014 05:58:04 +0000 (13:58 +0800)]
Btrfs: fix wrong lock range and write size in check_can_nocow()
The write range may not be sector-aligned, for example:
|--------|--------| <- write range, sector-unaligned, size: 2blocks
|--------|--------|--------| <- correct lock range, size: 3blocks
But according to the old code, we used the size of write range to calculate
the lock range directly, not considered the offset, we would get a wrong lock
range:
|--------|--------| <- write range, sector-unaligned, size: 2blocks
|--------|--------| <- wrong lock range, size: 2blocks
And besides that, the old code also had the same problem when calculating
the real write size. Correct them.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
David Sterba [Tue, 25 Feb 2014 18:33:08 +0000 (19:33 +0100)]
btrfs: send: simplify allocation code in fs_path_ensure_buf
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
David Sterba [Tue, 25 Feb 2014 18:32:59 +0000 (19:32 +0100)]
btrfs: send: fix old buffer length in fs_path_ensure_buf
In "btrfs: send: lower memory requirements in common case" the code to
save the old_buf_len was incorrectly moved to a wrong place and broke
the original logic.
Reported-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Tue, 25 Feb 2014 14:15:13 +0000 (14:15 +0000)]
Btrfs: more efficient btrfs_drop_extent_cache
While droping extent map structures from the extent cache that cover our
target range, we would remove each extent map structure from the red black
tree and then add either 1 or 2 new extent map structures if the former
extent map covered sections outside our target range.
This change simply attempts to replace the existing extent map structure
with a new one that covers the subsection we're not interested in, instead
of doing a red black remove operation followed by an insertion operation.
The number of elements in an inode's extent map tree can get very high for large
files under random writes. For example, while running the following test:
sysbench --test=fileio --file-num=1 --file-total-size=10G \
--file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
--max-requests=500000 --file-rw-ratio=2 [prepare|run]
I captured the following histogram capturing the number of extent_map items
in the red black tree while that test was running:
Count: 122462
Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000
1.000 - 5.231: 452 |
5.231 - 187.392: 87 |
187.392 - 585.911: 206 |
585.911 - 1827.438: 623 |
1827.438 - 5695.245: 1962 #
5695.245 - 17744.861: 6204 ####
17744.861 - 55283.764: 21115 ############
55283.764 - 172231.000: 91813 #####################################################
Benchmark:
sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
--num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
--file-io-mode=sync --file-fsync-freq=0 [prepare|run]
Before this change: 122.1Mb/sec
After this change: 125.07Mb/sec
(averages of 5 test runs)
Test machine: quad core intel i5-3570K, 32Gb of ram, SSD
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Wed, 12 Feb 2014 15:05:53 +0000 (15:05 +0000)]
Btrfs: more efficient split extent state insertion
When we split an extent state there's no need to start the rbtree search
from the root node - we can start it from the original extent state node,
since we would end up in its subtree if we do the search starting at the
root node anyway.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Tue, 25 Feb 2014 14:15:12 +0000 (14:15 +0000)]
Btrfs: remove unneeded field / smaller extent_map structure
We don't need to have an unsigned int field in the extent_map struct
to tell us whether the extent map is in the inode's extent_map tree or
not. We can use the rb_node struct field and the RB_CLEAR_NODE and
RB_EMPTY_NODE macros to achieve the same task.
This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a
64 bits system).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Wang Shilong [Thu, 13 Feb 2014 03:19:47 +0000 (11:19 +0800)]
Btrfs: skip locking when searching commit root
We won't change commit root, skip locking dance with commit root
when walking backrefs, this can speed up btrfs send operations.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Wang Shilong [Wed, 19 Feb 2014 11:24:19 +0000 (19:24 +0800)]
Btrfs: wake up @scrub_pause_wait as much as we can
check if @scrubs_running=@scrubs_paused condition inside wait_event()
is not an atomic operation which means we may inc/dec @scrub_running/
paused at any time. Let's wake up @scrub_pause_wait as much as we can
to let commit transaction blocked less.
An example below:
Thread1 Thread2
|->scrub_blocked_if_needed() |->scrub_pending_trans_workers_inc
|->increase @scrub_paused
|->increase @scrub_running
|->wake up scrub_pause_wait list
|->scrub blocked
|->increase @scrub_paused
Thread3 is commiting transaction which is blocked at btrfs_scrub_pause().
So after Thread2 increase @scrub_paused, we meet the condition
@scrub_paused=@scrub_running, but transaction will be still blocked until
another calling to wake up @scrub_pause_wait.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Wang Shilong [Wed, 19 Feb 2014 11:24:18 +0000 (19:24 +0800)]
Btrfs: cancel scrub on transaction abortion
If we fail to commit transaction, we'd better
cancel scrub operations.
Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Wang Shilong [Wed, 19 Feb 2014 11:24:17 +0000 (19:24 +0800)]
Btrfs: device_replace: fix deadlock for nocow case
commit
cb7ab02156e4 cause a following deadlock found by
xfstests,btrfs/011:
Thread1 is commiting transaction which is blocked at
btrfs_scrub_pause().
Thread2 is calling btrfs_file_aio_write() which has held
inode's @i_mutex and commit transaction(blocked because
Thread1 is committing transaction).
Thread3 is copy_nocow_page worker which will also try to
hold inode @i_mutex, so thread3 will wait Thread1 finished.
Thread4 is waiting pending workers finished which will wait
Thread3 finished. So the problem is like this:
Thread1--->Thread4--->Thread3--->Thread2---->Thread1
Deadlock happens! we fix it by letting Thread1 go firstly,
which means we won't block transaction commit while we are
waiting pending workers finished.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Wang Shilong [Wed, 19 Feb 2014 11:24:16 +0000 (19:24 +0800)]
Btrfs: fix a possible deadlock between scrub and transaction committing
btrfs_scrub_continue() will be called when cleaning up transaction.However,
this can only be called if btrfs_scrub_pause() is called before.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Sachin Kamat [Mon, 17 Feb 2014 03:43:57 +0000 (09:13 +0530)]
btrfs: Use PTR_ERR_OR_ZERO
PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it
also include missing err.h header.
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Fri, 21 Feb 2014 00:01:32 +0000 (00:01 +0000)]
Btrfs: fix send issuing outdated paths for utimes, chown and chmod
When doing an incremental send, if we had a directory pending a move/rename
operation and none of its parents, except for the immediate parent, were
pending a move/rename, after processing the directory's references, we would
be issuing utimes, chown and chmod intructions against am outdated path - a
path which matched the one in the parent root.
This change also simplifies a bit the code that deals with building a path
for a directory which has a move/rename operation delayed.
Steps to reproduce:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/d/e
$ mkdir /mnt/btrfs/a/b/c/f
$ chmod 0777 /mnt/btrfs/a/b/c/d/e
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
$ mv /mnt/btrfs/a/b/c/f /mnt/btrfs/a/b/f2
$ mv /mnt/btrfs/a/b/c/d/e /mnt/btrfs/a/b/f2/e2
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2
$ mv /mnt/btrfs/a/b/c2/d /mnt/btrfs/a/b/c2/d2
$ chmod 0700 /mnt/btrfs/a/b/f2/e2
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
$ umount /mnt/btrfs
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ btrfs receive /mnt/btrfs -f /tmp/base.send
$ btrfs receive /mnt/btrfs -f /tmp/incremental.send
The second btrfs receive command failed with:
ERROR: chmod a/b/c/d/e failed. No such file or directory
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Thu, 20 Feb 2014 21:15:25 +0000 (21:15 +0000)]
Btrfs: correctly determine if blocks are shared in btrfs_compare_trees
Just comparing the pointers (logical disk addresses) of the btree nodes is
not completely bullet proof, we have to check if their generation numbers
match too.
It is guaranteed that a COW operation will result in a block with a different
logical disk address than the original block's address, but over time we can
reuse that former logical disk address.
For example, creating a 2Gb filesystem on a loop device, and having a script
running in a loop always updating the access timestamp of a file, resulted in
the same logical disk address being reused for the same fs btree block in about
only 4 minutes.
This could make us skip entire subtrees when doing an incremental send (which
is currently the only user of btrfs_compare_trees). However the odds of getting
2 blocks at the same tree level, with the same logical disk address, equal first
slot keys and different generations, should hopefully be very low.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Wed, 19 Feb 2014 14:31:44 +0000 (14:31 +0000)]
Btrfs: fix send attempting to rmdir non-empty directories
The incremental send algorithm assumed that it was possible to issue
a directory remove (rmdir) if the the inode number it was currently
processing was greater than (or equal) to any inode that referenced
the directory's inode. This wasn't a valid assumption because any such
inode might be a child directory that is pending a move/rename operation,
because it was moved into a directory that has a higher inode number and
was moved/renamed too - in other words, the case the following commit
addressed:
9f03740a956d7ac6a1b8f8c455da6fa5cae11c22
(Btrfs: fix infinite path build loops in incremental send)
This made an incremental send issue an rmdir operation before the
target directory was actually empty, which made btrfs receive fail.
Therefore it needs to wait for all pending child directory inodes to
be moved/renamed before sending an rmdir operation.
Simple steps to reproduce this issue:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c/x
$ mkdir /mnt/btrfs/a/b/y
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
$ mv /mnt/btrfs/a/b/y /mnt/btrfs/a/b/YY
$ mv /mnt/btrfs/a/b/c/x /mnt/btrfs/a/b/YY
$ rmdir /mnt/btrfs/a/b/c
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
$ umount /mnt/btrfs
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ btrfs receive /mnt/btrfs -f /tmp/base.send
$ btrfs receive /mnt/btrfs -f /tmp/incremental.send
The second btrfs receive command failed with:
ERROR: rmdir o259-6-0 failed. Directory not empty
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Sun, 16 Feb 2014 21:01:39 +0000 (21:01 +0000)]
Btrfs: send, don't send rmdir for same target multiple times
When doing an incremental send, if we delete a directory that has N > 1
hardlinks for the same file and that file has the highest inode number
inside the directory contents, an incremental send would send N times an
rmdir operation against the directory. This made the btrfs receive command
fail on the second rmdir instruction, as the target directory didn't exist
anymore.
Steps to reproduce the issue:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b/c
$ echo 'ola mundo' > /mnt/btrfs/a/b/c/foo.txt
$ ln /mnt/btrfs/a/b/c/foo.txt /mnt/btrfs/a/b/c/bar.txt
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
$ rm -f /mnt/btrfs/a/b/c/foo.txt
$ rm -f /mnt/btrfs/a/b/c/bar.txt
$ rmdir /mnt/btrfs/a/b/c
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
$ umount /mnt/btrfs
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ btrfs receive /mnt/btrfs -f /tmp/base.send
$ btrfs receive /mnt/btrfs -f /tmp/incremental.send
The second btrfs receive command failed with:
ERROR: rmdir o259-6-0 failed. No such file or directory
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Filipe Manana [Sun, 16 Feb 2014 13:43:11 +0000 (13:43 +0000)]
Btrfs: incremental send, fix invalid path after dir rename
This fixes yet one more case not caught by the commit titled:
Btrfs: fix infinite path build loops in incremental send
In this case, even before the initial full send, we have a directory
which is a child of a directory with a higher inode number. Then we
perform the initial send, and after we rename both the child and the
parent, without moving them around. After doing these 2 renames, an
incremental send sent a rename instruction for the child directory
which contained an invalid "from" path (referenced the parent's old
name, not the new one), which made the btrfs receive command fail.
Steps to reproduce:
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ mkdir -p /mnt/btrfs/a/b
$ mkdir /mnt/btrfs/d
$ mkdir /mnt/btrfs/a/b/c
$ mv /mnt/btrfs/d /mnt/btrfs/a/b/c
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
$ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
$ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/x
$ mv /mnt/btrfs/a/b/x/d /mnt/btrfs/a/b/x/y
$ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
$ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
$ umout /mnt/btrfs
$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt/btrfs
$ btrfs receive /mnt/btrfs -f /tmp/base.send
$ btrfs receive /mnt/btrfs -f /tmp/incremental.send
The second btrfs receive command failed with:
"ERROR: rename a/b/c/d -> a/b/x/y failed. No such file or directory"
A test case for xfstests follows.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>