firefly-linux-kernel-4.4.55.git
11 years ago  rbd: set mapping read-only flag in rbd_add()
Alex Elder [Mon, 6 May 2013 22:40:33 +0000 (17:40 -0500)]
rbd: set mapping read-only flag in rbd_add()

The rbd_dev->mapping field for a parent image is not meaningful.
Since rbd_dev_image_probe() is used both for images being mapped and
their parents, it doesn't make sense to set that flag in that
function.

So move the setting of the mapping.read_only flag out of
rbd_dev_image_probe() and into rbd_add() instead.

This resolves:
    http://tracker.ceph.com/issues/4940

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: support reading parent page data
Alex Elder [Mon, 6 May 2013 22:40:33 +0000 (17:40 -0500)]
rbd: support reading parent page data

Currently, rbd_img_parent_read() assumes the incoming object request
contains bio data.  But if a layered image is part of a multi-layer
stack of images it will result in read requests of page data to parent
images.

Fortunately, it's not hard to add support for page data.

This resolves:
    http://tracker.ceph.com/issues/4939

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: fix an incorrect assertion condition
Alex Elder [Mon, 6 May 2013 22:40:32 +0000 (17:40 -0500)]
rbd: fix an incorrect assertion condition

In rbd_img_obj_parent_read_full_callback() there is an assertion
intended to verify the size of the image request for a full parent
read was the size of the original request's target object.  But the
assertion was looking at the parent image order rather than the
original one, and these values can differ.

Fix that.

This resolves:
    http://tracker.ceph.com/issues/4938

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: define rbd_dev_v2_header_info()
Alex Elder [Mon, 6 May 2013 14:51:30 +0000 (09:51 -0500)]
rbd: define rbd_dev_v2_header_info()

This rearranges rbd_dev_v2_refresh() so it works more like
rbd_dev_v1_header_info().  While format 1 images need to read the
whole header object to get any information, format 2 can collect
almost all information selectively.  So the one-time initialization
will remain in a separate function--based on rbd_dev_v2_probe().

Rename rbd_dev_v2_refresh() to be rbd_dev_v2_header_info(), and have
it call rbd_dev_v2_header_onetime() if it's being called for the
first time for the given rbd device.

Rename rbd_dev_v2_probe() to be rbd_dev_v2_header_onetime() and
remove the image size and snapshot context calls it held in
common with the refresh function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
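
The "call the one-time routine only on the first call" arrangement
described in this commit can be sketched in userspace C as follows.
This is a minimal illustration, not the kernel code: every demo_*
name, the struct layout, and the literal values are assumptions made
up for the example, and a null object_prefix is assumed to mean
"not yet initialized".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified header; not the kernel's rbd_image_header. */
struct demo_header {
    const char *object_prefix;  /* set once */
    uint8_t obj_order;          /* set once */
    uint64_t image_size;        /* may change on any refresh */
};

int demo_onetime_calls;         /* how often the one-time step ran */

/* Fill in the fields that never change for the life of the mapping. */
int demo_header_onetime(struct demo_header *h)
{
    h->object_prefix = "rbd_data.demo";  /* made-up value */
    h->obj_order = 22;                   /* made-up value */
    demo_onetime_calls++;
    return 0;
}

/* Collect header info; run the one-time step only on the first call. */
int demo_header_info(struct demo_header *h, uint64_t current_size)
{
    int ret;

    if (h->object_prefix == NULL) {
        ret = demo_header_onetime(h);
        if (ret)
            return ret;
    }
    assert(h->object_prefix != NULL);

    /* Stands in for fetching the fields that can change. */
    h->image_size = current_size;
    return 0;
}
```

Calling demo_header_info() twice on a zero-filled structure runs the
one-time step once and updates image_size both times.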
11 years ago  rbd: get rid of trivial v1 header wrappers
Alex Elder [Mon, 6 May 2013 14:51:30 +0000 (09:51 -0500)]
rbd: get rid of trivial v1 header wrappers

Get rid of the trivial wrapper functions rbd_dev_v1_refresh() and
rbd_dev_v1_probe(), substituting rbd_dev_v1_header_read() calls
in their place.

Rename rbd_dev_v1_header_read() to be rbd_dev_v1_header_info(), to
be more generic (it will better reflect what happens with format 2
images).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: simplify rbd_dev_v1_probe()
Alex Elder [Mon, 6 May 2013 14:51:30 +0000 (09:51 -0500)]
rbd: simplify rbd_dev_v1_probe()

An rbd_dev structure's fields are all zero-filled for an initial
probe, so there's no need to explicitly zero the parent_spec
and parent_overlap fields in rbd_dev_v1_probe().  Removing these
assignments makes rbd_dev_v1_probe() *almost* trivial.

Move the dout() message that announces discovery of an image into
rbd_dev_image_probe(), generalize to support images in either format
and only show it if an image is fully discovered.

This highlights that there are some unnecessary cleanups in the error
path for rbd_dev_v1_probe(), so they can be removed.

Now rbd_dev_v1_probe() *is* a trivial wrapper function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: update in-core header directly
Alex Elder [Mon, 6 May 2013 14:51:29 +0000 (09:51 -0500)]
rbd: update in-core header directly

Now that rbd_header_from_disk() only fills in one-time fields once,
we can extend it slightly so it releases the other fields before
replacing their values.  This way there's no need to pass a
temporary buffer and then copy all the results in.  Just use the rbd
device header structure in rbd_header_from_disk() so its values get
updated directly.

Note that this means we need to take the header semaphore at the
point we update things.  So pass the rbd_dev rather than the address
of its header as its first argument to rbd_header_from_disk(), and
have it return an error code.

As a result, rbd_dev_v1_header_read() does all the work,
rbd_read_header() becomes unnecessary, and rbd_dev_v1_refresh()
becomes a very simple wrapper.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: refactor rbd_header_from_disk()
Alex Elder [Mon, 6 May 2013 14:51:29 +0000 (09:51 -0500)]
rbd: refactor rbd_header_from_disk()

This rearranges rbd_header_from_disk() so that it:
    - allocates the snapshot context right away
    - keeps results in local variables, not changing the passed-in
      header until it's known we'll succeed
    - does initialization of set-once fields in a header only if
      they have not already been set

The last point is moot at the moment, because rbd_read_header()
(the only caller) always supplies a zero-filled header buffer.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
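
The three points listed in this commit (allocate first, keep results
in locals until success is certain, fill set-once fields only if they
are still unset) can be sketched in userspace C like this.  It is an
illustration under stated assumptions, not the kernel function: the
demo_* types, DEMO_MAX_SNAPS, and the convention that obj_order == 0
means "unset" are all hypothetical.  Freeing the previous snapshot id
array is an addition of this sketch so that it does not leak.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_SNAPS 1024u    /* arbitrary sanity limit for this sketch */

struct demo_ondisk {            /* hypothetical parsed on-disk header */
    uint8_t order;
    uint64_t image_size;
    uint32_t snap_count;
    const uint64_t *snap_ids;
};

struct demo_header {            /* hypothetical in-core header */
    uint8_t obj_order;          /* set once; 0 means "not set yet" */
    uint64_t image_size;
    uint32_t snap_count;
    uint64_t *snap_ids;
};

int demo_header_from_disk(struct demo_header *header,
                          const struct demo_ondisk *ondisk)
{
    uint64_t *snap_ids = NULL;  /* local result, not yet visible */
    size_t size;

    assert(header != NULL && ondisk != NULL);

    /* Validate and allocate right away; *header is untouched so far. */
    if (ondisk->snap_count > DEMO_MAX_SNAPS)
        return -EINVAL;
    if (ondisk->snap_count) {
        size = ondisk->snap_count * sizeof(*snap_ids);
        snap_ids = malloc(size);
        if (!snap_ids)
            return -ENOMEM;
        memcpy(snap_ids, ondisk->snap_ids, size);
    }

    /* Nothing can fail past this point, so commit the results. */
    if (!header->obj_order)
        header->obj_order = ondisk->order;  /* set-once field */
    free(header->snap_ids);                 /* free(NULL) is a no-op */
    header->snap_ids = snap_ids;
    header->snap_count = ondisk->snap_count;
    header->image_size = ondisk->image_size;
    return 0;
}
```

A failed call leaves the passed-in header exactly as it was, which is
the property the second point in the list is after.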
11 years ago  rbd: zero format 1 header structure earlier
Alex Elder [Mon, 6 May 2013 14:51:29 +0000 (09:51 -0500)]
rbd: zero format 1 header structure earlier

The passed-in header structure is zeroed in rbd_header_from_disk().
Instead, have the caller do it.  Note that there are two callers,
rbd_dev_v1_refresh() and rbd_dev_v1_probe().  The latter already has
a zeroed header structure so zeroing it isn't necessary there.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: set the mapping size and features later
Alex Elder [Mon, 6 May 2013 14:51:29 +0000 (09:51 -0500)]
rbd: set the mapping size and features later

Defer setting the size and features fields of a mapped image until
after the Linux disk structure is set up.  Set the capacity of the
disk after that.

Rearrange the definition of rbd_image_header, separating the fields
that are set only once from those that can be updated.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: always set read-only flag in rbd_add()
Alex Elder [Mon, 6 May 2013 12:40:30 +0000 (07:40 -0500)]
rbd: always set read-only flag in rbd_add()

Hold off setting the read-only flag in rbd_add() for an image being
mapped until we have successfully probed the image.  At that point
we know whether it's a snapshot mapping or not, so we can set the
read-only flag in that one place rather than doing so (for
snapshots) in rbd_dev_mapping_set().  To do this, pass a flag to the
image probe routine indicating whether we want a read-only mapping.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: kill rbd_dev_clear_mapping()
Alex Elder [Mon, 6 May 2013 12:40:30 +0000 (07:40 -0500)]
rbd: kill rbd_dev_clear_mapping()

This function is a duplicate of rbd_dev_mapping_clear(), and was
added by mistake.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: don't look up snapshot id in rbd_dev_mapping_set()
Alex Elder [Mon, 6 May 2013 12:40:30 +0000 (07:40 -0500)]
rbd: don't look up snapshot id in rbd_dev_mapping_set()

Currently rbd_dev_mapping_set() looks up the snapshot id for the
snapshot whose name is found in the rbd device's spec structure.

That function gets called by rbd_dev_device_setup(), which is
called by rbd_add() *after* rbd_dev_image_probe().  If the
image probe succeeds, the rbd device's spec will already have
been updated to include names and ids for all fields.

Therefore there's no need to look up the snapshot id in
rbd_dev_mapping_set().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: don't print warning if not mapping a parent
Alex Elder [Mon, 6 May 2013 12:40:30 +0000 (07:40 -0500)]
rbd: don't print warning if not mapping a parent

The presence of the LAYERING bit in an rbd image's feature mask does
not guarantee the image actually has a parent image.  Currently that
bit is set only when a clone (i.e., image with a parent) is created,
but it is (currently) not cleared if that clone gets flattened back
into a "normal" image.  A "parent_id" query will leave the
parent_spec for the image being mapped a null pointer, but will not
return an error.

Currently, whenever an image with the LAYERING feature gets mapped, a
warning about the use of layered images gets printed.  But we don't
want to do this for a flattened image, so print the warning only
if we find there is a parent spec after the probe.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
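
The difference between testing the feature bit and testing for an
actual parent can be sketched as below.  This is a userspace
illustration only: the demo_* names, the struct, and the bit position
chosen for DEMO_FEATURE_LAYERING are assumptions, not taken from the
driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FEATURE_LAYERING (1ULL << 0)   /* hypothetical bit position */

struct demo_image {
    uint64_t features;          /* feature mask read from the image */
    const char *parent_spec;    /* NULL when the probe found no parent */
};

/* Old behavior: warn whenever the feature bit is set. */
bool demo_warn_by_feature(const struct demo_image *img)
{
    assert(img != NULL);
    return (img->features & DEMO_FEATURE_LAYERING) != 0;
}

/* New behavior: warn only if the probe left a parent spec behind. */
bool demo_warn_by_parent(const struct demo_image *img)
{
    assert(img != NULL);
    return img->parent_spec != NULL;
}
```

For a flattened clone (bit still set, no parent spec) the first test
warns and the second does not; for a real clone both warn.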
11 years ago  rbd: kill rbd_update_mapping_size()
Alex Elder [Mon, 6 May 2013 12:40:30 +0000 (07:40 -0500)]
rbd: kill rbd_update_mapping_size()

Since rbd_update_mapping_size() is now a trivial wrapper, just open
code it in its two callers.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: update capacity in rbd_dev_refresh()
Alex Elder [Mon, 6 May 2013 12:40:30 +0000 (07:40 -0500)]
rbd: update capacity in rbd_dev_refresh()

When a mapped image changes size, we change the capacity recorded
for the Linux disk associated with it, in rbd_update_mapping_size().
That function is called in two places--the format 1 and format 2
refresh routines.

There is no need to set the capacity while holding the header
semaphore.  Instead, do it in the common rbd_dev_refresh(), using
the logic that's already there to initiate disk revalidation.

Add handling in the request function, just in case a request
that exceeds the capacity of the device comes in (perhaps one
that was started before a refresh shrunk the device).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
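
The kind of check the request function needs for a request that
exceeds the device capacity might look like the sketch below.  It is
a userspace illustration: demo_check_range() is a hypothetical name,
and the choice of -EINVAL for a wrapping range and -EIO for a range
past the end of the device is an assumption of this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Return 0 if [offset, offset + length) fits inside a device of
 * "capacity" bytes, or a negative errno value if it does not.
 */
int demo_check_range(uint64_t offset, uint64_t length, uint64_t capacity)
{
    /* Reject ranges whose end would wrap around 64 bits. */
    if (length > UINT64_MAX - offset)
        return -EINVAL;

    /* Reject ranges that run past the (possibly shrunken) device. */
    if (offset + length > capacity)
        return -EIO;

    return 0;
}
```

Doing the overflow test first means the addition in the second test
cannot wrap.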
11 years ago  rbd: revalidate only for mapping size changes
Alex Elder [Mon, 6 May 2013 12:40:30 +0000 (07:40 -0500)]
rbd: revalidate only for mapping size changes

This commit:
    d98df63e rbd: revalidate_disk upon rbd resize
instituted a call to revalidate_disk() to notify interested parties
that a mapped image has changed size.  This works well, as long as
the rbd device doesn't map a snapshot.

A snapshot will never change size.  However, the base image the
snapshot is associated with can, and it can do so while the snapshot
is mapped.

The problem is that the test for the size is looking at the size of
the base image, not the size of the mapped snapshot.  This patch
corrects that.

Update the warning message shown in the event of error, and move
it into the callers.

This resolves:
    http://tracker.ceph.com/issues/4911

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: fix leak of format 2 snapshot context
Alex Elder [Mon, 6 May 2013 13:37:00 +0000 (08:37 -0500)]
rbd: fix leak of format 2 snapshot context

When rbd_dev_v2_refresh() is called, the rbd device already has a
snapshot context associated with it.  But that never gets freed,
the pointer just gets overwritten.

Fix this by dropping the rbd device's reference to the snapshot
context before overwriting the pointer.

Because ceph_put_snap_context() already handles a null pointer
we don't need to check for that (for the probe case, where no
context has yet been assigned).

This resolves:
    http://tracker.ceph.com/issues/4912

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
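
The pattern behind this fix, dropping the old reference before
overwriting the pointer with a put routine that tolerates a null
pointer, can be sketched in userspace C as follows.  The demo_* names
and the plain integer reference count are assumptions made for the
example; the kernel's snapshot context code is not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct demo_snapc {
    int refs;                   /* simple, non-atomic reference count */
};

int demo_live_contexts;         /* contexts allocated and not yet freed */

struct demo_snapc *demo_snapc_create(void)
{
    struct demo_snapc *sc = calloc(1, sizeof(*sc));

    if (sc) {
        sc->refs = 1;
        demo_live_contexts++;
    }
    return sc;
}

/* Drop one reference; a null pointer is accepted and ignored. */
void demo_snapc_put(struct demo_snapc *sc)
{
    if (!sc)
        return;
    assert(sc->refs > 0);
    if (--sc->refs == 0) {
        free(sc);
        demo_live_contexts--;
    }
}

struct demo_dev {
    struct demo_snapc *snapc;   /* NULL until the first probe */
};

int demo_dev_refresh(struct demo_dev *dev)
{
    struct demo_snapc *fresh = demo_snapc_create();

    if (!fresh)
        return -ENOMEM;

    /* Without this put, the old context would leak on every refresh. */
    demo_snapc_put(dev->snapc);
    dev->snapc = fresh;
    return 0;
}
```

Because the put routine ignores a null pointer, the same refresh path
serves the initial probe without a special case.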
11 years ago  rbd: fix image request leak on parent read
Alex Elder [Thu, 2 May 2013 02:37:07 +0000 (21:37 -0500)]
rbd: fix image request leak on parent read

When a read for a layered image object finds the target object
doesn't exist, a read image request for the parent image is created
and submitted.  When that completes, the callback routine was
not releasing that parent image request.  Fix that.

The slab allocation stuff just added has greatly simplified the
search for the source of this memory leak.

This resolves:
    http://tracker.ceph.com/issues/4803

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  libceph: use slab cache for osd client requests
Alex Elder [Wed, 1 May 2013 17:43:04 +0000 (12:43 -0500)]
libceph: use slab cache for osd client requests

Create a slab cache to manage allocation of ceph_osdc_request
structures.

This resolves:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  libceph: allocate ceph message data with a slab allocator
Alex Elder [Wed, 1 May 2013 17:43:04 +0000 (12:43 -0500)]
libceph: allocate ceph message data with a slab allocator

Create a slab cache to manage ceph_msg_data structure allocation.

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  libceph: allocate ceph messages with a slab allocator
Alex Elder [Wed, 1 May 2013 17:43:04 +0000 (12:43 -0500)]
libceph: allocate ceph messages with a slab allocator

Create a slab cache to manage ceph_msg structure allocation.

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: allocate image object names with a slab allocator
Alex Elder [Wed, 1 May 2013 17:43:04 +0000 (12:43 -0500)]
rbd: allocate image object names with a slab allocator

The names of objects used for image object requests are always fixed
size.  So create a slab cache to manage them.  Define a new function
rbd_segment_name_free() to match rbd_segment_name() (which is what
supplies the dynamically-allocated name buffer).

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: allocate object requests with a slab allocator
Alex Elder [Wed, 1 May 2013 17:43:03 +0000 (12:43 -0500)]
rbd: allocate object requests with a slab allocator

Create a slab cache to manage rbd_obj_request allocation.  We aren't
using a constructor, and we'll zero-fill object request structures
when they're allocated.

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: allocate name separate from obj_request
Alex Elder [Wed, 1 May 2013 17:43:03 +0000 (12:43 -0500)]
rbd: allocate name separate from obj_request

The next patch will define a slab allocator for object requests.
To use that we'll need to allocate the name of an object separate
from the request structure itself.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: allocate image requests with a slab allocator
Alex Elder [Wed, 1 May 2013 17:43:03 +0000 (12:43 -0500)]
rbd: allocate image requests with a slab allocator

Create a slab cache to manage rbd_img_request allocation.  Nothing
too fancy at this point--we'll still initialize everything at
allocation time (no constructor).

This is part of:
    http://tracker.ceph.com/issues/3926

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: use binary search for snapshot lookup
Alex Elder [Wed, 1 May 2013 17:43:03 +0000 (12:43 -0500)]
rbd: use binary search for snapshot lookup

Use bsearch(3) to make snapshot lookup by id more efficient.  (There
could be thousands of snapshots, and conceivably many more.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
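
A binary search over an array of snapshot ids can be sketched with
the C library's bsearch() as below.  This is a userspace example
rather than the driver code.  It assumes the ids are kept sorted from
highest to lowest, which is why the comparator is reversed; that
ordering and all demo_* names are assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Comparator for an array sorted in descending order: a larger key
 * sorts earlier, so it must compare as "less than" the element.
 */
int demo_snapid_compare_reverse(const void *key, const void *elem)
{
    uint64_t a = *(const uint64_t *)key;
    uint64_t b = *(const uint64_t *)elem;

    if (a < b)
        return 1;
    if (a > b)
        return -1;
    return 0;
}

/* Return the index of snap_id in ids[], or -1 if it is not present. */
long demo_snap_index(const uint64_t *ids, size_t count, uint64_t snap_id)
{
    const uint64_t *found;

    if (count == 0)
        return -1;

    found = bsearch(&snap_id, ids, count, sizeof(*ids),
                    demo_snapid_compare_reverse);
    return found ? (long)(found - ids) : -1;
}
```

With thousands of snapshots this takes a number of comparisons
proportional to the logarithm of the count instead of the count
itself.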
11 years ago  rbd: clear EXISTS flag if mapped snapshot disappears
Alex Elder [Wed, 1 May 2013 17:43:03 +0000 (12:43 -0500)]
rbd: clear EXISTS flag if mapped snapshot disappears

This functionality inadvertently disappeared in the last patch.

Image snapshots can get removed at just about any time.  In
particular, a snapshot can disappear even if it is in use by an rbd
client as a mapped image.

The rbd client deals with such a disappearance by responding to new
requests with ENXIO.  This is implemented by each rbd device
maintaining an EXISTS flag, which is normally set but cleared if a
snapshot disappears.

This patch (re-)implements the clearing of that flag.

Whenever mapped image header information is refreshed, if the
mapping is for a snapshot, verify the mapped snapshot is still
present in the updated snapshot context.  If it is not, clear the
flag.

It is not necessary to check this in the initial probe, because the
probe will not succeed if the snapshot doesn't exist.

This resolves:
    http://tracker.ceph.com/issues/4880

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
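
The refresh-time check and the ENXIO response described in this
commit can be sketched in userspace C as follows.  The demo_* names,
the DEMO_NOSNAP marker value, and the use of a plain bool rather than
a flag bit are assumptions of the example; a linear scan is used here
only to keep it short.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NOSNAP UINT64_MAX  /* hypothetical marker: base image mapped */

struct demo_dev {
    uint64_t mapped_snap_id;    /* DEMO_NOSNAP unless a snapshot is mapped */
    bool exists;                /* stands in for the EXISTS flag */
};

/* After a header refresh, verify the mapped snapshot is still there. */
void demo_refresh_exists(struct demo_dev *dev,
                         const uint64_t *snap_ids, size_t count)
{
    size_t i;

    assert(dev != NULL);
    if (dev->mapped_snap_id == DEMO_NOSNAP)
        return;                 /* base image: nothing to verify */

    for (i = 0; i < count; i++)
        if (snap_ids[i] == dev->mapped_snap_id)
            return;             /* still in the snapshot context */

    dev->exists = false;        /* the mapped snapshot disappeared */
}

/* New requests fail with ENXIO once the flag has been cleared. */
int demo_submit(const struct demo_dev *dev)
{
    return dev->exists ? 0 : -ENXIO;
}
```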
11 years ago  rbd: kill off the snapshot list
Alex Elder [Tue, 30 Apr 2013 05:44:33 +0000 (00:44 -0500)]
rbd: kill off the snapshot list

We no longer use the snapshot list for anything.  When we need to
look up a snapshot name, id, size, or feature mask, we just do it
directly rather than relying on this list being updated with every
refresh.  The main reason it existed was for the benefit of the
device/sysfs entries that previously were associated with snapshots.

So get rid of the snapshot list, and struct rbd_snap, and the
hundreds of lines of code that supported them.

This resolves:
    http://tracker.ceph.com/issues/4868

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: define rbd_snap_size() and rbd_snap_features()
Alex Elder [Tue, 30 Apr 2013 05:44:33 +0000 (00:44 -0500)]
rbd: define rbd_snap_size() and rbd_snap_features()

This patch defines a handful of new functions that will allow
us to get rid of the rbd device structure's list of snapshots.

Define rbd_snap_id_by_name() to look up a snapshot id given its
name.  This is efficient for format 1 images but not for format 2.
Fortunately it only gets called at mapping time so it's not that
critical.

Use rbd_snap_id_by_name() to find out the id for a snapshot getting
mapped, and pass that id to new functions rbd_snap_size() and
rbd_snap_features() to look up information about a given snapshot's
size and feature mask given its snapshot id.  All this gets done
in rbd_dev_mapping_set().

As a result, snap_by_name() is no longer needed, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: use snap_id not index to look up snap info
Alex Elder [Tue, 30 Apr 2013 05:44:33 +0000 (00:44 -0500)]
rbd: use snap_id not index to look up snap info

In order to align with what was needed for format 1 rbd images,
rbd_dev_v2_snap_info() was set up to take as argument an index into
the array of snapshot ids in a rbd device's snapshot context.

This switches that around, so we pass the snapshot id instead.
In doing this, rbd_snap_name() now returns a dynamically-allocated
string rather than a fixed one, so there's no need to make a
duplicate in its caller, rbd_dev_spec_update().

This means the following functions take a snapshot id where they
previously used an index value:
    rbd_dev_snap_info()
    rbd_dev_v1_snap_info()
    rbd_dev_v2_snap_info()

A new function, rbd_dev_snap_index(), determines the snap index for
format 1 images and uses it to look up the name.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: look up snapshot name in names buffer
Alex Elder [Tue, 30 Apr 2013 05:44:33 +0000 (00:44 -0500)]
rbd: look up snapshot name in names buffer

Rather than scanning the list of snapshot structures for it, scan
the snapshot context buffer containing snapshot names in order to
determine for a format 1 image the name associated with a given
snapshot id.

Pull out the part of rbd_dev_v1_snap_info() that does this scan into
a new function, _rbd_dev_v1_snap_name().  Have that function return
a dynamically-allocated copy of the name, and don't duplicate it in
rbd_dev_v1_snap_info().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
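
Scanning a names buffer for the entry at a given position and
returning a dynamically-allocated copy can be sketched as below.
This userspace example assumes the buffer holds consecutive
NUL-terminated names in the same order as the snapshot ids; that
layout, the function name demo_snap_name(), and its parameters are
assumptions of the sketch, not the driver's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a malloc()ed copy of the name at position "which" in a
 * buffer of back-to-back NUL-terminated names, or NULL if there is
 * no such name or the allocation fails.  The caller frees the copy.
 */
char *demo_snap_name(const char *names, size_t names_len,
                     unsigned int which)
{
    const char *p = names;
    const char *end = names + names_len;
    const char *nul;
    char *copy;
    size_t len;

    assert(names != NULL);

    /* Skip over the names that precede the one we want. */
    while (which--) {
        if (p == end)
            return NULL;
        nul = memchr(p, '\0', (size_t)(end - p));
        if (!nul)
            return NULL;        /* unterminated trailing data */
        p = nul + 1;
    }

    if (p == end)
        return NULL;            /* ran out of names */
    nul = memchr(p, '\0', (size_t)(end - p));
    if (!nul)
        return NULL;

    len = (size_t)(nul - p) + 1;    /* include the terminator */
    copy = malloc(len);
    if (copy)
        memcpy(copy, p, len);
    return copy;
}
```

Returning a fresh copy is what lets the caller take ownership of the
name without duplicating it again.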
11 years ago  rbd: drop obj_request->version
Alex Elder [Tue, 30 Apr 2013 05:44:33 +0000 (00:44 -0500)]
rbd: drop obj_request->version

Nothing ever uses the version field maintained in the object request
structure any more, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: drop rbd_obj_method_sync() version parameter
Alex Elder [Tue, 30 Apr 2013 05:44:33 +0000 (00:44 -0500)]
rbd: drop rbd_obj_method_sync() version parameter

Only NULL is passed as the version argument to rbd_obj_method_sync(),
so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: more version parameter removal
Alex Elder [Tue, 30 Apr 2013 05:44:33 +0000 (00:44 -0500)]
rbd: more version parameter removal

Continued from the last patch, more parameters that can go away
because we no longer have a need to track object versions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: get rid of some version parameters
Alex Elder [Tue, 30 Apr 2013 05:44:33 +0000 (00:44 -0500)]
rbd: get rid of some version parameters

Several functions in rbd have parameters meant to allow the version
of an object to be passed in or out.  The purpose of those was to
allow the version of a header object to be maintained, but we no
longer do that.  As a result, these parameters are never actually
needed or used, so get rid of them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: stop tracking header object version
Alex Elder [Tue, 30 Apr 2013 05:44:32 +0000 (00:44 -0500)]
rbd: stop tracking header object version

The rbd code takes care to maintain the version of the header
object.  This was done in hopes of using it to detect a change in
the object between reading it and setting up a watch request to
be notified of changes.

The mechanism was never fully implemented, however.  And we now
avoid the original problem by setting up the watch request before
ever reading the content of the header.

The osd doesn't interpret the object version supplied with a WATCH
osd op, nor does it use the version supplied with a NOTIFY_ACK op
(we can just supply 0 for both).  There is therefore no need to
maintain the header's object version any more, so stop doing so.

We'll be able to simplify some more rbd code in the next few patches
as a result of this.

This resolves:
    http://tracker.ceph.com/issues/3952

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: snap names are pointer to constant data
Alex Elder [Tue, 30 Apr 2013 05:44:33 +0000 (00:44 -0500)]
rbd: snap names are pointer to constant data

Make explicit that snapshot names don't change by making functions
return and take parameters that point to const-qualified data.

This resolves:
    http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: don't revalidate so much
Alex Elder [Tue, 30 Apr 2013 05:44:32 +0000 (00:44 -0500)]
rbd: don't revalidate so much

Whenever a header object event causes a mapped rbd image to refresh
its header information, revalidate_disk() is being called.  This was
done in rbd_dev_refresh() outside the control mutex in order to
avoid a lock inversion.  Although an event like this *might*
indicate the image has changed size, most of the time it does not.

Record the image size before and after the refresh, and only
call revalidate_disk() if it changes.

This resolves:
    http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: fix up the layering warning message
Alex Elder [Tue, 30 Apr 2013 05:44:32 +0000 (00:44 -0500)]
rbd: fix up the layering warning message

A warning gets spewed for any image being probed, including parent
images.  Set up a condition such that the warning message only gets
printed for the image being mapped, not any of its parents.

Also, I didn't like the way the warning ended up being so long.
Make it a terse warning instead.  People experimenting with layering
will know what the message means.

This is part of:
    http://tracker.ceph.com/issues/4867

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  ceph: use ceph_create_snap_context()
Alex Elder [Tue, 30 Apr 2013 05:44:32 +0000 (00:44 -0500)]
ceph: use ceph_create_snap_context()

Now that we have a library routine to create snap contexts, use it.

This is part of:
    http://tracker.ceph.com/issues/4857

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  libceph: create source file "net/ceph/snapshot.c"
Alex Elder [Tue, 30 Apr 2013 05:44:32 +0000 (00:44 -0500)]
libceph: create source file "net/ceph/snapshot.c"

This creates a new source file "net/ceph/snapshot.c" to contain
utility routines related to ceph snapshot contexts.  The main
motivation was to define ceph_create_snap_context() as a common way
to create these structures, but I've moved the definitions of
ceph_get_snap_context() and ceph_put_snap_context() there too.
(The benefit of inlining those is very small, and I'd rather
keep this collection of functions together.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: set up devices only for mapped images
Alex Elder [Mon, 29 Apr 2013 04:32:34 +0000 (23:32 -0500)]
rbd: set up devices only for mapped images

Stop setting up Linux devices during the image probe operation.
Instead, set up the devices as a separate step after the image
probe, in rbd_add().

A consequence of this is that only mapped images get devices
assigned to them, which is pretty sweet.

This resolves:
    http://tracker.ceph.com/issues/4774

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: don't have device release destroy rbd_dev
Alex Elder [Mon, 29 Apr 2013 04:32:34 +0000 (23:32 -0500)]
rbd: don't have device release destroy rbd_dev

Currently an rbd_device structure gets destroyed from the release
routine for the device embedded within it.  Stop doing that, instead
calling rbd_dev_image_release() right after rbd_bus_del_dev()
wherever the latter is called.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: define rbd_dev_unprobe()
Alex Elder [Mon, 29 Apr 2013 04:32:34 +0000 (23:32 -0500)]
rbd: define rbd_dev_unprobe()

Define a new function rbd_dev_unprobe() which undoes state changes
that occur from calling rbd_dev_v1_probe() or rbd_dev_v2_probe().
Note that this is a superset of rbd_header_free(), which is now
getting removed (it seems to have been used improperly anyway).

Flesh out rbd_dev_image_release() so it undoes exactly what
rbd_dev_image_probe() does.

This means that:
    - rbd_dev_device_release() gets called when the last device
      reference gets dropped;
    - that undoes everything done by the rbd_dev_device_setup() call
      at the end of rbd_dev_image_probe() (and nothing more), ending
      by calling rbd_dev_image_release(); and
    - rbd_dev_image_release() undoes everything else done by
      rbd_dev_image_probe() (and this includes a call to
      rbd_dev_unprobe()).

This means the image and device portions of an rbd device are fairly
cleanly separated now, so error paths should be a little easier to
verify than they used to be.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: don't destroy rbd_dev in device release function
Alex Elder [Mon, 29 Apr 2013 04:32:34 +0000 (23:32 -0500)]
rbd: don't destroy rbd_dev in device release function

Rename rbd_dev_probe_finish() to be rbd_dev_device_setup().  Its
purpose is to set up the Linux side of an rbd device mapping.
Rename rbd_dev_release() to be rbd_dev_device_release(), making
it more obvious it serves as the inverse of the setup function
(or it will).

Encapsulate some of what was done in rbd_dev_release() into a new
function rbd_dev_image_release(), which serves as the inverse of
setting up the ceph side of the mapped rbd image.

Define a new helper rbd_dev_clear_mapping() to simply zero out the
fields of a mapping structure--the inverse of rbd_dev_set_mapping().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: drop module later
Alex Elder [Mon, 29 Apr 2013 04:32:34 +0000 (23:32 -0500)]
rbd: drop module later

Drop the module reference at the end of rbd_remove() for symmetry
with adding a reference at the top of rbd_add().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: set up watch in rbd_dev_image_probe()
Alex Elder [Sat, 27 Apr 2013 14:59:31 +0000 (09:59 -0500)]
rbd: set up watch in rbd_dev_image_probe()

Move setting up the watch request for an image so it's done in
rbd_dev_image_probe() rather than rbd_dev_probe_finish().  Move
it all the way up to before doing the initial probe.  This avoids
a potential race condition, in which we get (and use) the initial
snapshot context for an image, and it gets changed between that
time and the time we get the watch set up.

This resolves:
    http://tracker.ceph.com/issues/3871

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: don't bother checking whether order changes
Alex Elder [Sat, 27 Apr 2013 14:59:31 +0000 (09:59 -0500)]
rbd: don't bother checking whether order changes

When a format 2 image is refreshed, code is in place to verify that
the object order never changes from what it was originally.  This
relies on the fact that the refresh will occur *after* an initial
load of information about the image.

An upcoming patch makes it possible for the refresh to occur first,
so we can no longer make this order check.  The order really can't
ever change anyway--this was just a sanity check.  So get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago  rbd: don't clean up watch in device release function
Alex Elder [Sat, 27 Apr 2013 14:59:30 +0000 (09:59 -0500)]
rbd: don't clean up watch in device release function

Currently, a watch on an rbd device header object gets torn down
when its final Linux device reference gets dropped.  Instead, tear
it down when removing the device.  If an error occurs cleaning up
the watch event when unmapping, abort the unmap request.

All images (including parents) still get watch requests set up, so
tear these down also, in rbd_dev_remove_parent().  For now, ignore
any errors that occur in this case.

Get rid of local variable "rc" in rbd_remove(); use "ret" instead
(they both somehow ended up defined in the function and only one is
needed).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: define rbd_header_name()
Alex Elder [Sat, 27 Apr 2013 14:59:30 +0000 (09:59 -0500)]
rbd: define rbd_header_name()

Define a new function rbd_header_name(), which allocates and formats
the name of the header object for the rbd device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: move more initialization into rbd_dev_image_probe()
Alex Elder [Sat, 27 Apr 2013 14:59:30 +0000 (09:59 -0500)]
rbd: move more initialization into rbd_dev_image_probe()

Move a block of initialization related to the "ceph-side" of an rbd
image out of rbd_dev_probe_finish() and into rbd_dev_image_probe().

Add appropriate error handling to clean things up in the event any
of these new functions return an error.

We know that rbd_dev_snaps_update(), rbd_dev_spec_update(), and
rbd_dev_probe_parent() all clean up after themselves before they
return an error, so no special cleanup is required except when an
earlier call succeeds.  Since rbd_dev_spec_update() only updates the
spec field (whose cleanup will be handled by dropping the last
reference to the spec) there is no cleanup action associated with
that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: probe for the parent earlier
Alex Elder [Fri, 26 Apr 2013 20:44:37 +0000 (15:44 -0500)]
rbd: probe for the parent earlier

Probe for a parent device earlier in rbd_dev_probe_finish(), before
starting to set up the Linux side of the rbd device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: remove parent devices on probe error
Alex Elder [Fri, 26 Apr 2013 20:44:36 +0000 (15:44 -0500)]
rbd: remove parent devices on probe error

When an error occurs while finishing probing a device it is assumed
that parent devices get cleaned up when deleting a device.  They
don't.  Add a call to clean them up.  Note that this means the
parent spec will already be cleaned up so it doesn't have to be
in one of the rbd_add() error paths.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: fix rbd_dev_remove_parent()
Alex Elder [Fri, 26 Apr 2013 20:44:36 +0000 (15:44 -0500)]
rbd: fix rbd_dev_remove_parent()

In certain error paths, it is possible for an rbd device to have a
parent spec but no parent rbd_dev.  In rbd_dev_remove_parent() use
the parent field rather than parent_spec in determining whether to
try to remove any parent devices.  Use assertions to indicate that
any non-null parent pointer has parent_spec associated with it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: kill __rbd_remove()
Alex Elder [Fri, 26 Apr 2013 20:44:36 +0000 (15:44 -0500)]
rbd: kill __rbd_remove()

The function __rbd_remove() is used in two spots, and it's fairly
simple.  It combines cleanup of part of the ceph-side state as well
as cleaning up the Linux-side state.  Just open code it in the two
callers and eliminate the function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: set mapping info earlier
Alex Elder [Sat, 27 Apr 2013 14:59:30 +0000 (09:59 -0500)]
rbd: set mapping info earlier

Set the mapping size and features earlier in rbd_dev_probe_finish().

Define rbd_dev_mapping_clear() as an inverse for setting those
fields, and use it both in error handling in rbd_dev_image_probe()
and in the final cleanup in rbd_dev_release().  Change the name
of rbd_dev_set_mapping() to rbd_dev_mapping_set().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: encapsulate removing parent devices
Alex Elder [Fri, 26 Apr 2013 20:44:36 +0000 (15:44 -0500)]
rbd: encapsulate removing parent devices

Encapsulate the code that removes an rbd device's parent images into
a new function, rbd_dev_remove_parent().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: encapsulate probing for parent devices
Alex Elder [Fri, 26 Apr 2013 20:44:36 +0000 (15:44 -0500)]
rbd: encapsulate probing for parent devices

Encapsulate the code that probes for an rbd device's parent images
into a new function, rbd_dev_probe_parent().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: defer setting disk capacity
Alex Elder [Fri, 26 Apr 2013 20:44:36 +0000 (15:44 -0500)]
rbd: defer setting disk capacity

Don't set the disk capacity until right before we announce the
device as available for use.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: only set device exists flag when ready
Alex Elder [Fri, 26 Apr 2013 20:44:36 +0000 (15:44 -0500)]
rbd: only set device exists flag when ready

Hold off setting the EXISTS rbd device flag until just before we
announce the disk as available for use.  There's no point in doing
so any earlier than that, and at that point the device truly is
fully set up and ready to use.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: fix up some sysfs stuff
Alex Elder [Fri, 26 Apr 2013 20:44:36 +0000 (15:44 -0500)]
rbd: fix up some sysfs stuff

This just tweaks a few things in the routines that implement
rbd sysfs files.

All of the entries for an rbd device in /sys/bus/rbd/devices/<id>/
will represent information whose valid values are known by the time
they are accessible.

Right now we get the size of the mapped image by a call to
get_capacity().  There's no need to do this, because that will
return what we last set the capacity to, which is just the size
recorded for the mapping.  So just show that value instead.

We also get this under protection of the header semaphore, in order
to provide a precisely correct value.  This isn't really necessary;
these files are really informational only and it's not necessary to
be so careful.

Finally, print a special value in case the major device number is
not recorded.  Right now that won't matter much but soon the parent
images won't have devices associated with them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: fix a bug in resizing a mapping
Alex Elder [Fri, 26 Apr 2013 20:44:35 +0000 (15:44 -0500)]
rbd: fix a bug in resizing a mapping

When a snapshot context update occurs, rbd_update_mapping_size() is
called to set the capacity of the disk to record the updated
size of the image in case it has changed.

There's a bug though.  The mapping size is in units of *bytes*.  The
code that updates the mapping size field is assigning a value that
has been scaled down to *sectors*.

Fix that.  Also, check to see if the size has actually changed, and
don't bother updating things (specifically, calling set_capacity())
if it has not.
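
The unit mix-up described above can be sketched in a small userspace
model.  This is only an illustration, not the driver code: struct
mapping_model and mapping_size_update() are hypothetical names, and
512-byte sectors are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* assumed: 512-byte sectors */

/* Hypothetical stand-in for the mapping and disk state of an rbd device. */
struct mapping_model {
	uint64_t size;			/* mapping size, in bytes */
	uint64_t capacity;		/* disk capacity, in sectors */
	unsigned int capacity_updates;	/* times the capacity was set */
};

/*
 * Record a new image size given in bytes.  The mapping size stays in
 * bytes; only the value handed to the disk is scaled down to sectors.
 * Returns true if anything was updated.
 */
static bool mapping_size_update(struct mapping_model *m, uint64_t image_size)
{
	assert(m);

	if (m->size == image_size)
		return false;	/* unchanged: leave the capacity alone */

	m->size = image_size;
	m->capacity = image_size >> SECTOR_SHIFT;
	m->capacity_updates++;

	return true;
}
```

In this model a second update with the same size returns false and
does not touch the capacity.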

This resolves:
    http://tracker.ceph.com/issues/4833

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: refactor rbd_dev_probe_update_spec()
Alex Elder [Fri, 26 Apr 2013 14:43:48 +0000 (09:43 -0500)]
rbd: refactor rbd_dev_probe_update_spec()

Fairly straightforward refactoring of rbd_dev_probe_update_spec().
The name is changed to rbd_dev_spec_update().

Rearrange it so nothing gets assigned to the spec until all of the
names have been successfully acquired.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: rename rbd_dev_probe()
Alex Elder [Fri, 26 Apr 2013 14:43:48 +0000 (09:43 -0500)]
rbd: rename rbd_dev_probe()

Rename rbd_dev_probe() to be rbd_dev_image_probe().  Its purpose
will eventually be to probe for the existence of a valid rbd image
for the rbd device--focusing only on the ceph side and not the Linux
device side of initialization.

For now the two "sides" are not fully separated, and this function
is still the entry point for initializing the full rbd device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: make rbd_dev_destroy() match rbd_dev_create()
Alex Elder [Fri, 26 Apr 2013 14:43:47 +0000 (09:43 -0500)]
rbd: make rbd_dev_destroy() match rbd_dev_create()

Currently, rbd_dev_destroy() does more than just the inverse of what
rbd_dev_create() does.  Stop doing that, and move the two extra
things it does into the three call sites.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: define rbd snap context routines
Alex Elder [Fri, 26 Apr 2013 14:43:47 +0000 (09:43 -0500)]
rbd: define rbd snap context routines

Encapsulate the creation of a snapshot context for rbd in a new
function rbd_snap_context_create().  Define rbd wrappers for getting
and dropping references to them once they're created.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: use rbd_warn(), not WARN_ON()
Alex Elder [Fri, 26 Apr 2013 14:43:47 +0000 (09:43 -0500)]
rbd: use rbd_warn(), not WARN_ON()

Change some calls to WARN_ON() so they use rbd_warn() instead, so we
get consistent messaging.  A few remain but they can probably just
go away eventually.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: move stripe_unit and stripe_count into header
Alex Elder [Fri, 26 Apr 2013 14:43:47 +0000 (09:43 -0500)]
rbd: move stripe_unit and stripe_count into header

This commit added fetching of fancy striping parameters:
    09186ddb rbd: get and check striping parameters

They are almost unused, but the two fields storing the information
really belonged in the rbd_image_header structure.

This patch moves them there.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: make rbd spec names pointer to const
Alex Elder [Fri, 26 Apr 2013 14:43:47 +0000 (09:43 -0500)]
rbd: make rbd spec names pointer to const

Make the names and image id in an rbd_spec be pointers to constant
data.  This required the use of a local variable to hold the
snapshot name in rbd_add_parse_args() to avoid a warning.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: set snapshot id in rbd_dev_probe_update_spec()
Alex Elder [Fri, 26 Apr 2013 04:15:08 +0000 (23:15 -0500)]
rbd: set snapshot id in rbd_dev_probe_update_spec()

Set the rbd spec's snapshot id for an image getting mapped in
rbd_dev_probe_update_spec() rather than rbd_dev_set_mapping().
This is the more logical place for that to happen (even though
it means we might look up the snapshot by name twice).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: have snap_by_name() return a snapshot
Alex Elder [Fri, 26 Apr 2013 04:15:08 +0000 (23:15 -0500)]
rbd: have snap_by_name() return a snapshot

A function called snap_by_name() ought to just look up a snapshot by
name.  It does that, but then it assigns some stuff to the rbd
device structure as well.

Change the function to do just the lookup, and have the caller do
the assignments that follow.
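
A minimal sketch of a lookup-only helper, assuming a singly linked
list of snapshots.  struct snap_model and snap_by_name_model() are
hypothetical and only mirror the shape of the change; the caller is
left to copy whatever fields it needs from the returned entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical snapshot list entry. */
struct snap_model {
	const char *name;
	uint64_t id;
	struct snap_model *next;
};

/*
 * Look up a snapshot by name and do nothing else.  Returns the
 * matching entry, or NULL if no snapshot has that name.
 */
static struct snap_model *snap_by_name_model(struct snap_model *head,
					     const char *name)
{
	struct snap_model *snap;

	assert(name);

	for (snap = head; snap; snap = snap->next)
		if (!strcmp(snap->name, name))
			return snap;

	return NULL;
}
```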

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: fix image id leak in initial probe
Alex Elder [Fri, 26 Apr 2013 04:15:08 +0000 (23:15 -0500)]
rbd: fix image id leak in initial probe

If a format 2 image id is found for an image being mapped, but the
subsequent probe of the image fails, rbd_dev_probe() quits without
freeing the image id.  Fix that.

Also drop a redundant hunk of code in rbd_dev_image_id().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: have rbd_dev_image_id() set format 1 image id
Alex Elder [Fri, 26 Apr 2013 04:15:08 +0000 (23:15 -0500)]
rbd: have rbd_dev_image_id() set format 1 image id

Currently, rbd_dev_probe() assumes that any error returned by
rbd_dev_image_id() is most likely -ENOENT, and responds by
calling the format 1 probe routine, rbd_dev_v1_probe().  Then,
at the top of rbd_dev_v1_probe(), an empty string is allocated
for the image id.

This is sort of unbalanced.  Fix this by having rbd_dev_image_id()
look for -ENOENT from its "get_id" method call.  If that is seen,
have it allocate the empty string there rather than depending on
rbd_dev_v1_probe() to do it.

Given that this is effectively defining the format of the image,
set rbd_dev->image_format inside rbd_dev_image_id() rather than in
the format-specific probe routines.

Also drop a redundant hunk of code in rbd_dev_image_id().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: avoid dropping extra reference in rbd_free_disk()
Alex Elder [Fri, 26 Apr 2013 04:15:08 +0000 (23:15 -0500)]
rbd: avoid dropping extra reference in rbd_free_disk()

I found during some failure injection testing that the call to
rbd_free_disk() in the error path of rbd_dev_probe_finish() was
dropping an extra reference to the disk queue.  The problem
occurred when put_disk tried to drop a reference to the disk's
queue.  A call to blk_cleanup_queue() just prior to that will have
also dropped a reference to the queue.

The problem is that the reference dropped by put_disk() is assumed
to have been taken by add_disk().  Our code has error paths that can
occur after the disk and its queue are initialized, but before the
call to add_disk(), and in those paths we won't have that extra
reference.

The fix is easy though.  In rbd_free_disk() we're already checking
the disk's GENHD_FL_UP flag.  That flag is an indication that
add_disk() has been called, so just call blk_cleanup_queue()
conditional on that flag being set.
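
The reasoning above can be illustrated with a toy reference-count
model.  The counts are assumptions chosen to mirror the description
(one reference from initialization, one more taken when the disk is
added); this is not the block layer's real accounting, and all names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a disk and the reference count on its queue. */
struct disk_model {
	bool up;	/* stands in for the GENHD_FL_UP flag */
	int queue_refs;	/* references held on the queue */
};

/* Initializing the disk and queue leaves one queue reference. */
static void disk_model_init(struct disk_model *d)
{
	d->up = false;
	d->queue_refs = 1;
}

/* Adding the disk marks it up and takes the extra queue reference. */
static void disk_model_add(struct disk_model *d)
{
	d->up = true;
	d->queue_refs++;
}

/* Free the disk, cleaning up the queue only if the disk was added. */
static void disk_model_free(struct disk_model *d)
{
	if (d->up) {
		d->up = false;		/* models del_gendisk() */
		d->queue_refs--;	/* models blk_cleanup_queue() */
	}
	d->queue_refs--;		/* models put_disk() */

	assert(d->queue_refs >= 0);	/* no extra reference dropped */
}
```

Both paths end with the count at zero: the error path that never
added the disk drops one reference, the normal path drops two.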

This resolves:
    http://tracker.ceph.com/issues/4800

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: use rbd_obj_method_sync() return value
Alex Elder [Thu, 25 Apr 2013 20:09:42 +0000 (15:09 -0500)]
rbd: use rbd_obj_method_sync() return value

Now that rbd_obj_method_sync() returns the number of bytes
returned by the method call, that value should be used by
callers to ensure we don't overrun the valid portion of the
buffer.

Fix the two spots that remained that weren't doing that,
rbd_dev_image_name() and rbd_dev_v2_snap_name().

Rearrange the error path slightly in rbd_dev_v2_snap_name().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: fix leak of format 2 snapshot names
Alex Elder [Thu, 25 Apr 2013 20:09:42 +0000 (15:09 -0500)]
rbd: fix leak of format 2 snapshot names

When the snapshot context for an rbd device gets updated (or the
initial one is recorded) a list of snapshot structures is created
to represent them, one entry per snapshot.  Each entry includes a
dynamically-allocated copy of the snapshot name.

Currently the name is allocated in rbd_snap_create(), as a duplicate
of the passed-in name.

For format 1 images, the snapshot name provided is just a pointer to
an existing name.  But for format 2 images, the passed-in name is
already dynamically allocated, and in the process of duplicating
it here we are leaking the passed-in name.

Fix this by dynamically allocating the name for format 1 snapshots
also, and then stop allocating a duplicate in rbd_snap_create().

Change rbd_dev_v1_snap_info() so none of its parameters is
side-effected unless it's going to return success.

This is part of:
    http://tracker.ceph.com/issues/4803

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: rename __rbd_add_snap_dev()
Alex Elder [Thu, 25 Apr 2013 20:09:41 +0000 (15:09 -0500)]
rbd: rename __rbd_add_snap_dev()

Rename __rbd_add_snap_dev() to be rbd_snap_create().  We no longer
have devices for non-mapped snapshots, and we're not actually
"adding" it to the list in this function, just creating it.

Rename rbd_remove_snap_dev() to be rbd_snap_destroy() for reasons
similar to the above.  Stop having this function delete the snapshot
from its list (to be symmetrical with its create counterpart) and do
that in the caller instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: only update values on snap_info success
Alex Elder [Thu, 25 Apr 2013 20:09:41 +0000 (15:09 -0500)]
rbd: only update values on snap_info success

Change rbd_dev_v2_snap_info() so it only ever sets values of the
size and features parameters if looking up the snapshot name was
successful.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: make snap_size order parameter optional
Alex Elder [Thu, 25 Apr 2013 20:09:41 +0000 (15:09 -0500)]
rbd: make snap_size order parameter optional

Only one of the two callers of _rbd_dev_v2_snap_size() needs the
order value returned.  So make that an optional argument--a null
pointer if the caller doesn't need it.
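
The optional-argument convention looks roughly like this.  The
sketch is a userspace illustration only: struct snap_size_reply and
snap_size_model() are hypothetical, and the reply is handed in
directly rather than fetched from an osd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded reply carrying an object order and a size. */
struct snap_size_reply {
	uint8_t order;
	uint64_t size;
};

/*
 * Report the snapshot size, and the order too if the caller asked
 * for it.  A null order pointer means the caller does not need it.
 */
static int snap_size_model(const struct snap_size_reply *reply,
			   uint8_t *order, uint64_t *snap_size)
{
	assert(reply && snap_size);

	if (order)
		*order = reply->order;
	*snap_size = reply->size;

	return 0;
}
```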

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: fix leak of snapshots during initial probe
Alex Elder [Thu, 25 Apr 2013 20:09:41 +0000 (15:09 -0500)]
rbd: fix leak of snapshots during initial probe

When an rbd image is initially mapped, its snapshot context is
collected, and then a list of snapshot entries representing the
snapshots in that context is created.  The list is created using
rbd_dev_snaps_update().  (This function also supports updating an
existing snapshot list based on a new snapshot context.)

If an error occurs, updating the list is aborted, and the list is
currently left as-is, in an inconsistent state.  At that point,
there may be a partially-constructed list, but the calling functions
(rbd_dev_probe_finish() from rbd_dev_probe() from rbd_add()) never
clean them up.  So this constitutes a leak.

A snapshot list that is inconsistent with the current snapshot
context is of no use, and might even be actively bad.  So rather
than just having the caller clean it up, have rbd_dev_snaps_update()
just clear out the entire snapshot list in the event an error
occurs.

The other place rbd_dev_snaps_update() is used is when a refresh is
triggered, either because of a watch callback or via a write to the
/sys/bus/rbd/devices/<id>/refresh interface.  An error while
updating the snapshots has no substantive effect in either of those
cases, but one of them issues a warning.  Move that warning to the
common rbd_dev_refresh() function so it gets issued regardless of
how it got initiated.

This is part of:
    http://tracker.ceph.com/issues/4803

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: don't create sysfs entries for non-mapped snapshots
Alex Elder [Tue, 23 Apr 2013 18:52:53 +0000 (13:52 -0500)]
rbd: don't create sysfs entries for non-mapped snapshots

When an rbd image gets mapped a device entry gets created for it
under /sys/bus/rbd/devices/<id>/.  Inside that directory there are
sysfs files that contain information about the image: its size,
feature bits, major device number, and so on.

Additionally, if that image has any snapshots, a device entry gets
created for each of those as a "child" of the mapped device.  Each
of these is a subdirectory of the mapped device, and each directory
contains a few files with information about the snapshot (its
snapshot id, size, and feature mask).

There is no clear benefit to having those device entries for the
snapshots.  The information provided via sysfs is of little real
value--and all of it is available via rbd CLI commands.  If we
still wanted to see the kernel's view of this information it could
be done much more simply by including it in a single sysfs file for
the mapped image.

But there *is* a clear cost to supporting them.  Every time a snapshot
context changes, these entries need to be updated (deleted snapshots
removed, new snapshots created).  The rbd driver is notified of
changes to the snapshot context via callbacks from an osd, and care
must be taken to coordinate removal of snapshot data structures
with the possibility of one of these notifications occurring.

Things would be considerably simpler if we just didn't have to
maintain device entries for the snapshots.

So get rid of them.

The ability to map a snapshot of an rbd image will remain; the only
thing lost will be the ability to query these sysfs directories for
information about snapshots of mapped images.

This resolves:
    http://tracker.ceph.com/issues/4796

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agolibceph: fix byte order mismatch
Alex Elder [Sun, 21 Apr 2013 21:51:50 +0000 (16:51 -0500)]
libceph: fix byte order mismatch

A WATCH op includes an object version.  The version that's supplied
is incorrectly byte-swapped in osd_req_op_watch_init() where it's first
assigned (it's been this way since that code was first added).

The result is that the version sent to the osd is wrong, because
that value gets byte-swapped again in osd_req_encode_op().  This
is the source of a sparse warning related to improper byte order in
the assignment.

The approach of using the version to avoid a race is deprecated
(see http://tracker.ceph.com/issues/3871), and the watch parameter
is no longer even examined by the osd.  So fix the assignment in
osd_req_op_watch_init() so it no longer does the byte swap.

This resolves:
    http://tracker.ceph.com/issues/3847

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: activate support for layered images
Alex Elder [Fri, 26 Oct 2012 04:34:40 +0000 (23:34 -0500)]
rbd: activate support for layered images

Now that we have most everything in place to support layered rbd
images, enable support for them in the kernel client.  Issue a
warning to the log that the support is considered experimental
whenever a format 2 layered image is mapped.

Note that we also have to claim to support the STRIPINGV2 feature,
due to a mistake in the way the rbd CLI set up those flags.  This
feature can work if it has the right parameters, and safeguards
have been put in place to reject those images that do not have
compatible parameters.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: get and check striping parameters
Alex Elder [Sun, 21 Apr 2013 17:14:45 +0000 (12:14 -0500)]
rbd: get and check striping parameters

If an rbd format 2 image indicates it supports the STRIPINGV2
feature we need to find out its stripe unit and stripe count in
order to know whether we can use it.  We don't yet support fancy
striping fully, but if the default parameters are used the behavior
is indistinguishable from non-fancy striping.

This is necessary because some images require the STRIPINGV2 feature
even if they use the default parameters.  (Which is to say the feature
bit was erroneously set even if the feature was not used.)
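
A sketch of such a check follows.  It assumes that "default
parameters" means a stripe unit equal to the object size and a
stripe count of 1; that reading, and the name striping_check(), are
assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Accept striping parameters only if they match the assumed
 * defaults: stripe unit equal to the object size (1 << obj_order)
 * and a stripe count of 1.  Returns 0 if usable, -EINVAL otherwise.
 */
static int striping_check(uint8_t obj_order, uint64_t stripe_unit,
			  uint64_t stripe_count)
{
	uint64_t obj_size;

	assert(obj_order < 64);
	obj_size = (uint64_t)1 << obj_order;

	if (stripe_unit != obj_size)
		return -EINVAL;	/* unsupported stripe unit */
	if (stripe_count != 1)
		return -EINVAL;	/* unsupported stripe count */

	return 0;
}
```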

This resolves:
    http://tracker.ceph.com/issues/4709

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: have rbd_obj_method_sync() return transfer count
Alex Elder [Sun, 21 Apr 2013 17:14:45 +0000 (12:14 -0500)]
rbd: have rbd_obj_method_sync() return transfer count

Callers of rbd_obj_method_sync() don't know how many bytes of data
got returned by the class method call.  As a result, they have been
assuming enough got returned to decode whatever was expected.

This isn't safe.  We know how many bytes got transferred, so have
rbd_obj_method_sync() return that amount (rather than just 0) if
the call is successful.

Change all callers to use this return value to ensure decoding of
the results is done safely.

On the other hand, most callers of rbd_obj_method_sync() only
indicate success or failure, so all of *their* callers can simply
test for non-zero result.
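
The calling convention can be sketched as follows.  Both functions
are hypothetical userspace stand-ins: method_sync_model() plays the
role of the class method call and returns a byte count or a negative
errno, and get_u64_model() is a caller that refuses to decode a
short reply.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Stand-in for a class method call: copy the reply into buf and
 * return the number of bytes transferred, or a negative errno.
 */
static int method_sync_model(const void *reply, size_t reply_len,
			     void *buf, size_t buf_len)
{
	assert(buf);

	if (reply_len > buf_len)
		return -ERANGE;	/* reply does not fit (model choice) */
	if (reply_len)
		memcpy(buf, reply, reply_len);

	return (int)reply_len;
}

/* Decode a little-endian 64-bit value only if enough bytes came back. */
static int get_u64_model(const void *reply, size_t reply_len,
			 uint64_t *value)
{
	unsigned char buf[sizeof(uint64_t)];
	uint64_t v = 0;
	int ret;
	int i;

	ret = method_sync_model(reply, reply_len, buf, sizeof(buf));
	if (ret < 0)
		return ret;
	if ((size_t)ret < sizeof(buf))
		return -ERANGE;	/* short reply: do not decode */

	for (i = (int)sizeof(buf) - 1; i >= 0; i--)
		v = (v << 8) | buf[i];
	*value = v;

	return 0;	/* callers further up only see success or failure */
}
```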

This resolves:
    http://tracker.ceph.com/issues/4773

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: void data pointers for rbd_obj_method_sync()
Alex Elder [Sun, 21 Apr 2013 17:14:45 +0000 (12:14 -0500)]
rbd: void data pointers for rbd_obj_method_sync()

Make the inbound and outbound data parameters have void rather than
character type for rbd_obj_method_sync().  This makes it more clear
they don't expect typed data, and eliminates the need for some silly
type casts.

One more unrelated change: define the features buffer used in
_rbd_dev_v2_snap_features() to be a packed data structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: give rbd_obj_read_sync() buffer void type
Alex Elder [Sun, 21 Apr 2013 17:14:45 +0000 (12:14 -0500)]
rbd: give rbd_obj_read_sync() buffer void type

Make the buf parameter into which the data is to be read have type
void pointer.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agolibceph: validate timespec conversions
Alex Elder [Fri, 19 Apr 2013 20:34:50 +0000 (15:34 -0500)]
libceph: validate timespec conversions

A ceph timespec contains 32-bit unsigned values for its seconds and
nanoseconds components.  For a standard timespec, both fields are
signed, and the seconds field is almost surely 64 bits.

Add some explicit casts so the fact that this conversion is taking
place is obvious.  Also trip a bug if we ever try to put out of
range (negative or too big) values into a ceph timespec.
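
A userspace sketch of the range check is below.  The struct and
function names are hypothetical.  Two deliberate differences from
the description above: out-of-range input returns -1 instead of
tripping a bug, so the behavior can be tested, and the byte-order
conversion of the wire format is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the ceph timespec: two unsigned 32-bit fields. */
struct ceph_timespec_model {
	uint32_t tv_sec;
	uint32_t tv_nsec;
};

/* Model of a host timespec: signed and wider than the ceph one. */
struct host_timespec_model {
	int64_t tv_sec;
	int64_t tv_nsec;
};

/*
 * Convert a host timespec to the ceph form with explicit casts.
 * Returns 0 on success, -1 if either field is negative or too big.
 */
static int timespec_encode_model(struct ceph_timespec_model *tv,
				 const struct host_timespec_model *ts)
{
	assert(tv && ts);

	if (ts->tv_sec < 0 || ts->tv_sec > (int64_t)UINT32_MAX)
		return -1;
	if (ts->tv_nsec < 0 || ts->tv_nsec > (int64_t)UINT32_MAX)
		return -1;

	tv->tv_sec = (uint32_t)ts->tv_sec;
	tv->tv_nsec = (uint32_t)ts->tv_nsec;

	return 0;
}
```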

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agolibceph: add signed type limits
Alex Elder [Fri, 19 Apr 2013 20:34:50 +0000 (15:34 -0500)]
libceph: add signed type limits

Flesh out the limits defined in <linux/ceph/decode.h> to include the
maximum and minimum values for signed type S8, S16, S32, and S64.
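
One way such limits can be derived from the unsigned ones is shown
below.  This is a sketch with userspace typedefs; the exact macro
bodies in the header may differ, and limits_self_check() is a
hypothetical helper added only to compare against <stdint.h>.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t u8;
typedef uint16_t u16;
typedef uint32_t u32;
typedef uint64_t u64;
typedef int8_t s8;
typedef int16_t s16;
typedef int32_t s32;
typedef int64_t s64;

/* Unsigned maxima: all bits set, truncated to the type. */
#define U8_MAX	((u8)(~0U))
#define U16_MAX	((u16)(~0U))
#define U32_MAX	((u32)(~0U))
#define U64_MAX	((u64)(~0ULL))

/* Signed maxima: the unsigned maximum with the top bit cleared. */
#define S8_MAX	((s8)(U8_MAX >> 1))
#define S16_MAX	((s16)(U16_MAX >> 1))
#define S32_MAX	((s32)(U32_MAX >> 1))
#define S64_MAX	((s64)(U64_MAX >> 1))

/* Signed minima: one below the negated maximum (two's complement). */
#define S8_MIN	((s8)(-S8_MAX - 1))
#define S16_MIN	((s16)(-S16_MAX - 1))
#define S32_MIN	((s32)(-S32_MAX - 1))
#define S64_MIN	((s64)(-S64_MAX - 1LL))

/* Compare the derived limits with the ones from <stdint.h>. */
static void limits_self_check(void)
{
	assert(S8_MAX == INT8_MAX && S8_MIN == INT8_MIN);
	assert(S16_MAX == INT16_MAX && S16_MIN == INT16_MIN);
	assert(S32_MAX == INT32_MAX && S32_MIN == INT32_MIN);
	assert(S64_MAX == INT64_MAX && S64_MIN == INT64_MIN);
}
```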

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: enforce parent overlap
Alex Elder [Sun, 21 Apr 2013 05:32:07 +0000 (00:32 -0500)]
rbd: enforce parent overlap

A clone image has a defined overlap point with its parent image.
That is the byte offset beyond which the parent image has no
defined data to back the clone, and anything thereafter can be
viewed as being zero-filled by the clone image.

This is needed because a clone image can be resized.  If it gets
resized larger than the snapshot it is based on, the overlap defines
the original size.  If the clone gets resized downward below the
original size the new clone size defines the overlap.  If the clone
is subsequently resized to be larger, the overlap won't be increased
because the previous resize invalidated any parent data beyond that
point.
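
The resize rule above amounts to "the overlap can shrink but never
grow".  A small sketch under that reading, with hypothetical helper
names, which also shows how a request range would be split at the
overlap point:

```c
#include <assert.h>
#include <stdint.h>

/*
 * New overlap after the clone is resized to new_size bytes.
 * Shrinking below the overlap reduces it; growing never restores it.
 */
static uint64_t overlap_after_resize(uint64_t overlap, uint64_t new_size)
{
	return new_size < overlap ? new_size : overlap;
}

/*
 * Number of bytes at the start of [offset, offset + length) that
 * the parent can back.  Anything past that is treated as zeroes.
 */
static uint64_t parent_backed_len(uint64_t overlap, uint64_t offset,
				  uint64_t length)
{
	assert(length <= UINT64_MAX - offset);

	if (offset >= overlap)
		return 0;		/* entirely beyond the overlap */
	if (length > overlap - offset)
		return overlap - offset;	/* straddles the overlap */

	return length;			/* entirely within the overlap */
}
```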

This resolves:
    http://tracker.ceph.com/issues/4724

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: issue a copyup for layered writes
Alex Elder [Fri, 19 Apr 2013 20:34:50 +0000 (15:34 -0500)]
rbd: issue a copyup for layered writes

This implements the main copyup functionality for layered writes.

Here we add a copyup_pages field to the object request, which is
used only for copyup requests to keep track of the page array
containing data read from the parent image.

A copyup request is currently the only request rbd has that requires
two osd operations.  Because of this we handle copyup specially.
All image object requests get an osd request allocated when they are
created.  For a write request, if a copyup is required, the osd
request originally allocated is released, and a new one (with room
for two osd ops) is allocated to replace it.  A new function
rbd_osd_req_create_copyup() allocates an osd request suitable for
a copyup request.

The first op is then filled with a copyup object class method call,
supplying the array of pages containing data read from the parent.
The second op is filled in with the original write request.

The original request otherwise remains intact, and it describes the
original write request (found in the second osd op).  The presence
of the copyup op is sort of implicit; a non-null copyup_pages field
could be used to distinguish between a "normal" write request and a
request containing both a copyup call and a write.

This resolves:
    http://tracker.ceph.com/issues/3419

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: implement full object parent reads
Alex Elder [Fri, 19 Apr 2013 20:34:50 +0000 (15:34 -0500)]
rbd: implement full object parent reads

As a step toward implementing layered writes, implement reading the
data for a target object from the parent image for a write request
whose target object is known to not exist.  Add a copyup_pages field
to an image request to track the page array used (only) for such a
request.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: revalidate_disk upon rbd resize
Laurent Barbe [Wed, 10 Apr 2013 22:47:46 +0000 (17:47 -0500)]
rbd: revalidate_disk upon rbd resize

If an rbd disk is open when an rbd resize is done, the new size is
not visible to the filesystem.  As is done in the virtio-blk and dm
drivers, call revalidate_disk() to update the bd_inode size.

Signed-off-by: Laurent Barbe <laurent@ksperis.com>
Reviewed-by: Alex Elder <elder@inktank.com>
11 years agorbd: support page array image requests
Alex Elder [Fri, 19 Apr 2013 20:34:50 +0000 (15:34 -0500)]
rbd: support page array image requests

This patch adds the ability to build an image request whose data
will be written from or read into memory described by a page array.
(Previously only bio lists were supported.)

Originally this was going to define a new function for this purpose
but it was largely identical to the rbd_img_request_fill_bio().  So
instead, rbd_img_request_fill_bio() has been generalized to handle
both types of image request.

For the moment we still only fill image requests with bio data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: define zero_pages()
Alex Elder [Fri, 19 Apr 2013 20:34:50 +0000 (15:34 -0500)]
rbd: define zero_pages()

Define a new function zero_pages() that zeroes a range of memory
defined by a page array, along the lines of zero_bio_chain().  It
saves and restores the irq flags like bvec_kmap_irq() does, though I'm not
sure at this point that it's necessary.
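
The page-walking part of such a function can be sketched in
userspace as follows.  This is an illustration only: pages are
modeled as plain byte buffers, a 4096-byte page size is assumed, the
names are hypothetical, and the mapping and irq flag handling
mentioned above is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SHIFT 12	/* assumed: 4096-byte pages */
#define MODEL_PAGE_SIZE ((size_t)1 << MODEL_PAGE_SHIFT)

/*
 * Zero the byte range [offset, end) of the memory described by a
 * page array.  The range may start and end in the middle of a page.
 */
static void zero_pages_model(unsigned char **pages, uint64_t offset,
			     uint64_t end)
{
	unsigned char **page = &pages[offset >> MODEL_PAGE_SHIFT];

	assert(end > offset);

	while (offset < end) {
		/* Where in the current page to start. */
		size_t page_offset = (size_t)(offset & (MODEL_PAGE_SIZE - 1));
		/* Zero to the end of the page or of the range. */
		size_t length = MODEL_PAGE_SIZE - page_offset;

		if ((uint64_t)length > end - offset)
			length = (size_t)(end - offset);

		memset(*page + page_offset, 0, length);

		offset += length;
		page++;
	}
}
```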

Update rbd_img_obj_request_read_callback() to use the new function
if the object request contains page rather than bio data.

For the moment, only bio data is used for osd READ ops.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agorbd: encapsulate submission of image object requests
Alex Elder [Fri, 19 Apr 2013 20:34:50 +0000 (15:34 -0500)]
rbd: encapsulate submission of image object requests

Object requests that are part of an image request are subject to
some additional handling.  Define rbd_img_obj_request_submit() to
encapsulate that, and use it when initially submitting an image
object request, and when re-submitting it during callback of
an object existence check.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago rbd: define separate read and write format funcs
Alex Elder [Fri, 19 Apr 2013 20:34:50 +0000 (15:34 -0500)]
rbd: define separate read and write format funcs

Separate rbd_osd_req_format() into two functions, one for read
requests and the other for write requests.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago libceph: support pages for class request data
Alex Elder [Fri, 19 Apr 2013 20:34:49 +0000 (15:34 -0500)]
libceph: support pages for class request data

Add the ability to provide an array of pages as outbound request
data for object class method calls.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago libceph: fix two messenger bugs
Alex Elder [Fri, 19 Apr 2013 20:34:49 +0000 (15:34 -0500)]
libceph: fix two messenger bugs

This patch makes four small changes in the ceph messenger.

While getting copyup functionality working I found two bugs in the
messenger.  Existing paths through the code did not trigger these
problems, but they're fixed here:
    - In ceph_msg_data_pagelist_cursor_init(), the cursor's
      last_piece field was being checked against the length
      supplied.  This was OK until this commit:
          ccba6d98 libceph: implement multiple data items in a message
      That commit changed the cursor init routines to allow lengths
      to be supplied that exceeded the size of the current data
      item.  Because of this, we have to use the assigned cursor
      resid field rather than the provided length in determining
      whether the cursor points to the last piece of a data item.
    - In ceph_msg_data_add_pages(), a BUG_ON() was erroneously
      catching attempts to add page data to a message if the message
      already had data assigned to it. That was OK until that same
      commit, at which point it was fine for messages to have
      multiple data items. It slipped through because that BUG_ON()
      call was present twice in that function. (You can never be too
      careful.)
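
The first fix can be pictured with a small userspace sketch.  It is
not the messenger code: the structure, the function name and the page
size constant are hypothetical, and only the choice of which value
feeds the last_piece test mirrors the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical cut-down cursor; the real one has more fields. */
struct cursor_sketch {
	size_t resid;		/* bytes left in the current data item */
	size_t offset;
	bool last_piece;	/* cursor is on the item's final piece */
};

/*
 * item_length is the size of the current data item; length is what
 * the caller supplied, which may exceed item_length.
 */
static void cursor_init_sketch(struct cursor_sketch *cursor,
			       size_t item_length, size_t length)
{
	assert(cursor != NULL);

	cursor->resid = length < item_length ? length : item_length;
	cursor->offset = 0;
	/*
	 * Testing "length <= SKETCH_PAGE_SIZE" here would be wrong
	 * whenever length exceeds the item size; use resid instead.
	 */
	cursor->last_piece = cursor->resid <= SKETCH_PAGE_SIZE;
}
```

For a 100-byte data item and a supplied length of three pages, the
sketch sets resid to 100 and marks the cursor as being on the last
piece, which a test against the supplied length would not.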

In addition, two other minor things are changed:
    - In ceph_msg_data_cursor_init(), the local variable "data" was
      getting assigned twice.
    - In ceph_msg_data_advance(), it was assumed that the
      type-specific advance routine would set new_piece to true
      after it advanced past the last piece. That may have been
      fine, but since we check for that case we might as well set it
      explicitly in ceph_msg_data_advance().

This resolves:
    http://tracker.ceph.com/issues/4762

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>