[flud-devel] DHT justifications (was: DHT Performance? Design?)

Bill Broadley bill at broadley.org
Tue Nov 6 22:15:18 PST 2007

Alen Peacock wrote:
> On Nov 5, 2007 10:43 PM, Bill Broadley <bill at broadley.org> wrote:
>> Hmmm, seems like one thing you could do is challenge the requestor that wants
>> to register for a particular file.
>   Yes.  http://www.flud.org/bugzilla/show_bug.cgi?id=65


>   Not when bug #65 is addressed.


>>> keep things simple, I also lean towards symmetric trading
>>> relationships (1byte for 1byte)
>> Doesn't that mostly remove the advantage of your content addressable network
>> and the payback of replicating metadata 12 times?
>   How so?

Well, the person looking for storage goes through the DHT, finds the magic
node, and might well end up saving zero bytes of storage for his trouble.

>   Not relatively random.  If the file has been stored for any amount
> of time, its blocks will have already found their way to highly
> reliable/available nodes.

Storage blocks or metadata blocks?  Metadata should migrate to nodes
whose nodeIDs are close to the sha256.  For data it's whoever you ask that
agrees.  It seems to make the most sense to offer a contract to whoever holds
your metadata; that way you only need 1 machine up to get your blocks back
instead of 2.
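
To make "close to the sha256" concrete: Kademlia-style DHTs measure closeness as the XOR of two IDs treated as integers.  flud's DHT is described later in this thread as custom, so the sketch below is illustrative only, assuming 256-bit nodeIDs and an XOR metric:

```python
import hashlib
import os

def xor_distance(a: bytes, b: bytes) -> int:
    # Treat both IDs as big integers; smaller XOR means "closer".
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

# The metadata key for a file is the sha256 of its contents.
file_key = hashlib.sha256(b"contents of /bin/login").digest()

# Hypothetical nodeIDs (random 256-bit IDs for illustration).
node_ids = [os.urandom(32) for _ in range(50)]

# Metadata migrates toward the nodes whose IDs are closest to the key;
# with a replication factor of 12, the 12 closest nodes hold copies.
closest = sorted(node_ids, key=lambda n: xor_distance(n, file_key))[:12]
print(len(closest))  # 12
```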

>   Aha.  I think I see why this bothers you so much.  There is *1* DHT
> lookup per file, /not/ 1 per block.

Ah, big mistake... sorry.  I kinda inferred that from your description
of the 3 kinds of metadata, in particular:
2) block metadata, which contains the names of blocks belonging to a
file (sha256) and where they are stored (nodeIDs).  This is stored in
the DHT.

>   When a node first joins the flud network, it does the standard DHT
> join op which populates its own routing tables.  It also caches many
> of the nodes that it encounters between runs.  During its first backup
> op, it chooses from among those nodes already in its local cache.  The
> tables may fill out more as storage keys are looked up or stored.
> Success on initial storage/verify ops creates trust, and the node
> naturally starts to prefer nodes that it has some history with.

Ok.  Rereading the story of a file, this now makes more sense to me.
So the 20 blocks can be anywhere, but the metadata is in one place,
er, one place in the DHT + replications.

>   Now, during restore it might have to do a lookup on some of the
> nodeIDs where blocks are stored.  But this shouldn't be done very
> often, since once it finds a node it will again cache it, and since
> most of a node's trading is done with a small subset of peers. It will
> not need to query the DHT very often for block locations.  It will
> continue to do 1 DHT query for each file lookup.

So it's 20 times better than I thought, progress ;-).

>   Of course there are pathological cases where this would degenerate
> into 1 lookup per block, but those are pretty contrived.


>   And, yes, for files it chooses to store convergently with
> already-stored blocks, it may have to do a few lookups.  But I think
> that is a very fair tradeoff.


>   They are 1/20th the filesize.  Currently, there are also 20 parity
> blocks created.  So 1/2 (20) of all blocks can be lost and the file is
> still recoverable.  As I mentioned previously, these numbers will need
> to change as we learn what the right values are from usage in testing,
> but with the trust system in place, I tend to think that they are
> overly pessimistic, i.e., we may be able to get away with a lot less
> erasure coding overhead in practice.

Agreed, nothing like actual numbers.
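
To make the survivability arithmetic concrete: with 20 data blocks plus 20 parity blocks, any 20 of the 40 suffice to rebuild the file, at 2x storage overhead.  The toy sketch below shows the same principle at k=2, m=3 using a single XOR parity block; flud actually uses zfec-style erasure coding, so this is only an illustration of the idea:

```python
# Toy erasure code: k=2 data blocks, m=3 total (one XOR parity block).
# Any 2 of the 3 blocks recover the data -- the same principle that lets
# the 20+20 scheme survive the loss of any 20 of 40 blocks.
a = b"first half of file "
b = b"second half of file"
parity = bytes(x ^ y for x, y in zip(a, b))

# Suppose block `a` is lost; XOR the survivors to rebuild it.
recovered_a = bytes(x ^ y for x, y in zip(b, parity))
assert recovered_a == a

# The numbers for flud's stated parameters:
k, m = 20, 40
print(m - k)   # failures tolerated: 20
print(m / k)   # storage overhead: 2.0
```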

>> Sounds very reasonable.  Although I see nothing that would preclude flud
>> being used as an archive service and an arbitrary number of files, versions,
>> etc.  As long as the storage node is around to "pay" for that storage.
>   If a node stops doing verify ops, the storing node can discard the
> data.

Right, that's the part where you "pay".  Presumably a contract or agreement
would cover exactly what window to cover with zero loss, and then once it
expires some kind of decay function applies.

>  In order to do verify ops, a node must possess a copy of the
> data.

Er, what?  If I ask you to store a file, I prove I have it, you should
store it for as long as we have an agreement.  You can of course challenge
any file I've stored for you.  But you shouldn't continuously make me
reprove I have the data.  It is after all a backup service.

Hrm, actually now that you mention it, is there any notion of a version
of a file?    Or a negative file?  Say I have 10 files in a dir A on Monday 
(with a backup), then 5 files in the same dir on Tuesday (with a backup), and 
15 on wednesday (with a backup).  Can I restore using flud to the state on 
Monday, Tuesday, and Wednesday?

In any case I'm all for proving I had a file I asked you to store once, but
from then on as long as we have a happy relationship you should keep storing
said file.  If I want 300 snapshots of my filesystem that take a total of 
300GB you should be perfectly willing to store them in exchange for 300GB
which is just a single copy of your filesystem.

As long as peers agree on the amount of storage to exchange they shouldn't
place arbitrary restrictions on how that storage is used.  Sensible?

>   Hrm, except that isn't exactly true currently... if the initiating
> node knows how large the file is, it can do correct verify ops without
> the storing node detecting that it has faked it.  Looks like the
> verify op needs to be modified to do a 2-way challenge.  This should
> be fairly trivial.

Oh, by saying max = filesize, min = 0, then using the sha to prove it
has the file?  Instead, why not make the challenge one that can't be
answered from the whole-file sha alone?
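
One common shape for such a challenge: the verifier sends a fresh random nonce and the storer must hash the nonce together with the stored data, so a cached whole-file sha256 (plus the file size) is never enough to answer.  A minimal sketch, with names that are mine, not flud's API:

```python
import hashlib
import os

def make_challenge() -> bytes:
    # Fresh random nonce per challenge -- precomputed answers are useless.
    return os.urandom(32)

def respond(nonce: bytes, stored_block: bytes) -> bytes:
    # The storer must possess the actual block bytes to answer.
    return hashlib.sha256(nonce + stored_block).digest()

def verify(nonce: bytes, answer: bytes, local_copy: bytes) -> bool:
    # The verifier recomputes the answer from its own copy of the data.
    return answer == hashlib.sha256(nonce + local_copy).digest()

block = b"some erasure-coded block"
nonce = make_challenge()
assert verify(nonce, respond(nonce, block), block)
# Knowing only sha256(block) and the block's size does not let a
# cheater produce sha256(nonce + block).
```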

>> So say 10 peers need to store 1/20th of ubuntu i386 feisty /bin/login and they
>> all ask the DHT for someone with the sha256 of that 1/20th, all find the same
>> storage node (which we can call S) and all negotiate for the storage.
>   Minor correction: the 10 peers will ask the DHT for the block
> metadata for the file and get back the list of nodes where each of its
> blocks are stored.


>> Then
>> some nodes upgrade to gutsy, run a backup.  Would you expect those nodes to
>> ask S to store the new block?  Or just troll the DHT for someone to exchange
>> data with?
>   If gutsy's /bin/login is a different file than feisty's /bin/login
> (which it most likely will be), then the first node to back it up will
> find that no one has stored it yet, store it, and then the others can
> either choose to enter into trading relationships with the nodes that
> the first storer chose, or they can enter into new relationships with
> nodes of their own preference.

That makes sense.
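
The convergent-storage flow just described can be sketched as follows: before storing, a node keys the DHT on the sha256 of the file; if block metadata already exists, it can trade with the nodes the first storer chose, or store with its own preferred peers.  All names here are illustrative, and the DHT is stood in for by a plain dict:

```python
import hashlib

def backup_file(dht: dict, data: bytes, my_storers: list):
    """Sketch of convergent store: one DHT lookup per file."""
    key = hashlib.sha256(data).digest()
    meta = dht.get(key)
    if meta is not None:
        # Someone already stored this file (e.g. feisty's /bin/login);
        # either trade with the nodes listed in meta, or enter new
        # relationships and store fresh copies anyway.
        return meta
    # First storer: pick peers and record (block checksum, nodeID) pairs.
    meta = [(hashlib.sha256(data[i::20]).digest(), peer)
            for i, peer in enumerate(my_storers)]
    dht[key] = meta
    return meta

dht = {}
m1 = backup_file(dht, b"/bin/login from gutsy", [f"node{i}" for i in range(20)])
m2 = backup_file(dht, b"/bin/login from gutsy", ["othernode"])
assert m1 == m2   # the second backer-up finds the existing metadata
```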

>> Maybe just switch to whole files if smaller than 8KB?
>   Yes, that's a valid strategy, and I almost did it early on.  In
> order to get the same survivability as our current scheme (where we
> tolerate 20 failures), we'd need to store 21 copies of the small file.
>  That does reduce the metadata size by almost half, and might be worth
> it.  But it didn't seem to be much of a win on the fileset I compared
> with (see the earlier mailing list post that had some numbers for
> this).
>   This could be added later.

Ah, I had no idea that you could survive 20 failures for the cost of 2 x
filesize.  Sounds like it's only replication x metadata records.  Each record
would be a list of (block checksum, nodeID) pairs.  How big is a nodeID anyways?
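
A back-of-envelope size for one such record, assuming a 32-byte sha256 per block checksum and a 20-byte (160-bit, Kademlia-sized) nodeID; the actual nodeID width in flud may differ:

```python
# One block-metadata record: a list of (block checksum, nodeID) pairs.
blocks = 40          # 20 data + 20 parity blocks per file
sha256_len = 32      # bytes per block checksum
nodeid_len = 20      # assumed 160-bit nodeID; flud's may differ

record = blocks * (sha256_len + nodeid_len)
print(record)        # 2080 bytes per file
print(record * 12)   # 24960 bytes with DHT replication factor 12
```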

>> I think I understand most of flud, at least at the 50k foot level, I'm just
>> wrestling having agreements with:
>> * potentially (files in convergent storage * 20) peer agreements
>   Yeah... I really don't see how that could ever happen in practice.

Heh, yeah, that pesky factor of 20 I mistook.  Well, I don't care about peers
that choose to run themselves inefficiently, dealing with more peers than they
need.  The fact is you could have one DHT entry (not including replication) per
file, and as few storage peers as you prefer.  My main concern was that you
practically had to deal with a zillion peers and establish a relationship
with each, and there would be so many you could never actually track/audit them.

>> * a local dht store of:
>>    (average files per peer) * 20 (erasure coding blocks) * 12 (DHT replication
>>    factor) records.
>   It's files*12 records.

Big difference, thanks.

>> * what fraction of an average users backup is going to be found in convergent
>>    storage that wouldn't be found among a small group of storage peers.
>   Hard to say.  My intuition is that unless you group users/systems by
> similarity, you're not going to see that much advantage to convergent
> storage for smaller groups.

Dunno, backuppc works impressively well.  I back up 53 systems; the sum of
their storage is 1.6TB, and the sum of their incrementals is 35GB.  The sum of
all that (daily incrementals for a month + level 0s) is 600GB.  It used to
be much more impressive when I backed up a few hundred machines.  I know of
2 larger departments on campus if you want additional numbers.

So certainly global convergent storage could be much better.  At one file per
2 seconds, that would take 62 days for my store.  At 100 mbit the 600GB would
take 0.7 days to transfer.  The server does have 4 Opteron CPUs and a healthy
disk system, so I hope that with compression on one CPU, erasure coding on a
second, and metadata slinging on a third, I can at least sustain 1 MByte/sec
if not 10 (with a GigE connection).
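
Checking those numbers, assuming roughly 10 MB/s effective on the 100 mbit link (line rate is 12.5 MB/s before protocol overhead):

```python
store_bytes = 600e9            # the 600 GB store above
files = 62 * 86400 / 2         # files implied by "one file per 2 s for 62 days"

days_at_10MBps = store_bytes / 10e6 / 86400
print(round(days_at_10MBps, 2))   # 0.69 days to push 600 GB
print(int(files))                 # ~2.7 million files in the store
```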

My testing of Dibs couldn't keep up with the churn in my homedir 8-(.

Granted I wasn't expecting to replace a departmental backup system with flud
tomorrow, but I thought it might be an interesting data point.

>   The DHT is custom, but mostly stand-alone.  flud recently switched
> to zfec for erasure coding (created by zooko for Tahoe).  Its
> performance is quite satisfactory compared to rizzo, but of course it
> has to pay the python wrapper penalty.  You can find benchmarks in
> the expected places.

Cool, thanks.
