[flud-devel] DHT justifications (was: DHT Performance? Design?)
alenlpeacock at gmail.com
Tue Nov 6 20:11:34 PST 2007
On Nov 5, 2007 10:43 PM, Bill Broadley <bill at broadley.org> wrote:
> Hmmm, seems like one thing you could do is challenge the requestor that want
> to register for a particular file.
> > It also prevents malicious parties from downloading content
> > that they don't have posession of in the first place. In order to
> Sort of, it just requires generating a pair of keys and asking for a contract.
Not when bug #65 is addressed.
> > keep things simple, I also lean towards symmetric trading
> > relationships (1byte for 1byte)
> Doesn't that mostly remove the advantage of your content addressable network
> and the payback of replicating metadata 12 times?
> Well if I backup N files, I'm going to get N*20 blocks, and for all
> blocks that already exist I'm likely to get a relatively random
> set of peers, right?
Not relatively random. If the file has been stored for any amount
of time, it's blocks will have already found their way to highly
> > Mainly because we can pipeline things with lots of ops happening
> > asynchronously, I'm not too worried about the files-per-sec issue (but
> The paper I mentioned previously seemed to think that a lookup every 2 seconds
> was about as fast as you are every going to get, the improvement from 4
> threads to 8 seems minimal, even 10 threads looks like 2 seconds or so. If
> your bandwidth is higher than 40 (20 2 second lookups) seconds per average
> filesize it's likely to be a bottleneck.(at least if the paper and my
> interpretation of it is correct). My uplink at home is 45KB/sec or so, my
> average filesize is 321KB.
Aha. I think I see why this bothers you so much. There is *1* DHT
lookup per file, /not/ 1 per block.
When a node first joins the flud network, it does the standard DHT
join op which populates its own routing tables. It also caches many
of the nodes that it encounters between runs. During its first backup
op, it chooses from among those nodes already in its local cache. The
tables may fill out more as storage keys are looked up or stored.
Success on initial storage/verify ops creates trust, and the node
naturally starts to prefer nodes that it has some history with.
Now, during restore it might have to do a lookup on some of the
nodeIDs where blocks are stored. But this shouldn't be done very
often, since once it finds a node it will again cache it, and since
most of a node's trading is done with a small subset of peers. It will
not need to query the DHT very often for block locations. It will
continue to do 1 DHT query for each file lookup.
Of course there are pathological cases where this would degenerate
into 1 lookup per block, but those are pretty contrived.
And, yes, for files it chooses to store convergently with
already-stored blocks, it may have to do a few lookups. But I think
that is an very fair tradeoff.
> You mentioned the
> erasure coding splits files into 20 pieces. How big are those 20 pieces
> relative to the filesize? How many pieces can be lost and still recover the file?
They are 1/20th the filesize. Currently, there are also 20 parity
blocks created. So 1/2 (20) of all blocks can be lost and the file is
still recoverable. As I mentioned previously, these numbers will need
to change as we learn what the right values are from usage in testing,
but with the trust system in place, I tend to think that they are
overly pessimistic, i.e., we may be able to get away with a lot less
erasure coding overhead in practice.
> Sounds very reasonable. Although I see nothing that would preclude flud
> being used as an archive service and an arbitrary number of files, versions,
> etc. As long as the storage node is around to "pay" for that storage.
If a node stops doing verify ops, the storing node can discard the
data. In order to do verify ops, a node must possess a copy of the
Hrm, except that isn't exactly true currently... if the initiating
node knows how large the file is, it can do correct verify ops without
the storing node detecting that it has faked it. Looks like the
verify op needs to be modified to do a 2-way challenge. This should
be fairly trivial.
> So say 10 peers need to store 1/20th of ubuntu i386 feisty /bin/login and they
> all ask the DHT for someone with the sha256 of that 1/20th, all find the same
> storage node (which we can call S) and all negotiate for the storage.
Minor correction: the 10 peers will ask the DHT for the block
metadata for the file and get back the list of nodes where each of its
blocks are stored.
> some nodes upgrade to gutsy, run a backup. Would you expect those nodes to
> ask S to store the new block? Or just troll the DHT for someone to exchange
> data with?
If gutsy's /bin/login is a different file than feisty's /bin/login
(which it most likely will be), then the first node to back it up will
find that no one has stored it yet, store it, and then the others can
either choose to enter into trading relationships with the nodes that
the first storer chose, or they can enter into new relationships with
nodes of their own preference.
> Maybe just switch to whole files if smaller than 8KB?
Yes, that's a valid strategy, and I almost did it early on. In
order to get the same survivability as our current scheme (where we
tolerate 20 failures), we'd need to store 21 copies of the small file.
That does reduce the metadata size by almost half, and might be worth
it. But it didn't seem to be much of a win on the fileset I compared
with (see the earlier mailing list post that had some numbers for
This could be added later.
> I think I understand most of flud, at least at the 50k foot level, I'm just
> wrestling having agreements with:
> * potentially (files in convergent storage * 20) peer agreements
Yeah... I really don't see how that could ever happen in practice.
> * a local dht store of:
> (average files per peer) * 20 (erasure coding blocks) * 12 (DHT replication
> factor) records.
It's files*12 records.
> * what fraction of an average users backup is going to be found in convergent
> storage that wouldn't be found among a small group of storage peers.
Hard to say. My intuition is that unless you group users/systems by
similarity, you're not going to see that much advantage to convergent
storage for smaller groups.
> Seems like the easy to share tools are the DHT (if you aren't using one of
> the existing DHT implementations), the erasure coding (ditto). Speaking
> of which is the erasure coding written in Python? How long does it take
> to erasure code a 1MB file? 10MB? 100MB? I did some benchmark tests with
> DIBS erasure coding vs some random C library I found. For identical coding it
> was something like 50 times faster (my memory is fuzzy, I'll try to look up
> the timings if there is interest).
The DHT is custom, but mostly stand-alone. flud recently switched
to zfec for erasure coding (created by zooko for Tahoe). Its
performance is quite satisfactory compared to rizzo, but of course has
to pay the python wrapper penalty. You can find benchmarks in the
More information about the flud-devel