[flud-devel] DHT Performance? Design?
alenlpeacock at gmail.com
Sun Oct 28 08:03:36 PDT 2007
On 10/27/07, Bill Broadley <bill at broadley.org> wrote:
> Greetings all,
> I've been tinkering with the idea of writing a p2p backup client with the
> usual goals:
> * Don't trust your peers more than you have to (encrypt blocks before you send them)
> * Donate storage to get storage
> * Find redundancy when feasible
> * Use DHT for finding peers/contracts, possibly wait for version 2
> * open source
> * Challenge peers to verify they stored what they claimed
> * Public keys and SHA256 checksums are very useful to ensure who you are
> talking to, identifying duplicate files, etc.
> * at least for version 1 I was going to use simple replication and
> recognize duplicates instead of erasure codes.
> * only private and public key should be necessary for disaster recovery
Greetings Bill! Sounds like you've put a good deal of thought into
this. flud's list of goals would look strikingly similar.
I'll answer your questions as well as I can inline, but please
follow up if something is unclear or still seems foolhardy ;)
> While a DHT is a very cool technology and very aesthetically pleasing I
> wasn't sure that it was up to the task of a query per file for a large number
> of peers distributed over the internet. Does anyone have numbers handy?
> Things like Azureus use it for finding peers, but that's relatively little
> info, just a single query (who's on this torrent?), and even that often
> seems to take minutes.
> Flud looks awesome, I've seen many related papers (iDIBS and MyriadStore among
> others), but precious few that actually publish source. Kudos to Alen. Of
> course I have some questions. I've tried to read as much on the website,
> p2p-hackers, and the flud mailing list as I could find. RTFM is welcome,
> please include a URL ;-).
The wiki and READMEs (and mailing list) are the only real manual we
have right now. This is a known deficiency which I would love to
remedy. I'll include links as much as possible, but I'd love to hear
more about how difficult it is to find the desired information so that
I can try to make it better.
For a long time I've been wanting to write a couple of papers that
would organize the wiki info more sanely and fill in some gaps. Maybe
it is time to finally do that.
> As an example my desktop uses 150GB, 50k directories, 345k files, average file
> size of 478KB.
> #1 Does every backup require 150GB to be read and encrypted then 345k queries
> of the DHT? I found a mention of only filename and storage keys
> stored locally.
Let me just confirm upfront that backup only occurs for modified
files, and the "bare-metal" restore use case is not planned to be
supported (at least not initially). Support for delta-compression is
planned for some future date.
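For what it's worth, here's a minimal sketch of what "only modified
files" detection can look like; the cache layout and helper below are
hypothetical, not flud's actual code:

    import os, hashlib

    def needs_backup(path, cache):
        """Return True if 'path' is new or changed since the last backup.

        'cache' maps path -> (mtime, size, sha256) from the previous run
        (a made-up local metadata store, not flud's real format).
        """
        st = os.stat(path)
        cached = cache.get(path)
        if cached is None:
            return True                     # never backed up before
        mtime, size, digest = cached
        if (st.st_mtime, st.st_size) == (mtime, size):
            return False                    # cheap check: looks unchanged
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 16), b''):
                h.update(chunk)
        return h.hexdigest() != digest      # content actually changed?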
If a user has 150GB of data in 345k files to back up initially, this
is going to take a while. But that is true of most online backup
services currently -- at a typical 1Mbit/s upstream, 150GB is roughly
two weeks of continuous upload -- so backing up this much data over a
broadband connection will take weeks even for systems that don't use
a DHT.
I don't currently have performance numbers for how long it would
take to do 345k queries with, say, 1000 nodes, but you are right in
pointing out that if such queries are a bottleneck, that will be a
very bad thing, and we will have to look at ways to improve
performance. I'm glad you've brought it up; it's a good reminder that
we need to be doing more performance evaluation of raw ops upfront.
> #2 Is this a new DHT? If not, which is it? Does it relate to an existing one?
It is a Kademlia-like DHT.
> #3 how big is a block?
flud blocks are not fixed size. Each block is currently 1/20th of
full filesize (this number is dependent on the erasure coding scheme,
which can be modified). We *are* currently doing some smart things to
make this non-ouchy for filesystem storage overhead for small files.
flud uses a filesystem overlay to aggregate small files on receiving
nodes (currently using plain old compressed tar, but that can be
switched out seamlessly for something more sophisticated in the
future).
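To make the sizing concrete, here's a back-of-envelope sketch; the
k/m values are illustrative assumptions, not flud's fixed parameters:

    import math

    # Illustrative erasure-coding parameters (assumptions): any k of
    # the m coded blocks are enough to rebuild the file.
    k, m = 20, 40

    def coded_block_size(filesize):
        """Approximate size of each coded block, in bytes."""
        return math.ceil(filesize / k)

    # For the 478KB average file mentioned above, each coded block is
    # only ~24KB, which is why small blocks get aggregated on the
    # storing node.
    print(coded_block_size(478 * 1024))     # -> 24474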
The other consequence of this choice is that large blocks can take a
long time to transfer, and thus have a better chance of not finishing
due to disruption. Being able to resume storage operations without
starting over if a transfer is interrupted is essential. flud
currently does nothing for this case, but it certainly will need to at
some point.
> #4 for 1000 peers how many nodes (on average) have to be contacted to query
> the DHT for sK? log n hops? Log n neighbors stored on each peer?
In Kademlia-style DHTs, the answer depends on a number of factors,
including the replication value and bucket depth, but log(n) is the
appropriate worst-case answer.
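As a rough illustration of what log(n) means at this scale (a
simplified model, not a measurement of flud's DHT):

    import math

    def worstcase_hops(n, b=1):
        """Worst-case lookup hops in a Kademlia-style DHT of n nodes.

        Each hop resolves roughly b more bits of the target ID (b=1 is
        plain Kademlia; larger b models wider routing tables).
        """
        return math.ceil(math.log(n, 2 ** b))

    print(worstcase_hops(1000))         # ~10 hops
    print(worstcase_hops(1000, b=4))    # ~3 hops with wider buckets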
> #5 Say you have 1000 flud users, each with 345k files, how long would it take
> for 1000 peers to make 345k queries, answer 345k queries from
> it's peers, and handle DHT routing (number of files * log n?) ?
This is another question that I think can only be answered through
performance evaluation. We can work out the number of ops that would
be needed theoretically in this case, but a system under heavy load
might experience timeouts, thundering herds in response to failed ops,
cascading failures because of the above, etc.
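Just to put a very hand-wavy number on it, here's the kind of
back-of-envelope estimate I mean; every constant below is an
assumption for illustration, not a measured flud figure:

    # Bill's scenario: 345k lookups per node in a 1000-node cloud.
    files_per_node = 345000
    hops_per_lookup = 10        # ~log2(1000), from the sketch above
    rtt = 0.1                   # assumed 100ms round trip per hop
    concurrency = 3             # assumed parallel outstanding lookups

    seconds = files_per_node * hops_per_lookup * rtt / concurrency
    print("%.1f days of DHT lookups" % (seconds / 86400.0))
    # -> roughly 1.3 days, ignoring timeouts, retries, and load effects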
> My planned approach was more along the lines of:
I'm going to go ahead and compare/contrast inline with your
pseudocode -- forgive me if this interrupts the flow too much.
http://www.flud.org/wiki/Story_of_a_File is sort of a 10,000 foot
view. I'll try to fill in some details.
> for <directories to be backed up>
> for files in directory
> # query local metadata
I should mention that all metadata in flud is cached locally after
being stored to the DHT. Nodes needn't query the DHT again for a
specific file's metadata (except to refresh it) when they want to
perform a check like this one.
> if (replicated(file)<desired_replication(directory))
> elseif (replicated(file)>desired_replication(directory))
This is very similar to what flud does for a file when it detects
missing coded blocks.
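In sketch form, the lookup order might look like this (the names are
hypothetical, not flud's actual API):

    def file_metadata(key, local_cache, dht):
        """Return a file's metadata, preferring the local cache.

        'local_cache' is a dict-like cache of metadata we stored
        earlier; 'dht' is anything with a get(key) method.  Both are
        stand-ins for flud's real components.
        """
        meta = local_cache.get(key)
        if meta is not None:
            return meta             # no DHT round trip needed
        meta = dht.get(key)         # fall back to (or refresh from) the DHT
        if meta is not None:
            local_cache[key] = meta
        return meta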
> Then a second thread/process:
> for <queue_for_encryption_and_send>
> if (policy.security(file)==secure) #less storage efficient, more secure)
> #file duplicate will only be within this key, i.e. the sysadmin
> #responsible for a group of machines will benefit from duplicates
> #within his machines
> elseif (policy.security(file)==less_secure) #more storage efficient
> # less secure this allows a peer you have a contract with to find out
> # if you have a given file by sha256'ing it and see if it matches
I've actually been planning to do something similar later on:
allowing paranoid users to encrypt files before computing their
storage key. It may even become the default mode of operation at some
point, but I haven't convinced myself that it is necessary yet.
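The difference between the two modes boils down to what gets hashed
to form the storage key; roughly (a sketch with a hypothetical
encrypt(), not flud's real key derivation):

    import hashlib

    def storage_key(data, mode, encrypt=None):
        """Two ways to derive a storage key from file contents.

        'convergent': hash the plaintext, so identical files map to the
        same key and duplicates are detectable (but a peer who already
        has the file can tell that you have it too).
        'paranoid': encrypt first with your own key, so nothing about
        the plaintext leaks, at the cost of cross-user deduplication.
        """
        if mode == 'convergent':
            return hashlib.sha256(data).hexdigest()
        elif mode == 'paranoid':
            return hashlib.sha256(encrypt(data)).hexdigest()
        raise ValueError("unknown mode: %r" % mode)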
> peer=find_peer(file) # find a peer with a contract for enough space
> # and isn't already storing a previous replication of said
> # file
> #track file, stat, sha256sum, and which peers locally
> <backup host> I need these files stored <list of sha256s>
> <selected peer> I do not have this list <potentially shorter list>
> <backup host> here's the file contents of <potentially shorter list>
> <backup host> I'll update the metadata listing you as the storage for
> those files, and update our contract.
> <selected peer> I'll update my metadata listing you as the owner (or partial
> owner and add the storage totals (or partial storage totals)
> to our contract.
flud does do aggregation of multiple storage ops to a single peer
when backing up data. See
flud.protocol.ClientPrimitves.AggregateStore. I added this, oh,
probably a year or more ago due to some early performance testing I
had done, and it improved throughput during backup dramatically. I'm
not sure that it is documented anywhere but in the code. I have
hesitated to document the protocol formally (the dead links at
http://www.flud.org/wiki/FludProtocol) because it is still changing.
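The idea is just to batch many small store requests headed for the
same peer into one message; a toy sketch (hypothetical names, not the
real protocol classes):

    from collections import defaultdict

    def batch_stores(pending_blocks, peer_for_block):
        """Group queued block-store ops by target peer.

        'pending_blocks' is an iterable of (block_id, data) tuples and
        'peer_for_block' maps a block_id to the peer chosen to hold it.
        Returns {peer: [(block_id, data), ...]} so each peer receives
        one aggregate request instead of many tiny ones.
        """
        batches = defaultdict(list)
        for block_id, data in pending_blocks:
            batches[peer_for_block[block_id]].append((block_id, data))
        return batches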
> After a backup is finished a copy of all metadata (filename and stat related
> info) will be sent to each peer for all files that are stored on that peer.
flud does this too, but with a different technique. All
filesystem-specific metadata is erasure coded and stored along with
the blocks of the file among peers (not in the DHT). Each block also
is stored with enough data to identify its owner[s] (see
flud.protocol.BlockFile), so a node /can/ recover all blocks by simply
asking nodes for blocks they own, even if the DHT (which is used to do
this more efficiently) is somehow broken. The implementation is
pretty close to what was disclosed in the ramblings in this thread:
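Conceptually, each stored block carries enough information to name
its owners, something along these lines (a simplified illustration,
not the actual BlockFile format):

    import json

    def wrap_block(coded_data, owner_ids):
        """Prepend a small header naming the block's owner nodeIDs.

        A node that has lost everything can then ask its storing peers
        "send me every block whose header lists my nodeID", even
        without the DHT.  The header layout here is made up.
        """
        header = json.dumps({"owners": owner_ids}).encode("utf-8")
        return len(header).to_bytes(4, "big") + header + coded_data

    def owners_of(block):
        """Read back the owner list from a wrapped block."""
        hlen = int.from_bytes(block[:4], "big")
        return json.loads(block[4:4 + hlen].decode("utf-8"))["owners"]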
> Then for reputation:
> foreach peer
> if (max<min)
> if (sha256 == peer_sha256)
The 'verify' operation in flud.protocol performs this same check,
but it is left to the higher-order caller to increment/decrement the
trust metric. Moving the trust ops out of the primitives was a
decision I recently came to, but I think there are some
trust-metric-related operations that may still belong down in the
primitives... It might be helpful for both of us to discuss this in
more detail at some point.
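A storage challenge of this flavor can be as simple as hashing a
fresh nonce together with the stored data on both sides; a sketch
(not the actual flud.protocol verify wire format):

    import hashlib, os

    def make_challenge():
        """Owner side: pick a random nonce to send with the verify request."""
        return os.urandom(16)

    def challenge_response(nonce, stored_block):
        """Storing peer side: prove possession by hashing nonce + block."""
        return hashlib.sha256(nonce + stored_block).hexdigest()

    def check_response(nonce, our_copy, response):
        """Owner side: recompute from our own copy (or a precomputed
        answer) and compare.  The higher-order caller, not this
        primitive, adjusts the peer's trust metric on the result."""
        return challenge_response(nonce, our_copy) == response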
Some other times when trust is incremented/decremented: during
successful/failed store/retrieve operations, when nodes can't
self-certify their nodeIDs correctly, and when nodes can't be
contacted. Some failures may also require blacklisting of IP
addresses/ranges.
Trust metrics are still very wet cement in flud. The first real
working code was introduced during this last week, in fact. There are
a lot of things still to work out, so this is another area that I'd
love to have some more discussion on. I added a new reference to the
bibliography this morning that may be instructive if you haven't
already seen it:
> So some differences from flud(from what I can tell), the p2p cloud will have
> traffic for the following:
> * filesystem churn resulting in peer <-> peer file transfers
> * DHT traffic for folks looking for new contracts
> * challenge/requests for managing reputation
> * occasional restores
> * in general very network efficient
> Disadvantages vs Flud:
> * lower storage efficiency, replicated files are found only among
> select peers not the entire dht
> * lower storage efficiency, replication is less efficient than erasure codes
Well, you can certainly swap out replication for erasure codes at
some point ;)
> Disaster recovery instead of 345k DHT queries would just wait for a challenge
> from one's peers. Then you could ask them for all the files they have of yours.
As I alluded to above, 345k DHT queries would not be necessary in
flud for recovery, either. If a node still has its metadata cache, it
could use that directly.
If not, a flud node could also try querying its storage-trading
peers directly instead of looking up each file in the DHT. But that
is a fallback in flud, and it is expected that querying the DHT
directly would actually be a lot easier, because in the case of
catastrophic disaster (stolen/destroyed computer), how do you figure
out who "one's peers" are without querying every single node in the
cloud? Maybe you plan to store that information in a side-channel?
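Put together, a node recovering after total loss would work through
the options in roughly this order (hypothetical method names,
sketching the fallbacks described above):

    def recover_block_locations(key, local_cache, dht, known_peers):
        """Locate a file's coded blocks after a disaster, cheapest first.

        All three paths are stand-ins for flud components:
        1. the local metadata cache (free, if it survived),
        2. a DHT lookup on the file's storage key,
        3. asking storage-trading peers directly for blocks we own.
        """
        meta = local_cache.get(key) or dht.get(key)
        if meta is not None:
            return meta["block_locations"]
        locations = []
        for peer in known_peers:        # only works if we still know them
            locations.extend(peer.list_blocks_owned_by_me())
        return locations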
All that said, I'm actually quite open to re-examining the role of
the DHT in flud. In its current incarnation, flud can operate
correctly with the DHT only serving as a peer introducer (this is due
to the BlockFile modifications done several months ago that I
described above). But I *do* expect the DHT to work out fine as
flud's first authority on where a file's blocks reside. Perhaps this
is a blind expectation; it would be more rational to base such a
belief on measured performance, so I'll have to make taking those
measurements a higher priority soon.