[flud-devel] DHT Performance? Design?
bill at broadley.org
Mon Oct 29 02:41:15 PDT 2007
Alen Peacock wrote:
> Greetings Bill! Sounds like you've put a good deal of thought into
> this. The list for flud would be strikingly similar.
Indeed, even more so after your additional explanations, thanks.
> I'll answer your questions as well as I can inline, but please
> followup if something is unclear or [continues to] seem foolhardy ;)
> The wiki and READMEs (and mailing list) are the only real manual we
> have right now. This is a known deficiency which I would love to
> remedy. I'll include links as much as possible, but I'd love to hear
> more about how difficult it is to find the desired information so that
> I can try to make it better.
> I've been wanting to write a couple of papers that would organize
> wiki info more sanely and do some gapfilling for a long time. Maybe
> it is time to finally do that.
The "history of a file" document was very useful; I was mostly unclear
whether it described only new/modified files or all files, and similarly
whether all metadata is cached locally or only sourced from the DHT.
Especially this part:
  Update local master metadata
  1. master metadata is simply a list of filenames to storage keys (sK)
since storage keys alone wouldn't be enough to tell whether a file changed
without talking to the DHT.
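For what it's worth, here's roughly the local change-detection I had in mind,
assuming the master metadata also carries stat info (size/mtime) next to the
storage key; the structure and names are made up for illustration:

  import hashlib, os

  # Hypothetical local master metadata: path -> (storage_key, size, mtime).
  master = {}

  def storage_key(path):
      """Content hash used as the storage key (sK) in this sketch."""
      h = hashlib.sha256()
      with open(path, 'rb') as f:
          for chunk in iter(lambda: f.read(1 << 16), b''):
              h.update(chunk)
      return h.hexdigest()

  def needs_backup(path):
      """Decide from local metadata alone, without querying the DHT."""
      st = os.stat(path)
      entry = master.get(path)
      if entry is None:
          return True                        # never backed up
      sk, size, mtime = entry
      if (st.st_size, st.st_mtime) == (size, mtime):
          return False                       # stat unchanged: assume same
      return storage_key(path) != sk         # stat changed: confirm by content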
> Let me just confirm upfront that backup only occurs for modified
> files, and the "bare-metal" restore use case is not planned to be
> supported (at least not initially). Support for delta-compression is
Yeah, bare-metal wasn't high on my priority list either. I figure most
people install from install media and then back up /home over the net.
> planned at some future date
Ah, interesting, I had missed that before. I pondered differences, but then
decided that most files on my system don't change enough to justify it.
Sure, email folders (unless you use maildir), source code, and the
documents you're editing often respond well to diffs. But I suspect > 95% of
the files of > 95% of people don't respond well to diffs: photos, music, TV,
games, browser cache, google earth data files, fonts, system binaries, etc.
Although this can be used to your advantage. I was planning (without erasure
codes) on a default of 3 replicas per file, but for changing files to try to
keep 6 versions or so. Since most files on a disk are static, you end up with
a rather small percentage of files that change often, and those are often the
ones that are most painful to lose. So: 3 identical copies of /bin/login and
6 versions of Todays-thesis.tex.
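A rough sketch of that policy as code (the change-count heuristic and the
specific numbers are purely illustrative, not anything implemented):

  def retention_policy(recent_change_count):
      """Return (replicas, versions) for a file.  Static files get 3
      identical replicas; files that change often also keep ~6 historical
      versions.  Thresholds here are illustrative."""
      if recent_change_count == 0:
          return (3, 1)   # e.g. /bin/login: 3 identical copies, no history
      return (3, 6)       # e.g. Todays-thesis.tex: keep 6 recent versions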
> If a user has 150GB of data in 345k files to backup initially, this
> is going to take a while. But that is true of most online backup
> services currently -- for most users, backing up this much data over
> their broadband connection will take weeks, even for systems that
> don't use a DHT.
Correct, I should have been more clear, I meant after the network had
a full backup.
> I don't currently have performance numbers for how long it would
> take to do 345k queries with, say, 1000 nodes, but you are right in
> pointing out that if such queries are a bottleneck, that will be a
> very bad thing, and we will have to look at ways to improve
> performance. I'm glad you've brought it up, it's a good reminder that
> we need to be doing more performance evaluation of raw ops upfront.
Caching local metadata should help hugely, except of course in the case of a
full restore, but those should be rare. There is something called a sloppy
distributed hash table which can avoid the normally high latencies seen with
DHTs (mainly because random keys lead to random network placement, which is
somewhat of a worst case for the network). Alas, it's not clear to me that it
would help in flud-like usage.
> It is a kademlia-like DHT.
Gotcha. So statistically each client will end up with approximately 1/N of
the keys/files? Say I want to donate 150GB of disk space to get 150GB
of backups, and someone else wants to donate/get half as much or twice as much?
>> #3 how big is a block?
> flud blocks are not fixed size. Each block is currently 1/20th of
> full filesize (this number is dependent on the erasure coding scheme,
> which can be modified). We *are* currently doing some smart things to
> make this non-ouchy for filesystem storage overhead for small files.
Indeed, small files are frustratingly common. I've pondered numerous
solutions to this, trying to balance the chance of finding an exact
replica among your peers against efficiency in processing and metadata
storage. A block layer independent of files is one common approach. Another
is to roll directories below a threshold size into a single file (.tar or
related). I think dibs (fairly similar: python, erasure codes, etc.) sets
a minimum filesize threshold before it will use erasure codes.
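A minimal sketch of that second approach, rolling any directory below a size
threshold into a single tarball before it enters the backup pipeline (the
1 MB threshold and names are illustrative):

  import io, os, tarfile

  SMALL_DIR_THRESHOLD = 1 << 20   # 1 MB; purely illustrative

  def dir_size(path):
      return sum(os.path.getsize(os.path.join(root, f))
                 for root, _, files in os.walk(path) for f in files)

  def aggregate_small_dir(path):
      """If a directory is below the threshold, return it as one in-memory
      tarball so the backup layer sees a single file instead of many."""
      if dir_size(path) > SMALL_DIR_THRESHOLD:
          return None                      # back its files up individually
      buf = io.BytesIO()
      with tarfile.open(fileobj=buf, mode='w:gz') as tar:
          tar.add(path, arcname=os.path.basename(path))
      return buf.getvalue()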
> flud uses a filesystem overlay to aggregate small files on receiving
Ah, good to hear.
> nodes (currently using plain old compressed tar, but that can be
> switched out seamlessly for something more sophisticated in the
Duplicity is having similar discussions on what to use; I think xar looks most
likely. It's cross-platform, is being used for packages in OS X 10.5
as well as rpm5, and it has python bindings. Random post about it:
> In kademlia-style DHTs, the answer depends on a number of factors,
> including replication value and bucket depth, but log(n) is the
> appropriate worst-case answer.
What replication factor do you use? Placing pieces of erasure-coded files
on peers based on their key makes it rather hard to handle reputation, unless
it's a threshold where you just boot someone from the DHT if they aren't
reliable enough. My plan was to deal with a relatively small number of peers,
regardless of the number of peers in the DHT: just those you contract with.
So currently 345k files would turn into 345k * 20 blocks = 6.9M or so blocks,
so 6900 blocks per peer if 1000 peers? 1 block per peer if there's 6.9M peers?
>> #5 Say you have 1000 flud users, each with 345k files, how long would it take
>> for 1000 peers to make 345k queries, answer 345k queries from
>> its peers, and handle DHT routing (number of files * log n?) ?
> This is another question that I think can only be answered through
> performance evaluation. We can work out the number of ops that would
> be needed theoretically in this case, but a system under heavy load
> might experience timeouts, thundering herds in response to failed ops,
> cascading failures because of the above, etc.
Heh, well I don't think it's reasonable to assume 100% new users or 100%
peer restores. So the local metadata means that with 1000 users you just
have to handle the churn which seems like it would work just fine.
> I should mention that all metadata in flud is cached locally after
Ah, big difference, that removes much if not all of my concerns.
>> Then a second thread/process:
>> for <queue_for_encryption_and_send>
>> if (policy.security(file)==secure) #less storage efficient, more secure)
>> #file duplicate will only be within this key, i.e. the sysadmin
>> #responsible for a group of machines will benefit from duplicates
>> #within his machines
>> elseif (policy.security(file)==less_secure) #more storage efficient
>> # less secure this allows a peer you have a contract with to find out
>> # if you have a given file by sha256'ing it and see if it matches
> I've actually been planning to do something similar later on:
> allowing paranoid users to encrypt files before computing their
> storage key. It may even become the default mode of operation at some
> point, but I haven't convinced myself that it is necessary yet.
Yeah, I suspect many would be freaked out if their backup system revealed
to other peers that they were running a vulnerable version of a binary. I'm
not up to date on the, er, I think it's called convergent encryption, where
two folks can encrypt the same file with two different keys yet somehow
magically it's only stored once. Encrypting with the file's sha256 as the key
seemed like a good approximation for the lower-security stuff. I was mainly
thinking of system binaries from the installation media for this kind of
thing. Then things like email or personal files would be encrypted with their
public key.
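A minimal sketch of that sha256-as-key approximation (deterministic
encryption keyed by the file's own hash, so two peers holding the same file
produce the same ciphertext and storage key; the pyca/cryptography usage and
the nonce derivation here are my own assumptions, not any existing protocol):

  import hashlib
  from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

  def convergent_encrypt(plaintext):
      """'Less secure' mode: the key is derived from the content itself, so
      identical files converge to identical ciphertext (and can be deduped),
      at the cost of letting a peer test whether you hold a known file."""
      key = hashlib.sha256(plaintext).digest()        # 32-byte AES key
      nonce = hashlib.sha256(key).digest()[:16]       # deterministic CTR nonce
      enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
      ciphertext = enc.update(plaintext) + enc.finalize()
      storage_key = hashlib.sha256(ciphertext).hexdigest()
      return storage_key, key, ciphertext

The higher-security mode would instead encrypt under the user's own key
before the storage key is computed, giving up cross-user deduplication in
exchange for not leaking which files you hold.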
> flud does do aggregation of multiple storage ops to a single peer
> when backing up data. See
> flud.protocol.ClientPrimitves.AggregateStore. I added this, oh,
> probably a year or more ago due to some early performance testing I
> had done, and it improved throughput during backup dramatically. I'm
> not sure that it is documented anywhere but in the code. I have
> hesitated to document the protocol formally (the dead links at
> http://www.flud.org/wiki/FludProtocol) because it is still changing
Hrm, so encrypted files are erasure coded and stored on the peer with an ID
closest to the storage key? So with a large number of keys, any single peer
is likely to hold only a small fraction of your keys? I got around this by
just assuming folks would ask for 1-10GB contracts and only talk to 10-20 peers.
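For reference, "closest" in a kademlia-style DHT is XOR distance between node
ID and key, so a toy placement over a known peer list might look like this
(hypothetical peer-list handling, not flud's code):

  def xor_distance(node_id, key):
      """Kademlia's distance metric: bitwise XOR, compared as an integer."""
      return node_id ^ key

  def closest_peers(peer_ids, storage_key, k=3):
      """Pick the k peers whose IDs are nearest the storage key; a file's
      blocks are placed on (or located via) these peers."""
      return sorted(peer_ids, key=lambda pid: xor_distance(pid, storage_key))[:k]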
>> After a backup is finished a copy of all metadata (filename and stat related
>> info) will be sent to each peer for all files that are stored on that peer.
> flud does this too, but with a different technique. All
> filesystem-specific metadata is erasure coded and stored along with
> the blocks of the file among peers (not in the DHT). Each block also
> is stored with enough data to identify its owner[s] (see
> flud.protocol.BlockFile), so a node /can/ recover all blocks by simply
> asking nodes for blocks they own, even if the DHT (which is used to do
> this more efficiently) is somehow broken. The implementation is
> pretty close to what was disclosed in the ramblings in this thread:
Interesting; I still don't understand it quite well enough to put all that together.
>> Then for reputation:
>> foreach peer
>> if (max<min)
>> if (sha256 == peer_sha256)
> The 'verify' operation in flud.protocol performs this same check,
> but it is left to the higher-order caller to increment/decrement the
> trust metric. Moving the trust ops out of the primitives was a
> decision I recently came to, but I think there are some
> trust-metric-related operations that may still belong down in the
> primitives... It might be helpful for both of us to discuss this in
> more detail at some point.
> Some other times when trust is incremented/decremented: during
> successful/failed store/retrieve operations, when nodes can't
> self-certify their nodeIDs correctly, when nodes can't be contacted.
> Some failures also may require blacklisting of ip addresses/ranges.
> Trust metrics are still very wet cement in flud. The first real
> working code was introduced during this last week, in fact. There are
> a lot of things still to work out, so this is another area that I'd
> love to have some more discussion on. I added a new reference to the
> bibliography this morning that may be instructive if you haven't
> already seen it:
Indeed, it's a tough problem. Especially since you want to protect a peer
that lost a disk and wants to restore files.
>> So some differences from flud (from what I can tell), the p2p cloud will have
>> traffic for the following:
>> * filesystem churn resulting in peer <-> peer file transfers
>> * DHT traffic for folks looking for new contracts
>> * challenge/requests for managing reputation
>> * occasional restores
>> * in general very network efficient
>> Disadvantages vs Flud:
>> * lower storage efficiency, replicated files are found only among
>> select peers not the entire dht
>> * lower storage efficiency, replication is less efficient than erasure codes
> Well, you can certainly swap out replication for erasure codes at
> some point ;)
Sure, well it's only a design. I'm mostly tinkering with the schema, trying
to efficiently handle the types of metadata queries I need while supporting
either rebuilding the metadata database or just backing it up with the
user's files. Basically a battle between good enough (and time to implement)
and striving for something better. Glad to see flud is well past the
hand-waving stage.
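The kind of schema I've been tinkering with looks roughly like this (an
illustrative sqlite3 sketch; the tables and columns are mine, not flud's):

  import sqlite3

  conn = sqlite3.connect('backup_metadata.db')
  conn.executescript("""
  CREATE TABLE IF NOT EXISTS files (
      path        TEXT PRIMARY KEY,
      storage_key TEXT NOT NULL,     -- sha256 of (possibly encrypted) content
      size        INTEGER,
      mtime       REAL,
      versions    INTEGER DEFAULT 1  -- how many historical versions to keep
  );
  CREATE TABLE IF NOT EXISTS placements (
      storage_key TEXT NOT NULL,
      peer_id     TEXT NOT NULL,     -- which contracted peer holds a copy
      stored_at   REAL,
      PRIMARY KEY (storage_key, peer_id)
  );
  """)
  # Typical queries: "what changed since the last run?" (compare against a
  # fresh stat scan) and "which blocks does peer X hold?" (for challenges
  # and for restores).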
> As I alluded to above, 345k DHT queries would not be necessary in
> flud for recovery, either. If a node still has its metadata cache, it
> could use that directly.
> If not, a flud node could also try querying its storage-trading
> peers directly instead of looking up each file in the DHT. But that
> is a fallback in flud, and it is expected that querying the DHT
> directly would actually be a lot easier, because in the case of
> catastrophic disaster (stolen/destroyed computer), how do you figure
> out who "one's peers" are without querying every single node in the
> cloud? Maybe you plan to store that information in a side-channel?
Well, in my case you only have a fairly small number of peers (instead of
a large fraction of your DHT). In the process of monitoring your
storage peers (those you have a contract with) you challenge them
periodically, and of course your storage peers do the same to you. So in
the case of disaster you just run the client, load your private/public
key, and the client posts your public key to the DHT. As soon as your
storage peers start challenging you, you reply with "please send me
all blocks you stored for my public_key". I've pondered a special case
for the last block uploaded to a peer; presumably it would always be
the / entry (with a checksum-based copy-on-write filesystem, any change
triggers metadata changes up to and including the root). This would
allow disaster recovery to quickly get all the metadata back so you
could prioritize your most important files.
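A minimal sketch of the periodic challenge I have in mind (a salted hash over
the stored block so the storer can't precompute answers; all names here are
illustrative):

  import hashlib, os

  def make_nonce():
      """Challenger picks a fresh random nonce for every check."""
      return os.urandom(16)

  def challenge_response(block_data, nonce):
      """Run by the peer storing the block: prove possession by hashing
      nonce || block.  The challenger computes the same value from its own
      copy (or a precomputed answer) and compares."""
      return hashlib.sha256(nonce + block_data).hexdigest()

  # Challenger side: a wrong answer or a timeout decrements that peer's
  # reputation; enough failures and the contract is dropped.
  def verify(expected, reply):
      return expected == reply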
> All that said, I'm actually quite open to re-examining the role of
> the DHT in flud. In its current incarnation, flud can operate
> correctly with the DHT only serving as a peer introducer (this is due
> to the BlockFile modifications done several months ago that I
> described above).
I don't think I understand the blockfile modification, but I'll review
the related information again before I ask any more questions.
I'll check it out, it's a bit much to digest in a sitting. I did dig
up a DHT paper:
Improving Lookup Performance over a Widely-Deployed DHT
It analyzes Kad, a kademlia-based DHT in wide use. If I'm
understanding it right, it looks like an average query takes 10 seconds or so,
but if you do lookups in parallel you can get that down to 2-3 seconds
(top of page 8). I've yet to carefully read the paper though, so I might
be assigning completely incorrect ideas to the graph.
> But I *do* expect the DHT to work out fine as
> flud's first authority on where a file's blocks reside. Perhaps this
> is a blind expectation; it would be more rational to base such a
> belief on measured performance, so I'll have to make taking some
> measurements a higher priority at some point soon.
Sure, nothing speaks to real-world performance like real-world performance
;-). Assuming 2 seconds per lookup is a reasonable estimate, 345k files * 2s
comes to about 8 days. Then again, assuming 384kbit/s (the average DSL upload
around here) is a typical upload speed, 150GB is going to take quite a while
anyway. Your small-file coalescing might help substantially as well.
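The back-of-the-envelope arithmetic behind those two numbers:

  lookups = 345_000
  lookup_time = 2                           # seconds per (parallelized) DHT lookup
  print(lookups * lookup_time / 86_400)     # ~8.0 days of DHT queries

  data_bytes = 150e9                        # 150GB initial backup
  upload_Bps = 384_000 / 8                  # 384 kbit/s DSL upload, in bytes/sec
  print(data_bytes / upload_Bps / 86_400)   # ~36 days just to push the data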