[flud-devel] DHT Performance? Design?
Alen Peacock
alenlpeacock at gmail.com
Mon Oct 29 22:22:00 PDT 2007
On Oct 29, 2007 3:41 AM, Bill Broadley <bill at broadley.org> wrote:
>
> This history of a file document was very useful, I was mostly unclear
> if that was a new/modified file or all files. Similarly if all
> meta data was cached locally or only sourced from the DHT. Especially
> the:
> Update local master metadata
> 1. master metadata is simply a list of filenames to storage keys (sK)
>
> Since storage keys wouldn't be enough to see if a file changed without
> talking to the DHT.
Right. Master metadata in flud does actually also store a timestamp
(I just updated that sentence at Story_of_a_File). If a file changes
in flud, it's sK changes too. So all modifications to files are
detected locally.
> > planned at some future date
> > [http://www.flud.org/wiki/Architecture#Versioning].
>
> Ah, interesting, I had missed that before. I pondered differences, but then
> decided that most files on my system don't change enough to justify it.
I started out with the same assumption, and that's why support for
versions via delta compression will come later. But I've come to
believe that it really is a must-have, because there are a few types
of monster files that get modified all the time. Databases, for
example. And any application which uses a local database. On
Windows, one of the most notorious examples is Outlook, which backs
all emails with a DB, and which can grow to multiple GB for some users
(I've seen some as big as 80GB). Backing up this entire file each
time a single new email arrives is unworkable. I have several apps on
my Linux desktop that similarly update db files while they run
(luckily, none of them are as large as Outlook yet).
> Although this can be used to your advantage. I was planning (without erasure
> codes) to have a default of 3 replications for a file. But for changing files
> try to keep 6 versions or so. Since most files on a disk are static you end
> up with a rather small percentage of files that change often, and those files
> are often the ones that are most painful if you lose. So 3 identical copies
> of /bin/login and 6 versions of Todays-thesis.tex.
That's an interesting idea, particularly the "meta-idea" of allowing
the backup app to notice different classes of files through inference.
I like it.
> > It is a kademlia-like DHT.
>
> Gotcha. So statistically each client will end up with approximately 1/N of
> the keys/files? Say I want to donate 150GB of disk space to get 150GB
> of backups, and someone else wants to donate/get half as much or twice as much?
The file block metadata should be fairly evenly distributed among
the nodes in the DHT, yes. But since file data (and filesystem
metadata) is distributed by symmetry and choice, it is likely that a
flud node will tend to prefer to enter into trading contracts with a
fairly limited number of other nodes. See
http://www.flud.org/wiki/Fairness
> Duplicity is having similar discussions on what to use, I think xar looks most
> likely, it's especially cross platform, is being used for packages in OSX 10.5
> as well as rpm5. Oh and it has python bindings. Random post about it:
> http://www.n8gray.org/blog/2007/05/07/xar-rox/
Thanks for the reference, I hadn't heard of xar. I notice that it
stores its toc at the beginning of the file, which is very nice for
quick access. It has a downside, however. flud does a lot of
aggregating to existing archives. With tar, that is a simple append
operation. If the toc were required to be at the beginning, this
would require a full archive rewrite each time a file is added to an
existing archive, no? Maybe xar has a mode of operation to deal with
this.
> What replication factor do you use? Placing pieces of erasure coded files
> on peers based on their key makes it rather hard to handle reputation. Unless
> it's a threshold where you just boot someone from the DHT if they aren't
> reliable enough. My plan was to handle a relatively few peers irregardless of
> the number of peers in the DHT. Just those you contract with.
Currently, the replication factor is set at 12, but I've done no
work to determine if this is sufficient.
flud's trust metrics do not extend to the DHT. Some of the reasons
are outlined here:
http://flud.org/pipermail/flud-devel_flud.org/2006-July/000008.html,
others here: http://flud.org/pipermail/flud-devel_flud.org/2006-September/000013.html
(short summary: if the DHT fails, nodes can still operate correctly
including restoring data, and in a source-routed DHT like kademlia,
simple blacklisting is sufficient for disgarding misbehaving nodes).
Note that both these threads also address the issue of DHT abuse by
malicious nodes.
> Yeah, I suspect many would be freaked out if their backup system revealed
> to other peers that they were running a vulnerable version of a binary. I'm
> not up to date on the er, I think it's called convergent encryption where
> 2 folks can encrypt the same file with 2 different keys yet somehow magically
> it's only stored once. Seemed like encrypting with the sha256 would be a good
> approximation for the lower security stuff. I was mainly thinking of system
> binaries from the installation media for this kind of thing. Then things like
> email or personal files would be encrypted with their public key.
The one flaw with convergent storage (aka SIS:
single-instance-store, aka CAS: content-addressable storage) is that
it can reveal that others have copies of well-known files. This is
partly addressed in flud by not ever revealing the list of 'owners' of
a block of data to external nodes, i.e., when a node is queried for
block ownership info, it is first challenged to prove its identity
(self-certifying), and then the answering node never reveals who else
might also own that block. This is insufficient, of course, as
malicious nodes can violate that safe-guard, and a node can always
determine who is storing which blocks on it locally.
Encrypting before determining sK solves that, to the disadvantage of
convergent storage.
Note that for the truly paranoid, pre-sK encryption would also be
desirable for non-personal files. E.g., backing up a copy of my
iTunes library might allow Apple or the RIAA to discover that I've
made "illicit" copies of these files into the backup cloud.
> Hrm, so encrypted files are erasure coded and stored on the peer with an ID
> closest to the storage key? So for a large number of keys there's likely to
> be relatively few keys stored on a single peer? I got around this by just
> assuming folks would ask for 1-10GB contracts and only talk to 10-20 peers.
Kind of. The file block metadata is stored on the k nodes closest
to sK, but a node is entirely free to choose which nodes it stores the
actual data/fs-metadata on. And this is likely to be a limited number
of peers [http://www.flud.org/wiki/Architecture#Data_Storage].
> Indeed, it's a tough problem. Especially since you want to protect a peer
> that lost a disk and wants to restore files.
Yes. One technique that seems useful (and that I think first
appeared in the Samsara paper, but I could be mis-remembering) is to
probabalistically discard data for misbehaving nodes, with some sort
of exponential ramp-up. This works very well with a system that does
erasure coding, because it means that if a node goes down, it has some
time to come back up before any of its data will become unrecoverable
due to peers punishing it for being unreliable.
> Well in my case you only have a fairly small number of peers (instead of
> a large fraction of your DHT). In the process of monitoring your
> storage peers (those who you have a contract with) you challenge them
> periodically. Of course your storage peers do the same of you. So in
> the case of disaster you just run the client, load your private/public
> key, and the client posts your public key to the DHT. As soon as your
> storage peers start challenging you you reply with please send me
> all block you stored for my public_key.
Ah, yes. That seems like a tidy way to handle it. I like this very
much. Any thoughts on how long nodes keep trying to do verify ops
before giving up? A week? A month? It's kind of a hard parameter to
settle on, because it is a system global and can't really be
configured on a per-system basis without a lot of complexity. I'm
leaning towards something on the order of 10 days myself.
> I'll check it out, it's a bit much to digest in a sitting.
Undoubtedly. I've really strived for simplicity, but you get a
couple years into this stuff and things tend to get a bit more
complicated than you first expected. The striking thing is that in
order to provide a really good user experience, there is a lot more
complexity to still add.
> I did dig
> up a DHT paper:
> Improving Lookup Performance over a Widely-Deployed DHT
> http://mirage.cs.uoregon.edu/pub/infocom06-kad.pdf
>
> It analyzes Kad which is a kademilia based DHT in wide use. If I'm
> understanding it right it looks like an average query is 10 seconds or so
> but if you do them in parallel you can get it down to 2-3 seconds.
> (top of page 8). I've yet to carefully read the paper though, so I might
> be assigned completely incorrect ideas to the graph.
Very nice paper. I have some notes on it at
http://www.flud.org/wiki/RelatedPapers . It's been too long since I
last looked at it, though. I'll need to revisit.
> Sure, nothing speaks to real world performance like real world
> performance ;-). Assuming 2 seconds is reasonable estimate 345k
> files * 2 = 8 days. Then again assuming 384kbit (average DSL
> upload around here) is an average upload speed 150GB is going
> to take quite awhile. Your small file coalescing might help
> substantially as well.
I really suspect the true bottleneck is going to be uploading 150GB.
Keep in mind that flud isn't going to try to do all 345k files at
once; it will do a few files at a time and make steady progress. It
should also be noted that the DHT lookup ops can overlap with file
uploads. flud subscribes to the asynchronous networking model
(relying heavily on twisted python), so it is not a big deal to have
1000s of outstanding ops going concurrently.
I fear that I haven't done fair service to many of your questions,
but hope this has been helpful as you think about how you would like
to proceed with your design. I look forward to hearing more about
your progress and thoughts.
Alen
More information about the flud-devel
mailing list