[flud-devel] DHT Performance? Design?

Bill Broadley bill at broadley.org
Tue Oct 30 23:32:05 PDT 2007


Alen Peacock wrote:
>   Right.  Master metadata in flud does actually also store a timestamp
> (I just updated that sentence at Story_of_a_File).  If a file changes
> in flud, it's sK changes too.  So all modifications to files are
> detected locally.

Cool, I wrestled with the trust the timestamp (and the integrity of the
hardware, os, and drivers vs actually computing the filechecksums.  Then
decided to be lazy, AFAICT almost all backup software trusts the timestamps.

>   I started out with the same assumption, and that's why support for
> versions via delta compression will come later.  But I've come to
> believe that it really is a must-have, because there are a few types
> of monster files that get modified all the time.  Databases, for
> example.

Yes, a delta compressed database is a much smaller upload than the entire 
database, then again neither is guaranteed to be consistent.  Seems like
the best way to backup a database is a snapshot/dump.  Doing effectively
a diff /var/mysql /var/yesterday/mysql will require less data be encoded
but doesn't insure the integrity of either.

>  And any application which uses a local database.  On
> Windows, one of the most notorious examples is Outlook, which backs
> all emails with a DB, and which can grow to multiple GB for some users
 > Since storage keys wouldn't be enough to see if a file changed without
 > talking to the DHT.

Yes, but by my understanding is basically invisible to a python backup program
calling opendir, readdir, open(file) and friends.  So if you end up writing
a special program that can handle open files or recognizes the database and
snapshots it why not have it dump to something the equivalent of maildir.

In any case I didn't mean to argue against delta compression, just wanted
to mention the pitfalls.

> (I've seen some as big as 80GB).  Backing up this entire file each
> time a single new email arrives is unworkable.

Agreed.

> I have several apps on
> my Linux desktop that similarly update db files while they run
> (luckily, none of them are as large as Outlook yet).

The "right" way is to handle backups explicitly, like maybe select * from 
table where records>time_of_last_backup.  I've definitely seen issues from
mysql and plone/zope when active and backed up as just a file.  I hope 
subversion is more backup friendly, but I've yet to check.

Alas outlook brings up an ugly detail, afaik open files are mostly invisible 
under windows.  At least backuppc punts and says "run a util to dump it,
then run backuppc".

>   That's an interesting idea, particularly the "meta-idea" of allowing
> the backup app to notice different classes of files through inference.
>  I like it.

Sure, basically I kept track of replications per pathname and replications
per sha256 so I could manage versions of the same name/path and replications
(for those files that change rarely).  Replacing replications with erasure 
coded piece should be relatively straight forward.

BTW, my method of:
<backup client> I need the files with these checksums backed up...A,B,C
<backup peer> Oh, I only need B
<backup client> send B

Also was connected to the notion of:
<backup beer> I'm storing A for 2 other peers, so I'll charge you for 1/3rd
               of the storage to store A.  I'm store C for 7 other peers I'll
               charge you for 1/8th.

I run a mirror for centos and ubuntu and a tripwire like setup based on that
info and the sha256's I collect that let you know the checksums of known good
files without having to try to chase the ever changing checksums with every
patch and package installation.  It's currently at 8.2M files and 166k 
packages.  Once things stabilize I was going to make it a charity peer for
the p2p network that would allow "free" restores for any checksum it happened
to have a file for.  Such methods (binary checksums) work rather nicely with
Dom0's and LVM snapshots to insure integrity (i.e. no failing disks) as well
as integrity (i.e. no rootkits).  Turns out my scanner client (basically 
sha256 checksums + python glue) and scanner server (sql schema for releases, 
packages, files, and releases) is pretty similar to much of what is needed for 
a backup client. I had decided on some kind of sql representation of metadata, 
I wanted many of the things I assume with SQL, thread safety, foreign keys, 
ability to append/delete willy nilly, snapshots, efficient selects because
of indexes, not having to have data fit in ram, etc.  Encryption, erasure 
codes, delta compression, cryto checksums, and huge numbers of 
files/directories end up being fairly cpu intensive and kinda screams out for 
threads, especially now that even a $1200 desktop these days might have 4 
cores.  Let alone a $3k server.

>   The file block metadata should be fairly evenly distributed among
> the nodes in the DHT, yes.  But since file data (and filesystem
> metadata) is distributed by symmetry and choice, it is likely that a
> flud node will tend to prefer to enter into trading contracts with a
> fairly limited number of other nodes.  See
> http://www.flud.org/wiki/Fairness

Hrm, so what makes the metadata in the DHT a good idea?  Storing potentially
fairly large (on average one peers metadata*12?) amount of data from folks you
don't have a vest interest in (nor they you).  Disaster recovery can be done
via more direct methods.  Not like the metadata does you much good if your
peers reject your contracts/block stores.

>   Thanks for the reference, I hadn't heard of xar.  I notice that it
> stores its toc at the beginning of the file, which is very nice for
> quick access.  It has a downside, however.  flud does a lot of
> aggregating to existing archives.  With tar, that is a simple append
> operation.  If the toc were required to be at the beginning, this
> would require a full archive rewrite each time a file is added to an
> existing archive, no?  Maybe xar has a mode of operation to deal with
> this.

I've downloaded it and compiled it.  I dont' see an easy append.  Dar looks
somewhat similar.  The big win with Xar is for things like cross platform
issues like ACLs and resource forks.  Seems what really is needed is the xar
magic for metadata with pointers into content addressable storage.

>   Currently, the replication factor is set at 12, but I've done no
> work to determine if this is sufficient.
> 
>   flud's trust metrics do not extend to the DHT.  Some of the reasons
> are outlined here:
> http://flud.org/pipermail/flud-devel_flud.org/2006-July/000008.html,

I just read that, and I see the reasons why the DHT might not be as vulnerable
as one might think.  But I believe the metadata is also in the storage blocks
right?  So if you maintain bidirection reputation/contract monitoring with 
your storage peers what does any interaction with the DHT get you (besides new
contracts)?


> others here: http://flud.org/pipermail/flud-devel_flud.org/2006-September/000013.html
> (short summary: if the DHT fails, nodes can still operate correctly
> including restoring data, and  in a source-routed DHT like kademlia,
> simple blacklisting is sufficient for disgarding misbehaving nodes).
> Note that both these threads also address the issue of DHT abuse by
> malicious nodes.

So it doesn't sounds like storing 12 times the metadata on each node is
proving much benefit.

>   The one flaw with convergent storage (aka SIS:
> single-instance-store, aka CAS: content-addressable storage) is that
> it can reveal that others have copies of well-known files.  This is

Less so if you ditch metadata in the DHT, only you contractual peers
would be concerned.  So each client would only store on average:
* somewhere around log(n) records for DHT routing
* somewhere around DHT_replication * public_key <-> IP mappings (to support
   dht lookups for peer -> IP).
* contract info and reputation per storage peer (likely in the 20-100 range?)
* metadata for local files (pathname, timestamp)
* metadata for remote blocks (pathname, peer, and slice)

In the normal backup case you access local metadata and sha256sum,
encrypt, erasure code, and upload your churn.

In disaster recovery you load your keys and wait, or maybe 
put_dht(public_key)=help I lost my disk

Speaking of which a peer failing to inquire about your reputation should
cause them to lose reputation.

> partly addressed in flud by not ever revealing the list of 'owners' of
> a block of data to external nodes, i.e., when a node is queried for
> block ownership info, it is first challenged to prove its identity
> (self-certifying), and then the answering node never reveals who else
> might also own that block.  This is insufficient, of course, as
> malicious nodes can violate that safe-guard, and a node can always
> determine who is storing which blocks on it locally.

Right, as well if you wanted to watch activity on a single file you could join 
the DHT with the ID of the file you were interested in.  Then any DHT
lookups would go to you.  Oh the replication, so you would need the nearest
12 IDs.

>   Encrypting before determining sK solves that, to the disadvantage of
> convergent storage.

Indeed.

>   Note that for the truly paranoid, pre-sK encryption would also be
> desirable for non-personal files.  E.g., backing up a copy of my
> iTunes library might allow Apple or the RIAA to discover that I've
> made "illicit" copies of these files into the backup cloud.

Indeed.


>   Kind of.  The file block metadata is stored on the k nodes closest
> to sK, but a node is entirely free to choose which nodes it stores the
> actual data/fs-metadata on.

Ah, I had mistakenly assumed the dht peer for a key was the block store
as well.  I'm learning.

>  And this is likely to be a limited number
> of peers [http://www.flud.org/wiki/Architecture#Data_Storage].

Reread.

>   Yes.  One technique that seems useful (and that I think first
> appeared in the Samsara paper, but I could be mis-remembering) is to
> probabalistically discard data for misbehaving nodes, with some sort
> of exponential ramp-up.  This works very well with a system that does
> erasure coding, because it means that if a node goes down, it has some
> time to come back up before any of its data will become unrecoverable
> due to peers punishing it for being unreliable.

Well contracts should have a length, and someone that performs well
should be rewarded with better terms on a contract.  Actions speak much
louder than words.  So new peers don't get a huge chunk of disk, nor a very 
long terms.  But when it comes to renewing, those who performed well
get bigger chunks and longer terms.  So once things reach a steady state
folks will have say 20 peers that would each wait say 2 weeks before throwing
away 10% of their files a day.

Maybe a suggested 256MB/256MB trade 72 hour contract?

If a particular peer was perceived as abusing you before you might only offer
a 0/256MB trade (I.e. you use their disk and you don't offer any space), and
let them earn the trust for you to store with them on more even terms.

>   Ah, yes.  That seems like a tidy way to handle it.  I like this very
> much.  Any thoughts on how long nodes keep trying to do verify ops
> before giving up?  A week?  A month? 

I'd say definitely as long as the contract, and double that to be friendly.

> It's kind of a hard parameter to
> settle on, because it is a system global and can't really be
> configured on a per-system basis without a lot of complexity.  I'm
> leaning towards something on the order of 10 days myself.

Not sure I see the complexity, all contracts should be monitored during the 
terms of the contract.  If I have a contract with a 20 folks on internet 2 for 
32Gb/32GB for a month I don't see why folks on dsl with 256MB/256MB for 3 days 
would care.

Hrm, maybe bandwidth should be mentioned in the contract.  Kinda silly to 
offer 100GB for a week if your bandwidth can only manage 1GB over that time.

>   Undoubtedly.  I've really strived for simplicity, but you get a
> couple years into this stuff and things tend to get a bit more
> complicated than you first expected.  The striking thing is that in
> order to provide a really good user experience, there is a lot more
> complexity to still add.

Agreed, It all looks justified to me, a DHT can be very useful, erasure
codes give more replication for the overhead, basically this is all
the penalty of not trusting your peers, er trusting that no more than
60% (or whatever your ratios for the erasure coding is) are evil.

>   I really suspect the true bottleneck is going to be uploading 150GB.

As it should be.  There are folks out there with 100 mbit connections
to the internet, so 150GB in a few hours isn't unreasonable.

>  Keep in mind that flud isn't going to try to do all 345k files at
> once; it will do a few files at a time and make steady progress.  It
> should also be noted that the DHT lookup ops can overlap with file
> uploads.  flud subscribes to the asynchronous networking model
> (relying heavily on twisted python), so it is not a big deal to have
> 1000s of outstanding ops going concurrently.

Cool, I'm a big fan of async and threads.

>   I fear that I haven't done fair service to many of your questions,
> but hope this has been helpful as you think about how you would like
> to proceed with your design.  I look forward to hearing more about
> your progress and thoughts.

I'm getting there, I think the only piece misssing is the cost/benefit to 
putting the metadata again in the DHT.



More information about the flud-devel mailing list