[flud-devel] DHT justifications (was: DHT Performance? Design?)

Bill Broadley bill at broadley.org
Fri Nov 2 04:03:30 PDT 2007


Ah, I've seen the light; the number of things we are in violent agreement on 
is increasing, as is my understanding of the reasons behind those 
decisions.  It all makes sense.

I've pondered how to quantify some of my guesses about the average filesystem:
things like the distribution of file sizes, churn rates, and the percentage
of files in common.  It's fairly trivial; something like running sha256
recursively over a tree and uploading the counts into a central SQL database
would do it.  Things like family photos, personally encoded MP3s, and anything
from iTunes (even the non-DRM stuff is fingerprinted, I believe) aren't going
to be shared among any peers.  I've no real idea how big the average filesystem
or home directory is.  Not that this lack of info will prevent a good p2p
backup client from being written.
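
Something like this, I imagine (a throwaway sketch; the SQL-upload side is
omitted and everything here is just my guess at what such a survey script
would do):

# Hypothetical survey: sha256 every file under a root and yield
# (digest, size) pairs that a central DB could aggregate to count
# cross-user duplicates.
import hashlib, os

def survey(root):
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                h = hashlib.sha256()
                with open(path, 'rb') as f:
                    for chunk in iter(lambda: f.read(1 << 20), b''):
                        h.update(chunk)
                yield h.hexdigest(), os.path.getsize(path)
            except OSError:
                pass  # unreadable or vanished file, skip it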

In my mind's eye I was thinking that, with the erasure code splitting a file 
into 20 slices, I would have some small N * 20 contracted peers: large 
contracts for big chunks of storage, of course, held for big chunks of time.  
I hadn't considered that if I had 350k files and there were 1 million peers, I 
might end up with a huge number of contracts.  So over a day of checking on 
peers and backing up churn, it wouldn't be surprising to not have a single 
contract change.

I did have a vague idea that the DHT might monitor the popularity of blocks
being stored, and that above a certain threshold peers might decide
that a file was common enough that it could be counted on to be available
without a contract.  So each peer would decide, for every file to be
backed up, to do one of 4 things:
1) Encrypt with my public_key, redundancy only within this key, store
    on a contracted peer.
2) Encrypt with the sha256(file), redundancy with anyone, store on
    a contracted peer.
3) Verify that it's popular beyond a certain threshold, do not store it
    anywhere, just assume that I'll be able to get it because it's
    so popular.
4) Publish this file to the DHT; anyone asking for the right sha256
    can have the plaintext file for free.

Candidates for #2, #3, and #4 might be /var/cache/apt, /pictures/nasa, 
/mirror/ubuntu, ~/src/gpl, /usr/bin, ~/myOpenArticleDownloads.  #1 might be 
~/mykeys, ~/mybooks, ~/mymusic, /var/spool/mail.
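
A strawman of that per-file decision (the path list, the threshold, and the
helper names below are all invented for illustration, not anything flud
actually does):

# Strawman policy: pick one of the 4 options above for a given file.
POPULAR_THRESHOLD = 100                   # peers already storing the block
SHAREABLE_PREFIXES = ('/usr/bin', '/var/cache/apt', '/mirror/ubuntu')

def choose_strategy(path, peers_storing, willing_to_publish=False):
    if willing_to_publish:
        return 'publish_plaintext'        # option 4: anyone may fetch it
    if peers_storing > POPULAR_THRESHOLD:
        return 'rely_on_popularity'       # option 3: store nothing at all
    if path.startswith(SHAREABLE_PREFIXES):
        return 'convergent_encrypt'       # option 2: key = sha256(file)
    return 'private_encrypt'              # option 1: key = my public key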

So each peer could at their discretion decide:
1) pay more for storage, enjoy minimal duplicate detection
2) pay less for storage, enjoy better duplicate detection
3) pay nothing for storage, accept less certainty, "store" only files that
    other people have stored
4) how generous to be to non-contracted peers, in both bandwidth and the
    scope of the files they export

After reading your mail I'm not at all sure this is better and/or more 
efficient than the flud design of contracting with each peer for potentially a 
single file.

Thoughts on peers being optionally generous?  Allowing downloads by
non-contracted peers for blocks they have stored?

A few things I can think of with flud:
* Tons of small contracts could be painful.  Say someone needs a block you
   have stored; it's 5MB, so they offer a 5MB contract.  Now you want to store
   6MB... do you now ask for a new contract?  Seems like you could have
   substantial storage overhead with tons of different-sized pieces (on average
   1/20th of the erasure-coded filesize).
* Selfish peers will try to store the most popular blocks so they can get
   tons of contracts for only the cost of storing a block once.  Not entirely
   sure if this is a bad thing or a good thing.  I'm leaning towards bad.
   Peers might, say, only accept contracts to store files they think they
   can make a substantial profit on.
* This might lead to overcommitted clients.  They want to back up 1GB of data,
   but they actually only have 100MB to donate, so they jockey to store
   blocks that they think they can accumulate 10 contracts for.

>   Let's say that a user in Ireland backs up the file '4 - Bubbly.mp3'
> (today's #1 Amazon.mp3 download) to the system for the first time.
> Now another user in Taiwan wants to backup the same file, but wants to
> take advantage of convergent storage.  How does their client discover
> if the file is already in the system?  They have to consult some sort
> of global mapping of file storage keys to find the location[s].  There
> are a number of ways to accomplish that using traditional centralized
> services, but this is exactly what DHTs were designed to do in a
> decentralized environment: map keys to values.

Yup, the DHT is perfect for this.

>   In your design, you aren't offerring global convergent storage,
> correct?  So the DHT doesn't fulfill this need for you.  In flud it
> does.

My eyes have been opened; indeed, I was thinking of mostly just spotting
redundancy among contracted peers.

>   There are certainly benefits to not doing global convergent storage
> (besides not needing a dht), and I'm sure you have thought of some of

I can't imagine a decentralized p2p backup system without some form
of DHT.

> them.  But there are also diminishing returns on doing convergent

The two I can think of are the limit on how many files per second can be
queried from a DHT, and the ratio of metadata in the DHT to file size.

> storage at all as the size of a clique becomes smaller.  You can
> counteract this by trying to form cliques via system similarity, but
> that has its own disadvantages, such as susceptibility to common
> vulnerabilities (contrast the Phoenix Recovery System to Pastiche).

I've queued the Phoenix paper for reading.

>   A web of trust might work.  I'm suspicious of how easy it seems to
> game trust-webs/gossip/etc. systems, but find those techniques
> alluring nonetheless.

Agreed.

>   In flud, there is a similar concept of private flud networks,
> wherein nodes interact exclusively with other nodes that share a
> common secret (or maybe at some point, shamir secret shares).  I'm not
> a big fan of that mode of operation, because I think that there are
> big advantages to choosing interactions from a pool of as many peers
> as possible, relying on trust metrics to let a node determine who is a
> reliable trader and who is not, much as you described.  It's there
> mainly for entities that might want complete control over their backup
> resources.

Sensible.

>   Right.  In both strategies (DHT | copies of contracts to peers), you
> have to store some metadata on nodes other than the ones where you
> store data, in order to be able to audit these contractual agreements
> in the case of a catastrophic loss.  The DHT approach still seems very
> simple and straightforward to me, largely because it formalizes where
> records are going to be stored/found in the protocol itself.  If you
> are going to instead place copies of contracts on secondary nodes, you
> have to decide 1) a [deterministic?] strategy for placement, 2) a
> strategy for replication, 3) define how often you rewrite these
> meta-contract files (once for every 'complete' backup session?, once
> for each file stored?, once per day?, etc) 4) a strategy for finding
> those secondary copies of records such that you can locate them when
> needed.  I imagine you've already settled on an overall scheme for
> this, and will discover how to tweak the details as you go.

I've made significant progress on understanding the flud DHT... although
I'm curious about one thing.  Say you find 10MB on a peer that you want to
ensure stays around.  You propose a contract with that peer.  Does that
contract say:
Any 10MB I want for 1 week in exchange for any 10MB you want?

Or does it say:
The 10MB with sha256 foo for 1 week in exchange for any 10MB you want?

Or maybe:
The 10MB with sha256 foo for 1 week in exchange for the 9.75MB with sha256
bar that you want?

The difference in behavior between #1 and #2 seems substantial, and #3
seems difficult.  If it's #1, what is the advantage of finding the block
you want if it saves you no storage?
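
To make the distinction concrete, the three readings might be modelled
roughly like this (purely illustrative; not flud's actual contract format):

# Illustrative only: three possible shapes for a storage contract.
from dataclasses import dataclass

@dataclass
class AnyForAny:              # reading 1: any 10MB of mine for any 10MB of yours
    size: int
    duration_days: int

@dataclass
class PinnedForAny:           # reading 2: the block with this sha256, for any 10MB
    block_sha256: str
    size: int
    duration_days: int

@dataclass
class PinnedForPinned:        # reading 3: my sha256 foo for your sha256 bar
    my_block_sha256: str
    their_block_sha256: str
    duration_days: int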

>   Where 'all your peers' > 12?

Yeah, somewhere between 1 and 10 times as many as required for the 20 slices
of an erasure-coded file, so 20-200 contracted peers.

>   This file experiences a high amount of churn, because it needs to be
> overwritten each time a backup occurs, right?  E.g., suppose you've
> backed up your 350k files, and now have 350k*replication_factor
> contracts, which have been copied to all your peers in the contract
> file.  Now, during the next backup, 10 files are modified/added -- do
> you have to push the file again with all (350k+10)*replication_factor
> entries to all peers?

Say 60 contracts to 60 peers.  I'm not sure how flud saves the peers (on 
average) disk space; I get how the DHT finds you a peer with the storage, but 
how does the contract reflect the savings from finding an already-stored file?

In my setup I was just finding replications among the 20-200 contracted peers, 
and a file replicated 20 times among those storage nodes would only result in 
a storage charge of 1/20th of the filesize for each peer that had registered 
it.  So basically:
Peer A -> Peer B  I need sha256 foo stored
Peer B -> Peer A  I already have it stored for 19 other peers,
                   I'm decrementing your contracted free storage
                   by 1/20th of the filesize
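
In other words, something like this accounting (my formulation of the scheme,
not flud's):

# Each peer registering an already-stored block is charged only its share.
def charge(block_size, registered_peers):
    return block_size / registered_peers

# Peer A registers a 5MB block that peer B already holds for 19 others:
print(charge(5 * 2**20, 20))   # 262144.0 bytes, i.e. 1/20th of the block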

Hmm, it just occurred to me that the redundancy I'm finding is random among 
the contracted peers.  So you would either have to query each peer, asking "Do 
you by chance have this block?", or go with the much more suitable option of a 
DHT for contracted peers.  Hrm, even uglier: since my peers can have contracts
with peers I don't know, there can't be a DHT of only the peers I have 
contracts with.

Hrm, maybe that's not so bad:

# For all files to be backed up, calculate slices and queue them.
for dir in dirs_to_backup:
    for file in files_that_need_backup(dir):
        queued_slices.extend(calculate_list_of_slices(file))

# For all slices that still need more redundancy, see if any contracted
# peer already has the slice.  If so, register it with that peer.
for peer in contracted_list:
    for slice in queued_slices:
        if redundancy(slice) < desired_redundancy(slice):
            if peer_has(peer, slice):
                assign_slice(peer, slice)

# Nobody has the remaining slices: pay full price for storage and send
# them to contracted peers.
for peer in contracted_list:
    for slice in queued_slices:
        if redundancy(slice) < desired_redundancy(slice):
            send_slice(peer, slice)

# All contracted peers are full and we still have queued slices:
# find new peers and contract with them.
peer = None
while queued_slices:
    slice = queued_slices[0]
    if peer is None:
        peer = find_peer_add_contract('1GB')
    if peer_has(peer, slice):
        assign_slice(peer, slice)
        queued_slices.pop(0)
    elif send_slice(peer, slice):
        queued_slices.pop(0)
    else:
        peer = None   # this peer is full; find a new one next iteration

Seems reasonable.  Basically, all contracted peers are queried for the slices,
and any slices found are registered.  Any leftover slices are stored on 
contracted peers until they all fill up; after that, new peers are found.

So for the above, say 100 files changed since my last backup.  That's
2000 slices after erasure coding, and I have 40 contracted peers:
me -> peer1  Here's my list of 2000 slices
peer1 -> me  I have these 50
me -> peer2  Here's my list of 1950 slices
peer2 -> me  I have these 75
....
peer40 -> me  I have these 8
# I have slices that my contracted peers don't have
# rats, so I have to pay full price

me -> peer1  how much disk do I have left on contract?
peer1 -> me  50MB
me -> peer1  store these 50MB of slices
me -> peer2  how much disk do I have left on contract?
...
me -> peer40  store these 50MB of slices
# rats, still have slices
me -> peer41  want a 128MB contract for 128MB?
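
The "which of these do you already have?" rounds above boil down to set
intersections, something like the following (names invented):

# Sketch of the offer/reply rounds: offer every remaining slice hash to
# each contracted peer, register what they already store, return the rest.
def backup_rounds(slice_hashes, contracted_peers, stored_by, register):
    remaining = set(slice_hashes)             # e.g. the 2000 slices above
    for peer in contracted_peers:
        have = remaining & stored_by(peer)    # peer's "I have these N" reply
        for h in have:
            register(peer, h)                 # charged at the shared rate
        remaining -= have
    return remaining                          # these must be paid for in full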

Seems like with a large flud swarm (say 100k peers):
for slice in slices:                     # e.g. the 2000 slices above
    peer = dht_get(slice.sha256())
    # wait approximately 5-10 hops, each with the latency of crossing
    # half the internet
    if peer:                                        # someone has the block
        if have_contract(peer):                     # do we have a contract?
            if contract_space_left(peer) > slice.size:    # is it big enough?
                register_slice(slice, peer)
                slice = None
        elif request_contract(peer, slice.size):    # no contract; can we get one?
            if contract_space_left(peer) > slice.size:    # we have one now
                register_slice(slice, peer)               # register
                slice = None
    if slice:                                       # the DHT got us nowhere
        for peer in list_of_contracted_peers:
            if contract_space_left(peer) > slice.size:    # contract has room
                if store_slice(slice, peer):              # store it
                    slice = None
                    break
    while slice:                         # peer list exhausted, still have slice
        peer = get_new_peer()
        if request_contract(peer, slice.size):
            if store_slice(slice, peer):
                slice = None

Does that look about right for flud?

How many bytes does a contract take?  A few integers, a timestamp, 2
nodeIDs, and then a double signature (one per peer) with 2048-bit public keys?
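
A back-of-the-envelope count, assuming exactly those fields (this is my guess
at a record layout, not flud's actual format):

# Rough size of a doubly signed contract record (all fields assumed).
node_id   = 32           # sha256-based nodeID
block_key = 32           # sha256 of the stored block
integers  = 3 * 8        # size, duration, timestamp as 64-bit ints
signature = 256          # one RSA-2048 signature

print(2 * node_id + block_key + integers + 2 * signature)   # 632 bytes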

>   Maybe you can address this by having one contract file per file
> backed up, and push it to all your peers?  Wouldn't it be convenient

Nope.

> to do this with a generic key-value storage system?  Maybe a smaller
> DHT that is formed just of your peers?  But then, that's sort of
> tricky, because unless your peers only backup to peers that you
> backup, their DHTs will contain peers that don't exist in yours, and
> yours in theirs...
> 
>   Maybe you can push only a subset of the contracts to each of the
> other peers, such that you have full coverage + redundancy?

Nope.  Just very few contracts that change rather slowly.

>   Or maybe you simply live with the overhead of pushing that contract
> manifest file to all peers each time, even on days when you've only
> backed up one changed file?

Nope.


>> Hmm, this might be the piece I'm missing.  So if ubuntu gutsy i386 /bin/i386
>> /bin/login has a sha256 of foo.  I can say get_dht(file_sha:foo) and get back
>> a nodeID of a node that stores it for me at no cost to myself?
> 
>   You can say get_dht(sha:foo) and get back the block metadata record
> that tells you where the blocks of that file are stored.  You can then
> contract with those nodes (if they are willing) to "store" those
> blocks for you (without actually xferring them), do verify ops on
> them, and/or re-store blocks to new nodes and update the block
> metadata record.

Your thoughts on peers serving out blocks to anyone that asks?  Maybe
throttled by bandwidth or blocks per day?  Maybe blocks/day per nodeID or
IP?
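
A per-requester daily quota would be enough to bound the cost of that
generosity; a minimal sketch, with the limit made up:

# Sketch: allow non-contracted requesters at most N blocks per day each.
import time
from collections import defaultdict

DAILY_BLOCK_LIMIT = 50                       # invented number

class DailyQuota:
    def __init__(self):
        self.counts = defaultdict(int)
        self.day = int(time.time()) // 86400

    def allow(self, requester_id):           # nodeID or IP
        today = int(time.time()) // 86400
        if today != self.day:                # new day, reset all counters
            self.counts.clear()
            self.day = today
        if self.counts[requester_id] >= DAILY_BLOCK_LIMIT:
            return False
        self.counts[requester_id] += 1
        return True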

>> Dunno, I guess eventually.  Say said attacker has at least 1 new $3k
>> server with a 10mbit internet connection and 8 CPUs.  They do a few DHT
>> queries to map out the peers around the ID they are interested in,
>> say a MP3 of a popular song.  They pick 8 IDs, sha256 the IDs and see
>> if they are closer then the existing peer.  With 1000 clients is
>> that going to take long?  Or are you assuming millions?  Not that
>> millions is completely unreasonable.  If 8.5 M people are on skype at
>> this very moment presumably enjoying free phone calls... why not 1M
>> for free backups.
> 
>   Generating random sha256s is useless in flud without having the
> corresponding public/private keypair from which the sha256 nodeID is
> supposed to be generated[*1] (public/private keygeneration is not
> cheap).

Actually it is pretty cheap.  So if I generate a private and public
key and then take a sha1 of the public key on my desktop (using 1 cpu)
I get:
gpg: done
7afe9350416044154425f39417e021808e710006  /home/bill/.gnupg/pubring.gpg
real    0m0.961

Does your DHT use 256 bit keys?  160? 1024?  2048?

So a relatively normal desktop ($1200, not counting the monitor) could do
something on the order of 360k a day.  If I was the RIAA, had a collection of
100k or so MP3s I wanted to watch for, and had two 8-CPU dual-socket $3k
servers, I could generate well over a million keypairs a day, then pick the
closest 100k after a week and get a pretty high hit ratio for folks querying
those keys.  Even if the DHT regularly had 10M folks on it (more than
skype or azureus).
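
The arithmetic behind those numbers, assuming ~1 second per keypair on one
core, 4 cores in the desktop, and 16 cores across the two servers:

# Back-of-the-envelope keypair generation rates (core counts assumed).
seconds_per_key = 0.961
per_core_per_day = 86400 / seconds_per_key     # ~89,900 keypairs/day/core
print(int(4 * per_core_per_day))               # ~360k/day on the desktop
print(int(16 * per_core_per_day))              # ~1.4M/day on the two servers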

>  Finding partial hash collisions ala
> Hashcash/Herbivore-Join-Op is even more expensive.

I'm not familiar with this part; to figure out what nodeID I'd get in the DHT,
I need to do this?  Once I have the magic private/public keys and nodeIDs I 
can then decide how many resources to dedicate to actually joining the DHT.

Maybe only watch the top 40 MP3s for each week.

>   And yes, when the network is small, the attack is easier  But when
> the network is small, it isn't a very appealing target for attack.
> And when it is large, it is a difficult target to attack.  Kademlia's
> preference for older nodes also helps here, especially since nodes in
> a backup dht would be incentivized to stay online [nearly permanently]
> instead of constantly churn (as in filesharing apps).

Agreed.

>   Yeah, there is a thread (or two) somewhere on this.  Basic summary
> is that it isn't that hard to prevent this type of DHT abuse by
> requiring *all* data in the DHT to be stored transparently.  This
> allows nodes to immediately reject data that doesn't conform to the
> strict format of correct data layout, and also allows nodes to audit
> the DHT data to confirm that it actually points to valid block data.
> Violating any of this results in getting blacklisted and having the
> DHT records purged.

Indeed, the stricter you are, the less illegal data you can put in
a DHT entry.
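
A toy version of that strictness: reject anything that doesn't parse as a
well-formed block-metadata record (my illustration of the idea, not flud's
real record format):

# Toy validator: a DHT value must be a non-empty list of
# (block_sha256, nodeID) pairs and nothing else.
import re

HEX256 = re.compile(r'^[0-9a-f]{64}$')

def valid_record(value):
    if not isinstance(value, list) or not value:
        return False
    for entry in value:
        if not (isinstance(entry, (list, tuple)) and len(entry) == 2):
            return False
        block_hash, node_id = entry
        if not all(isinstance(x, str) and HEX256.match(x)
                   for x in (block_hash, node_id)):
            return False
    return True    # anything failing this gets rejected and the sender flagged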

>   In other words, flud's DHT is not a generic key-data store.  It is a
> highly constrained and auditable key-value store.
> 
>   A node would get much better and higher-throughput data storage
> resources by simply storing files according to the protocol, so this
> type of attack is only useful to an irrational malicious node anyway.

Well, that better and higher-throughput storage requires paying with
storage.  A rational node with no storage to offer has no other choice.
Granted, that doesn't seem particularly likely.

>   This has helped me reevaluate, too.  I am of course prone to defend
> the existing design or at least rationalize the decisions made to
> arrive at it, but that's all for the sake of vetting it.  I am very
> intrigued by the areas where your thoughts differ from mine, and I
> think you've hit on some very promising ideas which I'd very much like
> to see tried out.  I'm also pleased to see that we have *a lot* of
> overlap in how we think this should be accomplished -- that's very
> nice validation.

Agreed, the missing pieces definitely had me thinking "why the hell would
you do that?"... but now I see at least some of the reasons.  Unfortunately I
now have 2 different ideas without a clear winner.  Time for a long dog walk
to ponder.



