[flud-devel] DHT justifications (was: DHT Performance? Design?)

Bill Broadley bill at broadley.org
Mon Nov 5 21:43:24 PST 2007

Alen Peacock wrote:
> On Nov 2, 2007 5:03 AM, Bill Broadley <bill at broadley.org> wrote:
>> Thoughts on peers being optionally generous?  Allowing non-contracted
>> with downloads for blocks they stored?
>   I lean heavily towards only allowing nodes that own content to
> download it.  This prevents the system from devolving into a generic
> filesharing application, where users can pass around sK to share
> files.

Hrm, I can see that.  Then again, a selfish client might offer a contract,
download the file, then disappear forever.  That presents a rather
awkward question.  If someone uses you as a block store and you store the
metadata in the DHT, then person #2 says "Hey, can I register with you
for that same file?".  If you say yes and allow them to restore, you are
potentially committing a crime.  I'm not a lawyer, but maybe common
carrier status could protect you.

Hmmm, it seems like one thing you could do is challenge any requestor that
wants to register for a particular file: demand a checksum of a random byte
range.  That would ensure the system couldn't be used as a distribution
mechanism.
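A minimal sketch of that challenge idea (all function names here are hypothetical, not flud's actual API): the storer picks a random byte range of the block and asks the requestor to hash it, which only a party that already holds the data can do.

```python
import hashlib
import os
import random

def make_challenge(block_length, max_span=4096):
    """Pick a random byte range within the block (hypothetical helper)."""
    start = random.randrange(block_length)
    length = min(max_span, block_length - start)
    return start, length

def answer_challenge(block, start, length):
    """Prove possession by hashing the requested range."""
    return hashlib.sha256(block[start:start + length]).hexdigest()

# The storer, who holds the block, computes the expected answer and
# compares it with what the requestor sends back.
block = os.urandom(64 * 1024)          # stand-in for a stored block
start, length = make_challenge(len(block))
expected = answer_challenge(block, start, length)
```

Knowing only the block's full sha256 is not enough to answer, since the hashed range is chosen fresh each time.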

>  It also prevents malicious parties from downloading content
> that they don't have possession of in the first place.  In order to

Sort of, it just requires generating a pair of keys and asking for a contract.

> keep things simple, I also lean towards symmetric trading
> relationships (1byte for 1byte)

Doesn't that mostly remove the advantage of your content-addressable network
and the payback of replicating metadata 12 times?

>, but there is nothing that would
> prevent a node from offering disproportionate trades, and that might
> be something that a node with popular content would want to do.

That seems to be the only way I can think of to take advantage of flud's
ability to globally find a given block.  So every file is split across
20 hosts.

>   True, a contract-per-file granularity is probably too fine-grained.
> In flud, "contracts" are very temporary, and should be replaced by
> Samsara-style unforgeable claims before expiring, i.e, contract
> lifetime should usually be measured in hours or less.  I am also in
> agreement that nodes should contract for more-than-current-file-size,
> and that there should be some sort of [dynamic?] floor value for
> contract/claim size.

Well, if I back up N files, I'm going to get N*20 blocks, and for all
blocks that already exist I'm likely to get a relatively random
set of peers, right?  Not to mention I specifically don't want more than one
of a given file's 20 blocks to be on the same host (for maximum protection
against host failure).

>   The storer chooses where to store data in flud, so selfish peers
> can't elect to do so themselves, except through collusion.  Two other

Well it's both, right?

> things to keep in mind here: 1) storing popular content causes extra
> bandwidth and IO costs to the storer, since all the requesting nodes
> do verify ops constantly, and the frequency of restore increases, and


> 2) storing nodes can choose to store popular content elsewhere than
> where it is already stored, so even if a node colludes to be the first
> storer of some popular file, it doesn't mean that nodes won't learn to
> avoid it as a storer later.  Some variant of #2 allows the system to
> avoid hotspots in the storage layer.

Sensible.  Of course, it's likely to be cheapest to get a contract with
the node already storing a block.  The paranoid can always decline to use
existing blocks and pay full price for their own storage.  Makes sense; I like
that you can pay more for storage and get a higher level of service.

>> The 2 I can think of is limitations of files per second that can
>> be queried from a DHT, and the ratio of metadata in the DHT vs file size.
>   Mainly because we can pipeline things with lots of ops happening
> asynchronously, I'm not too worried about the files-per-sec issue (but

The paper I mentioned previously seemed to conclude that a lookup every 2
seconds is about as fast as you are ever going to get; the improvement from 4
threads to 8 seems minimal, and even 10 threads still looks like roughly 2
seconds per lookup.  If your bandwidth uploads an average file in less than 40
seconds (20 lookups at 2 seconds each), the DHT is likely to be the bottleneck
(at least if the paper and my interpretation of it are correct).  My uplink at
home is 45KB/sec or so, and my average filesize is 321KB.

After making that point, I think I have an optimization that could help.  Run
a backup, erasure code the files into n*20 blocks (or so), then sort them by
their sha256.  That way you are likely to store more than one block on the
same peer, and even if not, you are likely to be within a hop or two (instead
of the average 1/2 of the DHT away).
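The sorting idea can be sketched like this (the erasure_code function below is a naive placeholder that just slices the file; real erasure coding adds redundancy):

```python
import hashlib
import os

def erasure_code(data, n_blocks=20):
    """Placeholder for real erasure coding: just slice the file into
    n_blocks pieces so the sorting idea can be demonstrated."""
    size = max(1, len(data) // n_blocks)
    return [data[i:i + size] for i in range(0, len(data), size)][:n_blocks]

files = [os.urandom(4096) for _ in range(5)]   # stand-ins for files to back up
blocks = [b for f in files for b in erasure_code(f)]

# Sort all blocks from the whole backup run by their sha256.  Consecutive
# keys are then close together in the DHT keyspace, so successive lookups
# tend to land on the same peer, or on one only a hop or two away.
blocks.sort(key=lambda b: hashlib.sha256(b).digest())
keys = [hashlib.sha256(b).hexdigest() for b in blocks]
```

Issuing the lookups in this key order is what lets the routing layer reuse nearby contacts instead of starting each lookup from scratch.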

> again, I need to get some perf numbers).
>   The ratio of meta/file seems quite acceptable, at least on synthetic
> workloads I've been pushing through flud (I'll have to dig up some
> numbers... thought I'd posted some to the list in late '05 / early
> '06).  Of course, this falls apart a bit if you push pathological
> cases, such as a user who backs up 1GB of data composed entirely of 1B
> files.

Heh, I have a small perl script that outputs the following; I'll drop it
on the web somewhere if anyone is interested:
Total directories =    48773
Total files       =   356482
Total size        =      109.3 GB
Average Directory =        7.3 files and     2.29 MB
Maximum Directory =     6246 files //var/lib/dpkg/info
Maximum File size =        1.03 GB //home/bill/na.dat
Average filesize  =      321.36 KB
      37988 files are between     0.00  B and     1.00  B
      35663 files are between     1.00  B and   268.00  B
      35685 files are between   268.00  B and   649.00  B
      35691 files are between   649.00  B and     1.11 KB
      35670 files are between     1.11 KB and     1.94 KB
      35669 files are between     1.94 KB and     3.38 KB
      35652 files are between     3.38 KB and     5.48 KB
      35650 files are between     5.48 KB and    12.14 KB
      35650 files are between    12.14 KB and    38.10 KB
      31253 files are between    38.10 KB and     1.03 GB
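For anyone who wants the totals without waiting for the perl script, here's a rough Python analogue of the summary portion (the bucketed histogram is left out; function name and output keys are my own invention):

```python
import os

def fs_stats(root):
    """Walk a tree and collect the simple file-size statistics shown
    above: directory count, file count, total and average size."""
    sizes, ndirs = [], 0
    for dirpath, dirnames, filenames in os.walk(root):
        ndirs += 1
        for name in filenames:
            try:
                sizes.append(os.path.getsize(os.path.join(dirpath, name)))
            except OSError:
                pass  # skip files that vanish or can't be stat'd
    total = sum(sizes)
    return {
        "dirs": ndirs,
        "files": len(sizes),
        "total_bytes": total,
        "avg_bytes": total / len(sizes) if sizes else 0,
    }
```

Run it as `fs_stats("/home")` (or whatever tree you back up) to get numbers comparable to the table above.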

So an average file on my filesystem would take me around 7 seconds to upload
(significantly less than the 40 seconds of DHT lookups).  You mentioned the
erasure coding splits files into 20 pieces.  How big are those 20 pieces
relative to the filesize?  How many pieces can be lost while still recovering
the file?

>   Yes.  Except that, as I mentioned above, a contract as long as 1
> week in flud seems a bit long -- the contract only needs to exist for
> as long as it takes a node to backup some data (if it has some ready)
> or an unforgeable claim that can be replaced with valid data later (if
> it does not yet have data ready).  Once data is backed up, the storing
> node is expected to keep it around forever.


>   Well, that's not exactly true... It will keep it around forever
> where "forever" has some decay, according to frequency of
> verify/store/retrieve operations.  A block that hasn't been 'touched'
> in a very long time can be deleted with some probability.  This is
> okay, because flud is not an archive service, it is a mirror of data
> backed up from another computer, and that other computer has to
> maintain the backup copy through these occasional 'pings' or expect it
> to be eventually purged.

Sounds very reasonable.  Although I see nothing that would preclude flud
being used as an archive service, with an arbitrary number of files, versions,
etc., as long as the storage node is around to "pay" for that storage.

>   There is also an explicit 'delete' op, some variant of which is used
> to obtain another temporary contract when a node wishes to exchange a
> claim for real data.

Sure, delete is very handy when you decide backing up file A is more important
than file B and you have already used the amount of storage available.

>> Say 60 contracts to 60 peers.  I'm not sure how flud saves the peers (on
>> average) disk space, I get how the DHT finds you a peer with the storage, but
>> how does the contract reflect savings in the finding of a already stored file?
>   Disk space is one scarce resource.  The other is bandwidth.
> Convergent storage saves the requestor the latter

Because the requestor doesn't have to send the data.

> , and the storer the former.

Because it doesn't have to store another copy.

So say 10 peers need to store 1/20th of ubuntu i386 feisty /bin/login: they
all ask the DHT for someone with the sha256 of that 1/20th, all find the same
storage node (which we can call S), and all negotiate for the storage.  Then
some nodes upgrade to gutsy and run a backup.  Would you expect those nodes to
ask S to store the new block?  Or just troll the DHT for someone to exchange
data with?
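A toy model of that convergence (the dict-as-DHT and store_block are hypothetical stand-ins, not flud's metadata layer):

```python
import hashlib

# Toy DHT: sha256 key -> address of the node already storing the block.
dht = {}

def store_block(block, proposed_storer):
    """First storer wins; later requestors are pointed at the same node."""
    key = hashlib.sha256(block).hexdigest()
    dht.setdefault(key, proposed_storer)
    return dht[key]

# Ten peers backing up the same 1/20th of feisty's /bin/login all resolve
# to the same storage node S and negotiate contracts instead of uploading.
login_slice = b"feisty /bin/login slice"
storers = {store_block(login_slice, "peer-%d" % i) for i in range(10)}

# After an upgrade the block's contents change, so its sha256 no longer
# matches and the first node to back it up becomes the new point of
# convergence for everyone else.
gutsy_slice = b"gutsy /bin/login slice"
new_storer = store_block(gutsy_slice, "peer-3")
```

In this toy, `storers` collapses to the single first storer, which is the hotspot concern from earlier in the thread.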

>  Figuring out which is more scarce is difficult, because it
> varies among users.  I would venture to say that today, for most
> dsl/cable users, upload bandwidth might be even more constrained than
> free disk space (certainly true in my own case).  If someone can save
> the 2 days in upload time that it takes to store their mp3 collection,
> that's an attractive deal.


>   Convergent storage can also be useful for a user's unique content,
> in the case where they are imaging a new machine but with all the
> content copied from the old.  Or giving the backup software a second
> try (maybe after an OS upgrade, etc.).  To tell you the truth. I think
> this is actually where convergent storage comes in most handy, and
> where users end up seeing most *actual* benefit.

Indeed, backing up your home desktop, laptop, and work desktop is likely
to turn up many files in common.  Then again, that kind of benefit would
exist even with a small group of peers with contracts.  The global
addressability seems mostly justified when you have files in common with
people you don't know and never would have met without the DHT.

If you are splitting a file in 20 (multiplying the metadata by 20)
and then replicating that 12 times (to protect from churn), it seems like
there would be a minimum filesize below which it is worth it.  Certainly if
metadatasize*12*20 is bigger than the file, I suspect there might be
a better way to handle it.  That metadata is at least a sha256 sum, right?
sha256*20*12 = 7680 bytes, larger than 80% or so of the files on my
filesystem.

Maybe just switch to whole files if smaller than 8KB?
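The arithmetic behind that threshold, with a hypothetical whole-file policy attached (the constant names and worth_splitting are mine, not flud's):

```python
HASH_BYTES = 32          # length of a sha256 digest
N_BLOCKS = 20            # erasure-coding pieces per file
REPLICATION = 12         # DHT replication factor

# Minimum metadata cost per file if each block key is replicated.
metadata_floor = HASH_BYTES * N_BLOCKS * REPLICATION   # 7680 bytes

def worth_splitting(file_size, threshold=metadata_floor):
    """Hypothetical policy: store files smaller than the ~8KB metadata
    floor whole, rather than erasure coding them into 20 blocks."""
    return file_size > threshold
```

By the file-size table above, that threshold covers roughly 80% of the files on the surveyed filesystem.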

>   Occasionally, I get the crazy feeling that some sort of namespaced
> DHT scheme could work for these types of cases.  And then I discard
> the idea because it seems unworkable.  I'm not aware of anyone
> pursuing it, could be an interesting diversion...

Yeah, or a single DHT for each peer, though it seems like it would be a lot of
overhead.

>   I really can't think of anything wrong with this approach at all
> right now.  I'm eager to hear how it works in practice, problems you
> encounter along the way, etc.

I'll keep you posted.  You could sort your stored blocks by their sha256
so you know which peer would be most likely to have a block; alas, you don't
have very good odds of finding peers that already have your blocks.

> should be 'block_metadata=dht_get(slice.sha256())'

Ah, right, I knew that.

>   My 2 cents: stick with your general design and tweak the details as
> you go.  I really would like to see where you end up with this.  Of
> course, if you do come to the conclusion that flud's design is the
> best possible design in the universe EVAR (well, or "mostly right,
> anyhow"), I'd much rather you pitch in and help out in speeding up its
> progression than start from scratch ;)

I think I understand most of flud, at least at the 50k-foot level; I'm just
wrestling with:
* potentially (files in convergent storage * 20) peer agreements
* a local DHT store of:
   (average files per peer) * 20 (erasure coding blocks) * 12 (DHT replication
   factor) records.
* the question of what fraction of an average user's backup is going to be
   found in convergent storage that wouldn't be found among a small group of
   storage peers.
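The record-count concern in the second bullet is easy to put numbers on (a back-of-envelope upper bound; it ignores the sharing that convergent storage buys you):

```python
def dht_records(files_per_peer, n_blocks=20, replication=12):
    """Rough upper bound on DHT records generated by one peer's backup:
    one record per erasure-coded block, times the replication factor."""
    return files_per_peer * n_blocks * replication

# With the ~356k files from the filesystem survey earlier in this message:
records = dht_records(356482)
```

Each file contributes 240 records, so even a modest desktop generates tens of millions of them across the network.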

>   Speaking of starting from scratch, if we end up having at least some
> components in common, that'd be really cool too...

Seems like the easy-to-share tools are the DHT (if you aren't using one of
the existing DHT implementations) and the erasure coding (ditto).  Speaking
of which, is the erasure coding written in Python?  How long does it take
to erasure code a 1MB file?  10MB?  100MB?  I did some benchmark tests with
DIBS erasure coding vs some random C library I found.  For identical coding it
was something like 50 times faster (my memory is fuzzy; I'll try to look up
the timings if there is interest).
