[flud-devel] DHT justifications (was: DHT Performance? Design?)

Bill Broadley bill at broadley.org
Tue Nov 13 21:16:56 PST 2007


Alen Peacock wrote:
> On Nov 6, 2007 11:15 PM, Bill Broadley <bill at broadley.org> wrote:
>>>  In order to do verify ops, a node must possess a copy of the
>>> data.
>> Er, what?  If I ask you to store a file, I prove I have it, you should
>> store it for as long as we have an agreement.  You can of course challenge
>> any file I've stored for you.  But you shouldn't continuously make me
>> reprove I have the data.  It is after all a backup service.
> 
>   If the initiating node doesn't keep a copy of the data, how does it
> do verify ops?

Well, I was thinking of something like challenging 100 files a day.  If the
challenges are random, you can tell with high confidence whether your peer has
silently deleted some major fraction of your storage.  So you could challenge
from the files you do have around (old and new), just not previous versions of
newer files.  Or, if you prefer, pre-calculate as many challenges as you want.
A peer could, in theory, log every challenge, do a statistical analysis of
coverage and its correlation with creation time, and then spend all those CPU
cycles and storage attacking files around a month old, trying to guess which
ones you might not have a copy of, have repeated an older challenge for, or
haven't challenged at all... but that doesn't seem like a practical attack.
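
To put rough numbers on that, here is a back-of-the-envelope sketch (just
illustrative Python, assuming challenges are drawn independently and uniformly
at random from the stored blocks):

def detection_probability(dropped_fraction, challenges_per_day, days):
    """Chance that random challenges catch a peer that silently dropped
    some fraction of the blocks it agreed to store."""
    c = challenges_per_day * days
    # The cheater escapes only if every single challenge happens to land
    # on a block it still holds.
    return 1.0 - (1.0 - dropped_fraction) ** c

# A peer that quietly dropped 5% of my blocks, challenged 100 times a day:
print(detection_probability(0.05, 100, 1))   # ~0.994 after a single day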

>   You could, of course, precompute a bunch of challenge/response pairs
> and then use those (we did this in ABS)

ABS is some flud predecessor?

> but if you don't have a local
> copy of the data, at some point you run out of challenge/response
> pairs and have to download a copy of the data again.  And if every node is
> doing that, it results in extra overall system load and reduced
> efficiency.

Agreed, but if you are storing 10-100k files for presumably 20-200 other
peers, how do you know which ones will get future challenges?  Certainly with
20/40 coding you only need to keep half the peers from throwing away your
data, or maybe keep all 40 peers from throwing away more than half of it.
Some files are static, so I don't expect creation date to be a realistic way
to cheat.  I kind of expected the challenges to be just a statistical sampling
of the files on a peer, not some comprehensive (and expensive) challenge of
every block on every peer (which would make it trivial to detect which blocks
there was no challenge for).
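
For what it's worth, here is roughly how I picture precomputed
challenge/response pairs working: just a sketch with made-up names and a
plain salted-hash scheme, nothing flud-specific.

import hashlib
import os

def make_challenges(data, count=100):
    """Precompute (nonce, expected digest) pairs for a block before handing
    it off, so it can be verified later without keeping a local copy."""
    pairs = []
    for _ in range(count):
        nonce = os.urandom(16)
        pairs.append((nonce, hashlib.sha256(nonce + data).hexdigest()))
    return pairs

def answer_challenge(stored_data, nonce):
    """What the storing peer computes when it gets challenged."""
    return hashlib.sha256(nonce + stored_data).hexdigest()

# To verify: send one unused nonce, compare the peer's answer against the
# precomputed digest, then retire that pair.  Once the pairs run out, the
# initiator has to re-fetch the data to mint more, which is the extra load
# described above.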

I expect the huge bulk of files are mostly static; some important files will
change often (but invisibly to the peer, which only sees encrypted blocks).
The value of a file with many versions seems to decrease with the number of
versions and how old it is (if it were more valuable you would, in general,
not depend on a backup system to do version control).  Even if that were
successfully attacked, it doesn't seem like a big deal.

My goal is basically to replace tape: take full advantage of the disk and
network connectivity to provide coverage similar to a normal tape backup
while reducing the expense of centralized disk storage.  The old-school tape
rotation was often level 0s four times a month, incrementals on weekdays, and
recycle the tapes at the end of the month.  That protects against the
disaster (load the last level 0 plus all incrementals since then), the more
subtle mistake ("I deleted a dir and didn't notice until 2 weeks later"), or
even the "wow, I made some aggressive changes to this document/source code
and I actually want the older version to start from again."  So my goal was
that if I donate 10GB to the p2p pool, I store 10GB from the pool, and I can
use the 10GB the pool owes me any way I like: ten 1GB level 0s if I'm really
paranoid, maybe one 10GB level 0 if I'm really brave, or some mix, say two
4GB level 0s plus the last 2GB of churn, which would include a daily snapshot
of every file that changes.  That way you might have a month's worth of
changes of that file you edit daily.
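
Just to make the bookkeeping concrete, a toy sketch of carving up that 10GB
of pool credit (numbers straight from the examples above; the split itself is
whatever policy the node wants):

POOL_CREDIT_GB = 10

plans = {
    "paranoid": {"level0_count": 10, "level0_gb": 1,  "churn_gb": 0},
    "brave":    {"level0_count": 1,  "level0_gb": 10, "churn_gb": 0},
    "mixed":    {"level0_count": 2,  "level0_gb": 4,  "churn_gb": 2},
}

for name, p in plans.items():
    used = p["level0_count"] * p["level0_gb"] + p["churn_gb"]
    assert used <= POOL_CREDIT_GB
    print(name, ":", p["level0_count"], "x", p["level0_gb"], "GB level 0s +",
          p["churn_gb"], "GB of daily churn =", used, "GB")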

Granted, erasure coding greatly decreases the need for replication, but it
doesn't remove the need for multiple versions.

>   flud is a pure mirroring backup service, not a generic storage grid.

Ah, ok, I had missed that.  I was thinking of a middle ground: not just
mirroring, more like a traditional backup system, but nowhere close to a true
distributed filesystem.

>  Rather than complicating things to support a non use case, flud just
> requires that the initiator retain a copy of the data for as long as
> it wants it to exist in the network.  Once verification operations
> cease, data can begin to decay.

Sounds kind of scary on the surface.  I was planning to depend on contracts
for float time: I'll store X storage in exchange for Y storage, and give you
a grace period of z days in case of a disaster, during which I won't delete
any of your files.  But statistical removal of files could work; you just
have to figure it in when you calculate a reasonable redundancy factor.

It does kind of scare me that if I had a disaster and, because of the
downtime, lost even a few % of my blocks, a fair number of my files (even if
it's a very small percentage) could be gone.  Seems like only some kind of
grace period can allow for a 100% restore, and I find 100% restores very
attractive ;-).
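
To put rough numbers on that worry, an illustrative sketch assuming a k-of-n
erasure code like the 20/40 above and independently lost blocks (real losses
are correlated by peer, so this is on the optimistic side):

from math import comb

def file_survival(k, n, block_loss_rate):
    """P(a file is reconstructable) with a k-of-n code when each of its n
    blocks is lost independently at the given rate."""
    p = 1.0 - block_loss_rate   # per-block survival probability
    return sum(comb(n, j) * p**j * (1.0 - p)**(n - j)
               for j in range(k, n + 1))

for loss in (0.05, 0.25, 0.40, 0.50):
    print("%2d%% of blocks lost -> %.6f of files recoverable"
          % (loss * 100, file_survival(20, 40, loss)))

# A few percent of independent block loss is a non-event with 20/40 coding,
# but losses concentrated during a long outage (or whole peers vanishing)
# push toward the 40-50% range, where whole files start disappearing; hence
# the appeal of a grace period for 100% restores.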



