[flud-devel] filesystem analysis proposal

Bill Broadley bill at broadley.org
Wed Nov 14 01:18:20 PST 2007


I have a pretty narrow point of view on the world; I've been in charge of 
anywhere from a few machines to a few hundred at various times over the last 
15 years, spanning floppies, TK50s, a few generations each of DATs, DLTs, and 
disk backups.

I've got my own opinions on the typical filesystem, and while I've seen 
studies (like the first reference in the ABS paper, which covers machines 
from 1999), I'm not sure they are particularly relevant to the average 
internet user in 2008.  After all, web browsing, iPods, iTunes, BitTorrent, 
Google, podcasts, 6-12MP digital cameras, HDTV, H.264, YouTube, 
video-recording cellphones, TiVo, $15 broadband, $100 500GB disks, Netflix, 
$200 Linux desktops, and the related technological, social, software, and 
economic changes have significantly changed the landscape and practicality 
of p2p network backups.  Things have changed quite a bit in the last decade.

Seems like it could be really useful to quantify filesystems.  The numbers 
I'd hope to collect (a rough collection sketch follows this list) are:
* How many files per machine and per dir, and a histogram of the file sizes
* How fast the storage needs grow per machine
* What the churn is (how many files have changed in the last day/week)
* How the total size of files with a new modified timestamp compares to the
   total size of the rsync changes
* How much redundancy is on the disk (how many files are on the disk more
   than once)
* What percentage of files is in common (both by count and by total size)
   across a population of N people
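
Here is a minimal sketch of what that collector could look like, in Python 
(which I believe is flud's language anyway).  The function names, the log2 
size buckets, and the choice of SHA-1 for checksums are all just guesses on 
my part, nothing settled:

import hashlib
import os
import stat
import time
from collections import defaultdict

DAY, WEEK = 86400, 7 * 86400

def sha1_of(path, blocksize=1 << 20):
    # Whole-file SHA-1; slow on big trees, but it gives the duplicate
    # counts and the cross-user commonality checksums in one pass.
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.hexdigest()

def scan(root):
    now = time.time()
    stats = {'files': 0, 'bytes': 0, 'changed_day': 0, 'changed_week': 0,
             'hist': defaultdict(int)}   # log2(size) bucket -> file count
    seen = defaultdict(list)             # sha1 -> paths, for redundancy
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue
            if not stat.S_ISREG(st.st_mode):
                continue                 # skip symlinks, devices, etc.
            stats['files'] += 1
            stats['bytes'] += st.st_size
            stats['hist'][st.st_size.bit_length()] += 1
            age = now - st.st_mtime
            if age < DAY:
                stats['changed_day'] += 1
            if age < WEEK:
                stats['changed_week'] += 1
            seen[sha1_of(path)].append(path)
    # Redundancy: every copy beyond the first of an identical file.
    stats['dup_files'] = sum(len(p) - 1 for p in seen.values() if len(p) > 1)
    return stats, seen

The rsync-delta number is the one thing this doesn't cover: that would need 
a day-old snapshot to diff against, and then something like rsync --stats 
against it, whose "Literal data" vs "Matched data" lines are roughly the 
split we'd want.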

Hmm, maybe all of the above separately for important files (e.g. /home, 
/disk1, /disk2) vs everything that came off the install media.
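
One hedged way to draw that line on a Debian-style box: treat anything 
recorded in dpkg's file lists as "came off the install media" and scan the 
rest as user data.  The path below is Debian-specific and just an 
illustration:

import glob

def packaged_files():
    # Every path dpkg installed, one per line in each package's .list
    # file; anything the scanner finds outside this set is user data.
    owned = set()
    for lst in glob.glob('/var/lib/dpkg/info/*.list'):
        with open(lst, errors='replace') as f:
            owned.update(line.rstrip('\n') for line in f)
    return owned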

If we asked folks on this list and p2p-hackers (and any other venues that 
would be good places to ask) to run a program that would collect the above 
numbers and upload checksums, any guess on how many takers we would have?

Each submission (via email, anonymous email, Tor, or direct transfer) could 
have a random unique ID, a list of checksums, counts, summary info, and 
file sizes.  The summary info would just be a histogram of file sizes, total 
files, total storage, files that changed in the last week/day, the size of 
what changed in the last week/day, and the size of the rsync difference in 
the last week/day.
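
For concreteness, here's one possible shape for that submission, built from 
the scan() sketch earlier; JSON, the field names, and the 128-bit random ID 
are all placeholder choices:

import json
import os

def build_submission(stats, seen):
    return json.dumps({
        'id': os.urandom(16).hex(),            # random unique submission ID
        'total_files': stats['files'],
        'total_bytes': stats['bytes'],
        'size_histogram': dict(stats['hist']), # log2 bucket -> file count
        'changed_last_day': stats['changed_day'],
        'changed_last_week': stats['changed_week'],
        'checksums': sorted(seen),             # hashes only; no paths or names
    })

Shipping bare checksums with no paths or filenames keeps the submissions 
fairly anonymous while still letting us compute how much content a 
population of N people has in common.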

Of course the most important question... is it worth it?  Should I start
coding it?  Comments?




