[flud-devel] filesystem analysis proposal

Alen Peacock alenlpeacock at gmail.com
Wed Nov 14 09:25:09 PST 2007

Bill, I think this is an excellent idea, and would be very useful to
folks working on a variety of different projects.  It would certainly
be invaluable to flud.  I'm guessing tahoe might be interested, too
(although they might already have statistics gathered from allmydata
users).  We could ping a couple of other mailing lists, too.  I can't
imagine that anyone working on any sort of distributed file store
would not be game for submitting some numbers.

I'm not aware of any published studies since the 1999/2000
Bolosky/Douceur papers, and I agree that things have likely changed in
the last 8 years.

Let me know what I can do to help you out with this.


On Nov 14, 2007 2:18 AM, Bill Broadley <bill at broadley.org> wrote:
> I have a pretty narrow point of view on the world: I've been in charge of
> anywhere from a few machines to a few hundred at various times over the last
> 15 years, spanning floppies, TK50s, a few generations each of DATs and DLTs,
> and disk backups.
> I've got my own opinions on the typical filesystem, and while I've seen
> studies (like the first reference in the ABS paper machines from 1999) I'm not
> sure they are particularly relevant to the average internet user in 2008.  After
> all web-browsing, ipods, itunes, bittorrent, google, podcasts, 6-12MP digital
> cameras, HDTV, h264, youtube, video recording cellphones, TIVO, $15 broadband,
> $100 500GB disks, netflix, $200 linux desktops, and related technology,
> social, software, and economic changes have significantly changed the
> landscape and practicality of p2p network backups.  Things have changed quite
> a bit in the last decade.
> It seems like it could be really useful to quantify filesystems.  The numbers
> I'd hope to collect are:
> * How many files per machine, per dir, and a histogram of the file sizes
> * How fast do the storage needs grow per machine
> * What is the churn?  (How many files have changed in the last day/week?)
> * How does the total size of the files with a new modified timestamp compare
>    with the total size of the rsync changes?
> * How much redundancy is on the disk (how many files are on the disk more
>    than once)
> * What percentage of files is in common (both number and total size) between
>    a population of N people.
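
A minimal sketch of how the per-machine numbers in the list above might be
gathered.  This is not code from flud or from Bill's proposal: the function
names, the power-of-two size buckets, the use of SHA-256, and the use of mtime
as the "changed" signal are all assumptions made for illustration.  It does
not cover the per-directory counts, growth over time, or the rsync comparison.

```python
import hashlib
import os
import time
from collections import Counter, defaultdict

DAY = 24 * 60 * 60  # seconds


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never fully held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(root, now=None):
    """Walk `root`; return (stats, size histogram, checksum -> paths).

    Histogram bucket b is size.bit_length(): bucket 0 holds empty files and
    bucket b >= 1 holds sizes from 2**(b-1) up to 2**b - 1 bytes (assumed
    bucketing, chosen only because it is compact).
    """
    now = time.time() if now is None else now
    stats = {"files": 0, "bytes": 0, "changed_day": 0, "changed_week": 0}
    histogram = Counter()
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets, devices
            try:
                info = os.stat(path)
                digest = sha256_of(path)
            except OSError:
                continue  # unreadable or vanished mid-scan
            stats["files"] += 1
            stats["bytes"] += info.st_size
            histogram[info.st_size.bit_length()] += 1
            age = now - info.st_mtime
            if age <= DAY:
                stats["changed_day"] += 1
            if age <= 7 * DAY:
                stats["changed_week"] += 1
            by_digest[digest].append(path)
    # Redundancy: every copy beyond the first of identical content.
    stats["duplicate_files"] = sum(len(p) - 1 for p in by_digest.values())
    return stats, histogram, by_digest
```

Calling scan("/home") would give one machine's counts; only the checksums and
sizes, not the paths, would need to leave the machine.
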
> Hmm, maybe all the above for important files (i.e. /home, /disk1, /disk2) vs
> everything that came off the install media.
> If we asked folks on this list and p2p hackers (and any other venues that
> would be a good place to ask) to run a program that would collect the above
> numbers and upload checksums, any guess on how many takers we would have?
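
Once checksums from several people are in hand, the "files in common" number
could be computed along these lines.  This is only a sketch: the input shape
(one dict per participant mapping checksum to file size) and the definition of
"in common" as "held by more than one participant" are assumptions, not
something the proposal specifies.

```python
from collections import Counter


def shared_fraction(submissions):
    """Return (fraction of files, fraction of bytes) held by 2+ participants.

    `submissions` is a list with one dict per participant, each mapping a
    content checksum to that file's size in bytes (hypothetical format).
    """
    holders = Counter()
    for submission in submissions:
        for digest in submission:
            holders[digest] += 1

    total_files = total_bytes = shared_files = shared_bytes = 0
    for submission in submissions:
        for digest, size in submission.items():
            total_files += 1
            total_bytes += size
            if holders[digest] > 1:
                shared_files += 1
                shared_bytes += size

    file_fraction = shared_files / total_files if total_files else 0.0
    byte_fraction = shared_bytes / total_bytes if total_bytes else 0.0
    return file_fraction, byte_fraction
```

Because each participant's dict is keyed by checksum, duplicates within a
single machine are counted once here; on-disk redundancy is a separate number.
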
> Each submission (via email, anonymous email, tor or direct transfer) could
> have a random unique ID, a list of checksums, counts, summary info and
> filesizes.  The summary info would just be a histogram of filesizes, total
> files, total storage, files that changed in the last week/day, size that
> changed in the last week/day, and size of the rsync difference in the last
> week/day.
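
One possible shape for such a submission, sketched as a JSON record.  The
field names, the use of JSON, and uuid4 for the random ID are illustrative
assumptions; the rsync difference is left as an optional value because
computing it needs an earlier snapshot, which this sketch does not model.

```python
import json
import uuid


def build_submission(checksums, histogram, totals, changed):
    """Assemble one anonymous submission as a JSON-serializable dict.

    checksums: iterable of (checksum, size_in_bytes) pairs
    histogram: dict mapping size bucket -> file count
    totals:    {"files": int, "bytes": int}
    changed:   {"files": {"day": int, "week": int},
                "bytes": {"day": int, "week": int},
                "rsync_delta": optional {"day": int, "week": int}}
    """
    return {
        # Random ID: not derived from hostname, user, or MAC address.
        "id": uuid.uuid4().hex,
        "checksums": [[digest, size] for digest, size in checksums],
        "summary": {
            # JSON object keys must be strings, so convert bucket numbers.
            "size_histogram": {
                str(bucket): count
                for bucket, count in sorted(histogram.items())
            },
            "total_files": totals["files"],
            "total_bytes": totals["bytes"],
            "files_changed": changed["files"],
            "bytes_changed": changed["bytes"],
            "rsync_delta_bytes": changed.get("rsync_delta"),
        },
    }


def encode_submission(record):
    """Serialize a record for upload by whatever transport is chosen."""
    return json.dumps(record, sort_keys=True)
```

The encoded string could then go out by any of the transports mentioned above
(email, anonymous email, tor or direct transfer).
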
> Of course the most important question... is it worth it?  Should I start
> coding it?  Comments?