[flud-devel] flud small file performance

Alen Peacock alenlpeacock at gmail.com
Mon Mar 6 20:03:43 PST 2006


More notes to myself.  As these were written for an audience of me,
there might not be sufficient context for grokking everything.  Please
feel free to ask why certain decisions were or were not made (even if
it ends up being months from today).

------------------
Jan 12
I did some initial performance tests yesterday, using a file size mix
that mimicked my home directory.  My home directory files fell into
the following buckets (in byte sizes):

[1 - 10] = .62%  (14)
[10 - 100] = 3.79%  (85)
[100 - 1000] = 13.97%  (313)
[1000 - 10000] = 51.20%  (1147)
[10000 - 100000] = 24.82%  (556)
[100000 - 1000000] = 4.24%  (95)
[1000000 - 10000000] = 1.33%  (30)
[10000000 - 100000000] = 0%  (0)
[100000000 - 1000000000] = 0%  (0)
[1000000000 - 10000000000] = 0%  (0)
[10000000000 - 100000000000] = 0%  (0)

The mix I came up with, composed of files from the above study, was:

[1 - 10] = 1.96%  (1)
[10 - 100] = 3.92%  (2)
[100 - 1000] = 13.72%  (7)
[1000 - 10000] = 50.98%  (26)
[10000 - 100000] = 23.52%  (12)
[100000 - 1000000] = 3.92%  (2)
[1000000 - 10000000] = 1.96%  (1)
[10000000 - 100000000] = 0%  (0)
[100000000 - 1000000000] = 0%  (0)
[1000000000 - 10000000000] = 0%  (0)
[10000000000 - 100000000000] = 0%  (0)

The complete size of the sample was 3217.17K ('du' showed 3.6M).
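
For reference, a bucket list like the above can be produced with something
like the following -- a minimal sketch, not the actual script used; it just
walks a tree and bins file sizes by decade:

import math
import os
from collections import Counter

def bucket_sizes(root):
    # walk the tree and count files per decade bucket: [1-10], [10-100], ...
    buckets = Counter()
    total = 0
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue            # broken symlink, permission problem, etc.
            if size < 1:
                continue            # zero-byte files don't fit a decade bucket
            buckets[int(math.log10(size))] += 1
            total += 1
    for exp in sorted(buckets):
        count = buckets[exp]
        print("[%d - %d] = %.2f%%  (%d)" %
              (10 ** exp, 10 ** (exp + 1), 100.0 * count / total, count))

bucket_sizes(os.path.expanduser('~'))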

To store the above 51 files onto a mix of 50 flud nodes (all local to
the same box), it took about 406 seconds, for a transfer rate of just
under 65kbps.  This is unacceptable for all but dialup users, who have
insufficient bandwidth to perform this type of backup in the first
place.

Will do some more testing to identify the inefficiencies, but am
confident that most arise from smaller files.  By 'smaller files', I
think I am necessarily including the 51% bucket, but will identify the
cutoff with testing.

The fix for this is to aggregate small file transfers.  The simplest
way to do this is to tar all small files together.  Another
alternative is to pipeline all the small file data (from different
small files) going to the same host and push it all at once (perhaps
by appending to a tarfile instead of sending immediately, then sending
the whole bundle with a special STORETAR op, where the recipient
unbundles and stores the file blocks normally).  The latter is
conceptually more attractive, but from a practical standpoint the
former seems more straightforward.

In both cases, the solution complicates a few things.  For one, the
single-file-as-the-unit-of-granularity model is disrupted; store,
update, and verify operations may need to operate over multiple files
(this may not be true for option #2).  Single-instance store advantages
go away for small files (they were perhaps non-existent in the first
place, so this is a wash).  Complexity is added by needing two
separate paths of execution in the code for small files vs. large
files.
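
As a rough illustration of the first (simpler) option, something like the
following would sweep everything under a cutoff into one tarball before
storing -- a sketch only; the 1M cutoff and the helper name
aggregate_small_files are illustrative, not part of flud:

import os
import tarfile

SMALL_FILE_CUTOFF = 1024 * 1024   # 1M; the real cutoff still needs measuring

def aggregate_small_files(paths, bundle_path):
    # tar every file below the cutoff into bundle_path; return the large
    # files that should still be stored individually
    large = []
    bundle = tarfile.open(bundle_path, 'w')
    for path in paths:
        if os.path.getsize(path) < SMALL_FILE_CUTOFF:
            bundle.add(path)
        else:
            large.append(path)
    bundle.close()
    return large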

-------

Just did a quick test with the single file in the 1M-10M range (1.2M)
on the same 50 node/same box setup.  It stored in 7 seconds, for a
transfer rate of 1.4mbps -- a much more reasonable performance.

Here's a table with other performance points:

filesize     time    rate
 60M          87s     5.8mbps
 25M          30s     7.0mbps
 7.7M         16s     4.0mbps
 5.1M         12s     3.6mbps
 1.2M         7s      1.4mbps
 844K         7s      987kbps
 316K         7s      370kbps
 28K          7s      33kbps

So it looks like there is a static overhead of at least 7s for small
files, and that if we are aiming at a throughput of at least 1mbps,
files smaller than 1M need to be aggregated.  I need to figure out the
implications of, say, keeping a tarball of all files <1M lying around
-- how big will that file be, and how many files will usually fit
inside it?
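
Back-of-the-envelope for where that cutoff comes from, assuming the ~7s
floor from the table above and a 1mbps target:

# with a fixed ~7s cost per store, a file has to carry at least 7 seconds'
# worth of data at the target rate before it can sustain that rate
target_rate = 1000000       # 1 mbps, in bits per second
fixed_overhead = 7          # seconds, the observed floor from the table above
min_size = target_rate * fixed_overhead / 8
print("files below ~%dK can't hit 1mbps" % (min_size / 1024))
# prints: files below ~854K can't hit 1mbps -- i.e., roughly the 1M cutoff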

It also looks like throughput tops out somewhere between 25M and 60M,
and then starts to decline.  We already know that the current coder
impl doesn't handle arbitrarily large files (it must currently hold the
entire file in memory -- this will change with the next LDPC release),
but the declining performance warrants further investigation -- does it
continue to decline indefinitely?  If not, what is the floor
throughput for large files?  What is causing the decline (the coder /
memory allocation?  file throughput of local segments?)?  Where is the
peak?

---

Two more data points (35M and 10M), merged into the table from above:

filesize     time    rate
 60M          87s     5.8mbps
 35M          47s     6.2mbps
 25M          30s     7.0mbps
 10M          18s     4.7mbps
 7.7M         16s     4.0mbps
 5.1M         12s     3.6mbps
 1.2M         7s      1.4mbps
 844K         7s      987kbps
 316K         7s      370kbps
 28K          7s      33kbps



----------------------
Jan 13
The more I think about this problem, the more attractive the second
approach (combining multiple requests to the same host into a single
'STORETAR' request) becomes.  The name of the tarball on the source is
simply the nodeID of the destination, and instead of sending a STORE
request, the file is just appended to this tarball.  The rest of the
file store operation proceeds as normal (independent verification of
file metadata at the DHT nodes is delayed, but that verification isn't
implemented yet anyway).  At some point (when the optimal size is
reached or a certain time period has expired), the tarfile is sent.
The target of this tarfile could untar it and store the segments
individually, but it is better to save it as is, and to search within
it when VERIFY or RETRIEVE requests are received.  The target can
rename it after the nodeID of the sender, which also allows the target
to answer VERIFY/RETRIEVE only for requests originating from that
sender.  If a tarball by that nodeID already exists, the files inside
the new tarball are just appended to the old one.
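
A minimal sketch of what that sender-side aggregation could look like
(function names, thresholds, and the on-disk bundle location here are
illustrative, not flud's actual API):

import os
import time
import tarfile

BUNDLE_DIR = '/tmp/flud-bundles'   # illustrative location for pending bundles
FLUSH_BYTES = 1024 * 1024          # ship the bundle once it reaches ~1M...
FLUSH_SECONDS = 300                # ...or once it has waited 5 minutes

_bundle_birth = {}                 # bundle path -> time of first append

def send_storetar(dest_nodeid, bundle_path):
    # stand-in for the real network op; just reports what would be sent
    print("would send %s (%d bytes) to %s" %
          (bundle_path, os.path.getsize(bundle_path), dest_nodeid))

def queue_store(dest_nodeid, block_path):
    # instead of STOREing the small block immediately, append it to a
    # tarball named after the destination nodeID
    os.makedirs(BUNDLE_DIR, exist_ok=True)
    bundle_path = os.path.join(BUNDLE_DIR, dest_nodeid + '.tar')
    mode = 'a' if os.path.exists(bundle_path) else 'w'
    bundle = tarfile.open(bundle_path, mode)
    bundle.add(block_path, arcname=os.path.basename(block_path))
    bundle.close()
    _bundle_birth.setdefault(bundle_path, time.time())
    # flush on size or age, whichever trips first
    too_big = os.path.getsize(bundle_path) >= FLUSH_BYTES
    too_old = time.time() - _bundle_birth[bundle_path] >= FLUSH_SECONDS
    if too_big or too_old:
        send_storetar(dest_nodeid, bundle_path)
        os.remove(bundle_path)
        del _bundle_birth[bundle_path]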

This also does away with the small file support that is already in
the code.  As a natural consequence of implementing the above, small
file STORE requests will become virtually non-existent (we can still
get small tarballs if the source doesn't have enough small files to
make a large one, or if the time-out on sending the STORETAR occurs --
but the receiver can still aggregate these later to introduce better
efficiencies).



VERIFY can still work as normal for these small files stored as tars:
the source sends a challenge for a particular block of data at a
particular hash ID.  The recipient of this request looks for the file
by filename first, then searches its senders_nodeID.tar file for the
hashID, untars that one file to a tmp directory if found, performs the
response with the challenge, sends it back, and deletes the one
untarred file from the tmp directory.

 Same story for RETRIEVE.
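
A sketch of that lookup path for VERIFY (the helper names and the use of
sha256 for the challenge response are illustrative; this is not flud's
actual implementation):

import hashlib
import os
import shutil
import tarfile
import tempfile

def verify(store_dir, sender_nodeid, block_id, offset, length, challenge):
    # answer a VERIFY challenge: hash of (challenge + the requested byte range)
    plain = os.path.join(store_dir, block_id)
    if os.path.exists(plain):
        return respond(plain, offset, length, challenge)
    # no plain file -- fall back to the tarball named after the sender
    tarball = os.path.join(store_dir, sender_nodeid + '.tar')
    if not os.path.exists(tarball):
        return None
    tar = tarfile.open(tarball, 'r')
    try:
        try:
            member = tar.getmember(block_id)
        except KeyError:
            return None                  # block not in this sender's tarball
        tmpdir = tempfile.mkdtemp()
        try:
            tar.extract(member, path=tmpdir)
            return respond(os.path.join(tmpdir, member.name), offset, length,
                           challenge)
        finally:
            shutil.rmtree(tmpdir)        # delete the temporarily untarred file
    finally:
        tar.close()

def respond(path, offset, length, challenge):
    f = open(path, 'rb')
    f.seek(offset)
    data = f.read(length)
    f.close()
    return hashlib.sha256(challenge + data).hexdigest()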

-------------
Mar 6, 2006:
We are now averaging 256 - 512kbps transfer rates on a single host
running 20 - 50 nodes, across even smaller file sizes.  The bandwidth
appears to be CPU-limited under these conditions, as all
challenge/response and other cryptographic functions have to share a
single CPU.  All disk I/O is also going to one drive.  I need to rerun
the above experiments to provide more concrete numbers, but
implementing the small file aggregations as tarballs helped
tremendously, without sacrificing the single-file-as-smallest-unit-of-
granularity semantics.

As another note, STORETAR was never implemented separately.  Instead,
STORE determines if it received a tar file or not and just does the
right thing.  Same with VERIFY and RETRIEVE.
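
Roughly, the receiving dispatch could look like this (a sketch with
illustrative paths and helpers, not the actual flud handler):

import os
import shutil
import tarfile

STORE_DIR = '/tmp/flud-store'   # illustrative

def handle_store(sender_nodeid, payload_path, block_id):
    os.makedirs(STORE_DIR, exist_ok=True)
    if tarfile.is_tarfile(payload_path):
        # an aggregate of small blocks: keep it whole under the sender's
        # nodeID, appending to an existing per-sender tarball if one is there
        dest = os.path.join(STORE_DIR, sender_nodeid + '.tar')
        if os.path.exists(dest):
            old = tarfile.open(dest, 'a')
            new = tarfile.open(payload_path, 'r')
            for member in new.getmembers():
                if member.isfile():
                    old.addfile(member, new.extractfile(member))
            new.close()
            old.close()
            os.remove(payload_path)
        else:
            shutil.move(payload_path, dest)
    else:
        # an ordinary single block: store it under its hash ID as before
        shutil.move(payload_path, os.path.join(STORE_DIR, block_id))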



