
Add file count #31

Closed
wants to merge 8 commits into from

Conversation

@TimOrme commented Jul 23, 2019

This is a rudimentary implementation that simply outputs the total number of files found alongside the size. The count is naive in that it includes directories as well. We leverage diskus for reporting, and this metric is helpful for us.

Haven't submitted many PRs before, so let me know if you need other things added.

@TimOrme (Author) commented Jul 23, 2019

I should also note that this might be a breaking change in the sense that the output has changed, and any clients parsing that output will likely break. Maybe it's better to add this as a flag to maintain compatibility? I wasn't sure if the API had been marked as stable yet.

@sharkdp (Owner) commented Sep 15, 2019

Thank you very much for your contribution (sorry for the late reply)!

I'd like to keep diskus simple, but this seems like a reasonable addition. Before we merge this, we should

  • distinguish between files and directories and output both
  • make sure that there is no significant performance overhead (see the README for information on how to run benchmarks)
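A minimal sequential sketch of such a split count, assuming a recursive walk with `std::fs::read_dir` (this is an illustration only, not the PR's implementation; diskus's real walker is parallel):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively count (files, directories) below `path`.
/// Entries are classified via DirEntry::file_type(), which does not
/// follow symlinks; symlinks therefore land in the file count.
fn count_entries(path: &Path) -> io::Result<(u64, u64)> {
    let mut files = 0;
    let mut dirs = 0;
    for entry in fs::read_dir(path)? {
        let entry = entry?;
        if entry.file_type()?.is_dir() {
            dirs += 1;
            let (f, d) = count_entries(&entry.path())?;
            files += f;
            dirs += d;
        } else {
            files += 1;
        }
    }
    Ok((files, dirs))
}

fn main() -> io::Result<()> {
    // Demo on a small throwaway tree in the system temp directory.
    let root = std::env::temp_dir().join("diskus_count_demo");
    let _ = fs::remove_dir_all(&root);
    fs::create_dir_all(root.join("sub"))?;
    fs::write(root.join("a.txt"), b"x")?;
    fs::write(root.join("sub").join("b.txt"), b"y")?;
    let (files, dirs) = count_entries(&root)?;
    println!("{} files, {} directories", files, dirs); // prints "2 files, 1 directories"
    fs::remove_dir_all(&root)
}
```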

> Also should note that this might be a breaking change in the sense that the output has changed, and any clients parsing using that output will likely break.

Thank you for mentioning it. I wouldn't be worried about breaking changes in this respect. But now that you mention it, it might make sense to add something like a --plain option which would provide an easily parseable output (probably just the number of bytes). I'll create a new ticket.
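A --plain mode of the kind described could be as simple as branching on the flag when formatting the result. The sketch below is purely hypothetical (`format_size` and the human-readable form are illustrative names, not diskus code):

```rust
/// Hypothetical formatter for a "--plain" mode: machine-readable
/// output is just the raw byte count, nothing else.
fn format_size(size_in_bytes: u64, plain: bool) -> String {
    if plain {
        size_in_bytes.to_string()
    } else {
        // Illustrative human-oriented form; the real diskus output
        // includes a human-friendly size alongside the exact bytes.
        format!("{} B ({} bytes)", size_in_bytes, size_in_bytes)
    }
}

fn main() {
    println!("{}", format_size(1234, true));  // prints "1234"
    println!("{}", format_size(1234, false)); // prints "1234 B (1234 bytes)"
}
```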

@TimOrme (Author) commented Oct 9, 2019

I've added the change to split files and directories. I'll work on running the hyperfine tests as well. You don't happen to have a script to generate that test data, do you?

Also, it looks like there's been some restructuring, so I'll resolve the conflicts here in short order. Let me know if you see any issues with the implementation logic, though.

@sharkdp (Owner) commented Oct 9, 2019

> You don't happen to have a script to generate that test data, do you?

Unfortunately, no :-(

@TimOrme (Author) commented Oct 9, 2019

No worries, I'll write up a bash script for it as well. Thanks!

@TimOrme (Author) commented Oct 9, 2019

So I'm not exactly sure how to validate these results; sn seems much faster than what is listed in the README.

Warm Cache:

hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' '~/workspace/diskus/target/release/diskus' 'du -sh' 'sn p -d0 -j8' 'dust -d0'
Benchmark #1: ~/workspace/diskus/target/release/diskus
  Time (mean ± σ):     286.1 ms ±  56.0 ms    [User: 195.6 ms, System: 418.6 ms]
  Range (min … max):   226.0 ms … 365.3 ms    10 runs
 
Benchmark #2: du -sh
  Time (mean ± σ):      1.854 s ±  0.095 s    [User: 128.1 ms, System: 540.7 ms]
  Range (min … max):    1.661 s …  2.001 s    10 runs
 
Benchmark #3: sn p -d0 -j8
  Time (mean ± σ):     292.2 ms ±  34.0 ms    [User: 74.2 ms, System: 340.6 ms]
  Range (min … max):   250.5 ms … 355.2 ms    10 runs
 
Benchmark #4: dust -d0
  Time (mean ± σ):      2.006 s ±  0.093 s    [User: 267.7 ms, System: 537.7 ms]
  Range (min … max):    1.927 s …  2.200 s    10 runs
 
Summary
  '~/workspace/diskus/target/release/diskus' ran
    1.02 ± 0.23 times faster than 'sn p -d0 -j8'
    6.48 ± 1.31 times faster than 'du -sh'
    7.01 ± 1.41 times faster than 'dust -d0'

Cold Cache:

hyperfine --warmup 5 '~/workspace/diskus/target/release/diskus' 'du -sh' 'sn p -d0 -j8' 'dust -d0'
Benchmark #1: ~/workspace/diskus/target/release/diskus
  Time (mean ± σ):      65.1 ms ±   7.4 ms    [User: 81.1 ms, System: 138.9 ms]
  Range (min … max):    53.4 ms …  87.8 ms    44 runs
 
Benchmark #2: du -sh
  Time (mean ± σ):     129.6 ms ±   3.7 ms    [User: 39.9 ms, System: 88.6 ms]
  Range (min … max):   127.2 ms … 145.1 ms    22 runs
 
Benchmark #3: sn p -d0 -j8
  Time (mean ± σ):      42.1 ms ±   2.2 ms    [User: 42.9 ms, System: 108.0 ms]
  Range (min … max):    39.7 ms …  52.5 ms    67 runs
 
Benchmark #4: dust -d0
  Time (mean ± σ):     249.8 ms ±  52.6 ms    [User: 131.6 ms, System: 114.5 ms]
  Range (min … max):   176.4 ms … 348.3 ms    16 runs
 
Summary
  'sn p -d0 -j8' ran
    1.55 ± 0.19 times faster than '~/workspace/diskus/target/release/diskus'
    3.08 ± 0.18 times faster than 'du -sh'
    5.94 ± 1.29 times faster than 'dust -d0'

However, comparisons to the version installed via cargo look comparable:

hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' '~/workspace/diskus/target/release/diskus' 'diskus'
Benchmark #1: ~/workspace/diskus/target/release/diskus
  Time (mean ± σ):     255.0 ms ±  17.1 ms    [User: 164.1 ms, System: 375.7 ms]
  Range (min … max):   232.3 ms … 279.5 ms    10 runs
 
Benchmark #2: diskus
  Time (mean ± σ):     271.3 ms ±  26.4 ms    [User: 178.7 ms, System: 413.6 ms]
  Range (min … max):   240.4 ms … 328.9 ms    10 runs
 
Summary
  '~/workspace/diskus/target/release/diskus' ran
    1.06 ± 0.13 times faster than 'diskus'

And

hyperfine --warmup 5 '~/workspace/diskus/target/release/diskus' 'diskus'
Benchmark #1: ~/workspace/diskus/target/release/diskus
  Time (mean ± σ):      65.6 ms ±   7.6 ms    [User: 84.7 ms, System: 137.3 ms]
  Range (min … max):    55.3 ms …  81.2 ms    36 runs
 
Benchmark #2: diskus
  Time (mean ± σ):      68.3 ms ±   8.1 ms    [User: 83.0 ms, System: 149.0 ms]
  Range (min … max):    57.0 ms …  93.1 ms    43 runs
 
Summary
  '~/workspace/diskus/target/release/diskus' ran
    1.04 ± 0.17 times faster than 'diskus'

@TimOrme (Author) commented Oct 9, 2019

Also, for posterity, here's the script I used to generate the test directory:

#!/bin/bash -e

# Generates a rather naive testing directory.
# Aims for 40,400 files totalling ~15GB, spread over
# 10,100 directories (plus the base directory itself).

BASE_DIR=$1

# 15GB / 40,400 files ~= 370KB per file
FILE_SIZE_IN_KB=370

for i in {1..100}; do
    mkdir "$BASE_DIR/$i"

    for f in {1..4}; do
        head -c ${FILE_SIZE_IN_KB}K </dev/urandom > "$BASE_DIR/$i/$f.txt"
    done

    for j in {1..100}; do
        mkdir "$BASE_DIR/$i/$j"

        for f in {1..4}; do
            head -c ${FILE_SIZE_IN_KB}K </dev/urandom > "$BASE_DIR/$i/$j/$f.txt"
        done
    done
done

@sharkdp (Owner) commented Oct 9, 2019

Which version of sn did you use? I have

> sn -V
The Tin Summer 1.16.4

I'm pretty sure that's the version I used for the benchmark (which I updated recently). I should have documented the exact versions.

It might also be your disk / machine or the test directory. I can try to run the benchmark again with your test directory.

@TimOrme (Author) commented Oct 9, 2019

Ah I have a newer version:

bash$ sn -V
The Tin Summer 1.21.8

I'm not actually able to install 1.16.4 due to some feature restrictions in my Rust version. If you're able to run tests on your end, that'd be great.

@TimOrme (Author) commented Oct 10, 2019

I realized there were binary releases of it. Even with the old version, I still see better performance from sn than what the README lists. Maybe my test folders are different, or something else differs at the system level?

/home/torme/extracts/sn --version
The Tin Summer 1.16.4
hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' '~/workspace/diskus/target/release/diskus' 'du -sh' '/home/torme/extracts/sn p -d0 -j8' 'dust -d0'
Benchmark #1: ~/workspace/diskus/target/release/diskus
  Time (mean ± σ):     252.6 ms ±  11.3 ms    [User: 155.4 ms, System: 373.6 ms]
  Range (min … max):   237.0 ms … 272.0 ms    10 runs
 
Benchmark #2: du -sh
  Time (mean ± σ):      1.814 s ±  0.176 s    [User: 116.3 ms, System: 478.4 ms]
  Range (min … max):    1.605 s …  2.065 s    10 runs
 
Benchmark #3: /home/torme/extracts/sn p -d0 -j8
  Time (mean ± σ):     332.9 ms ±  41.9 ms    [User: 98.3 ms, System: 342.0 ms]
  Range (min … max):   303.2 ms … 442.4 ms    10 runs
 
Benchmark #4: dust -d0
  Time (mean ± σ):      1.736 s ±  0.101 s    [User: 191.5 ms, System: 434.2 ms]
  Range (min … max):    1.587 s …  1.915 s    10 runs
 
Summary
  '~/workspace/diskus/target/release/diskus' ran
    1.32 ± 0.18 times faster than '/home/torme/extracts/sn p -d0 -j8'
    6.87 ± 0.51 times faster than 'dust -d0'
    7.18 ± 0.77 times faster than 'du -sh'
hyperfine --warmup 5 '~/workspace/diskus/target/release/diskus' 'du -sh' '/home/torme/extracts/sn p -d0 -j8' 'dust -d0'
Benchmark #1: ~/workspace/diskus/target/release/diskus
  Time (mean ± σ):      65.1 ms ±   6.3 ms    [User: 80.7 ms, System: 141.9 ms]
  Range (min … max):    54.2 ms …  77.7 ms    38 runs
 
Benchmark #2: du -sh
  Time (mean ± σ):     128.0 ms ±   2.8 ms    [User: 45.1 ms, System: 82.0 ms]
  Range (min … max):   125.6 ms … 136.8 ms    23 runs
 
Benchmark #3: /home/torme/extracts/sn p -d0 -j8
  Time (mean ± σ):      51.5 ms ±   5.3 ms    [User: 70.3 ms, System: 115.0 ms]
  Range (min … max):    45.5 ms …  67.2 ms    53 runs
 
Benchmark #4: dust -d0
  Time (mean ± σ):     175.2 ms ±   3.9 ms    [User: 85.0 ms, System: 89.0 ms]
  Range (min … max):   171.6 ms … 187.1 ms    17 runs
 
Summary
  '/home/torme/extracts/sn p -d0 -j8' ran
    1.26 ± 0.18 times faster than '~/workspace/diskus/target/release/diskus'
    2.48 ± 0.26 times faster than 'du -sh'
    3.40 ± 0.36 times faster than 'dust -d0'

@sharkdp (Owner) commented Oct 19, 2019

Today, I had a chance to look into this. There are many things to consider:

  • I installed the latest version of sn. There doesn't seem to be a big difference. I can still reproduce the benchmark from the README on my Dropbox folder (which seems to me like a good real-world use case, but not a very reproducible one, of course).
  • I always used 8 threads for sn because that was the fastest option in my initial benchmarks. However, in other scenarios, other thread counts could work even better. If we optimize this parameter, that's not completely fair, because so far we do not optimize the -j/--threads parameter for diskus. It just uses its default heuristic, which makes a trade-off between good cold-cache and good warm-cache performance.
  • I used your script to create a 15GB test folder and I can - kind of - reproduce your results (I think you mislabeled warm/cold cache):

cold cache:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| diskus | 274.4 ± 8.7 | 263.6 | 292.4 | 1.00 |
| sn p -d0 -j8 | 425.8 ± 3.4 | 420.9 | 431.7 | 1.55 ± 0.05 |

warm cache:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| diskus | 55.2 ± 11.0 | 37.3 | 96.1 | 2.43 ± 0.49 |
| sn p -d0 -j8 | 22.7 ± 0.5 | 21.6 | 24.8 | 1.00 |

If I optimize the number of threads for diskus, I get closer to the performance of sn in the warm-cache benchmark:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| diskus -j7 | 30.8 ± 0.8 | 29.7 | 33.9 | 1.00 |

  • However, it's not quite clear to me why sn is faster in this particular case. Note that I discovered two bugs in sn while playing with it, so the two programs are probably not 100% comparable (see vmchale/tin-summer#24, "tin-summer does not count the size of directories", and vmchale/tin-summer#25, "tin-summer counts hardlinks twice").
  • The performance of both programs is vastly different from that on my Dropbox folder: both programs are much faster on the test folder. The biggest difference I can see is the nesting depth. My Dropbox folder has a depth of ~10 or more, whereas the test folder has a max. depth of 3. But that's just a guess.
  • diskus seems to suffer much more from outside influence, for some reason. Notice the increased standard deviation and the high max runtimes.

In any case, the thing we are really interested in is comparing diskus-master against this feature branch. On the test directory, I get:

cold:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| /tmp/diskus-master | 268.6 ± 15.4 | 250.8 | 299.2 | 1.00 |
| /tmp/diskus-feature | 270.3 ± 10.8 | 257.5 | 295.1 | 1.01 ± 0.07 |

warm:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| /tmp/diskus-master | 51.4 ± 10.1 | 33.3 | 73.2 | 1.00 |
| /tmp/diskus-feature | 52.3 ± 10.7 | 35.0 | 71.6 | 1.02 ± 0.29 |

For the Dropbox folder, I get:

cold:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| /tmp/diskus-master | 1.741 ± 0.020 | 1.713 | 1.777 | 1.01 ± 0.01 |
| /tmp/diskus-feature | 1.728 ± 0.014 | 1.705 | 1.754 | 1.00 |

warm:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| /tmp/diskus-master | 323.6 ± 12.8 | 307.7 | 344.3 | 1.00 |
| /tmp/diskus-feature | 324.9 ± 19.3 | 305.2 | 365.3 | 1.00 ± 0.07 |

In conclusion: there is no statistically significant difference between the two, which is great! In fact, we can even run the recently added t-test script from the hyperfine repo, which tells us in both cases:

The two benchmarks are almost the same (p >= 0.05).

Tim Orme added 2 commits October 21, 2019 15:05
This adds better clarity to the return type instead of returning
a tuple of three unidentified u64 values.
@TimOrme (Author) commented Apr 17, 2020

Digging through my old PRs: are you just waiting on some additional testing for this?

@sharkdp (Owner) commented Apr 17, 2020

> are you just waiting on some additional testing for this?

Yes. I would really like this to be tested a bit more thoroughly before we merge it. If I execute diskus in an empty directory, it currently outputs:

❯ diskus
0 B (0 bytes, 0 files, 1 directories)

It probably counts the current directory (.), but I think that's rather unexpected, especially since the size is reported as 0 B.

Minor: if there is exactly one directory, it would be great if it printed "1 directory". Same for files. (Come to think of it, the same probably applies to the bytes field.)
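A tiny helper along these lines could pick the right noun form. The name `pluralize` and its signature are hypothetical, not part of this PR:

```rust
/// Format a count with its singular or plural noun, e.g. "1 directory"
/// vs. "2 directories". Hypothetical helper, not diskus code.
fn pluralize(count: u64, singular: &str, plural: &str) -> String {
    format!("{} {}", count, if count == 1 { singular } else { plural })
}

fn main() {
    println!("{}", pluralize(1, "directory", "directories")); // prints "1 directory"
    println!("{}", pluralize(3, "file", "files"));            // prints "3 files"
}
```

Taking both forms explicitly sidesteps irregular English plurals like "directory"/"directories", which a naive "append s" approach would get wrong.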

Next, symlinks are completely ignored: they are reported as neither files nor directories. That's technically correct, but probably unexpected as well. The same is true for other file types like sockets or pipes. I think we should probably count everything which is not a directory as a "file-like entry". In the output, we could probably still call it a "file".
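One way to implement that classification is via `std::fs::symlink_metadata`, which inspects the entry itself rather than following symlinks, so symlinks, sockets, and pipes all fall into the non-directory bucket (`is_file_like` is a hypothetical name, not part of the PR):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Treat everything that is not a directory as a "file-like entry".
/// symlink_metadata() does not follow symlinks, so even a dangling
/// symlink classifies cleanly as file-like.
fn is_file_like(path: &Path) -> io::Result<bool> {
    Ok(!fs::symlink_metadata(path)?.file_type().is_dir())
}

fn main() -> io::Result<()> {
    // Demo on a throwaway directory in the system temp dir.
    let dir = std::env::temp_dir().join("diskus_filelike_demo");
    let _ = fs::remove_dir_all(&dir);
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("plain.txt"), b"x")?;
    #[cfg(unix)]
    std::os::unix::fs::symlink(dir.join("plain.txt"), dir.join("link"))?;

    println!("{}", is_file_like(&dir.join("plain.txt"))?); // true
    println!("{}", is_file_like(&dir)?);                   // false
    #[cfg(unix)]
    println!("{}", is_file_like(&dir.join("link"))?);      // true

    fs::remove_dir_all(&dir)
}
```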

@sharkdp closed this Jul 26, 2020