
Add file count #31

Closed
wants to merge 8 commits into from

Conversation

@TimOrme commented Jul 23, 2019

This is a rudimentary implementation that simply outputs the total number of files found alongside the size. The count is naive in that it includes directories as well. We leverage diskus for reporting, and this metric is helpful for us.

Haven't submitted many PRs before, so let me know if you need other things added.

@TimOrme (Author) commented Jul 23, 2019

I should also note that this might be a breaking change in the sense that the output has changed, and any clients parsing that output will likely break. Maybe it's better to add this as a flag to maintain compatibility? I wasn't sure if the API had been marked as stable yet.

@sharkdp (Owner) commented Sep 15, 2019

Thank you very much for your contribution (sorry for the late reply)!

I'd like to keep diskus simple, but this seems like a reasonable addition. Before we merge this, we should

  • distinguish between files and directories and output both
  • make sure that there is no significant performance overhead (see the README for information on how to run benchmarks)
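A minimal sequential sketch of such a split count, assuming a recursive walk with `std::fs::read_dir` (this is an illustration only, not the PR's implementation; diskus's real walker is parallel):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively count (files, directories) below `path`.
/// Entries are classified via DirEntry::file_type(), which does not
/// follow symlinks; symlinks therefore land in the file count.
fn count_entries(path: &Path) -> io::Result<(u64, u64)> {
    let mut files = 0;
    let mut dirs = 0;
    for entry in fs::read_dir(path)? {
        let entry = entry?;
        if entry.file_type()?.is_dir() {
            dirs += 1;
            let (f, d) = count_entries(&entry.path())?;
            files += f;
            dirs += d;
        } else {
            files += 1;
        }
    }
    Ok((files, dirs))
}

fn main() -> io::Result<()> {
    // Demo on a small throwaway tree in the system temp directory.
    let root = std::env::temp_dir().join("diskus_count_demo");
    let _ = fs::remove_dir_all(&root);
    fs::create_dir_all(root.join("sub"))?;
    fs::write(root.join("a.txt"), b"x")?;
    fs::write(root.join("sub").join("b.txt"), b"y")?;
    let (files, dirs) = count_entries(&root)?;
    println!("{} files, {} directories", files, dirs); // prints "2 files, 1 directories"
    fs::remove_dir_all(&root)
}
```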

> Also should note that this might be a breaking change in the sense that the output has changed, and any clients parsing using that output will likely break.

Thank you for mentioning it. I wouldn't be worried about breaking changes in this respect. But now that you mention it, it might make sense to add something like a --plain option which would provide an easily parseable output (probably just the number of bytes). I'll create a new ticket.
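A --plain mode of the kind described could be as simple as branching on the flag when formatting the result. The sketch below is purely hypothetical (`format_size` and the human-readable form are illustrative names, not diskus code):

```rust
/// Hypothetical formatter for a "--plain" mode: machine-readable
/// output is just the raw byte count, nothing else.
fn format_size(size_in_bytes: u64, plain: bool) -> String {
    if plain {
        size_in_bytes.to_string()
    } else {
        // Illustrative human-oriented form; the real diskus output
        // includes a human-friendly size alongside the exact bytes.
        format!("{} B ({} bytes)", size_in_bytes, size_in_bytes)
    }
}

fn main() {
    println!("{}", format_size(1234, true));  // prints "1234"
    println!("{}", format_size(1234, false)); // prints "1234 B (1234 bytes)"
}
```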

@TimOrme (Author) commented Oct 9, 2019

I've added the change to split files and directories. I'll work on running the hyperfine tests as well. You don't happen to have a script to generate that test data, do you?

Also, it looks like there's been some restructuring, so I'll resolve the conflicts here in short order. Let me know if you see any issues with the implementation logic, though.

@sharkdp (Owner) commented Oct 9, 2019

> You don't happen to have a script to generate that test data, do you?

Unfortunately, no :-(

@TimOrme (Author) commented Oct 9, 2019

No worries, I'll write up a bash script for it as well. Thanks!

@TimOrme (Author) commented Oct 9, 2019

So I'm not exactly sure how to validate these results; sn seems much faster than what is listed in the README.

Warm Cache:

hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' '~/workspace/diskus/target/release/diskus' 'du -sh' 'sn p -d0 -j8' 'dust -d0'
Benchmark #1: ~/workspace/diskus/target/release/diskus
  Time (mean ± σ):     286.1 ms ±  56.0 ms    [User: 195.6 ms, System: 418.6 ms]
  Range (min … max):   226.0 ms … 365.3 ms    10 runs
 
Benchmark #2: du -sh
  Time (mean ± σ):      1.854 s ±  0.095 s    [User: 128.1 ms, System: 540.7 ms]
  Range (min … max):    1.661 s …  2.001 s    10 runs
 
Benchmark #3: sn p -d0 -j8
  Time (mean ± σ):     292.2 ms ±  34.0 ms    [User: 74.2 ms, System: 340.6 ms]
  Range (min … max):   250.5 ms … 355.2 ms    10 runs
 
Benchmark #4: dust -d0
  Time (mean ± σ):      2.006 s ±  0.093 s    [User: 267.7 ms, System: 537.7 ms]
  Range (min … max):    1.927 s …  2.200 s    10 runs
 
Summary
  '~/workspace/diskus/target/release/diskus' ran
    1.02 ± 0.23 times faster than 'sn p -d0 -j8'
    6.48 ± 1.31 times faster than 'du -sh'
    7.01 ± 1.41 times faster than 'dust -d0'

Cold Cache:

hyperfine --warmup 5 '~/workspace/diskus/target/release/diskus' 'du -sh' 'sn p -d0 -j8' 'dust -d0'
Benchmark #1: ~/workspace/diskus/target/release/diskus
  Time (mean ± σ):      65.1 ms ±   7.4 ms    [User: 81.1 ms, System: 138.9 ms]
  Range (min … max):    53.4 ms …  87.8 ms    44 runs
 
Benchmark #2: du -sh
  Time (mean ± σ):     129.6 ms ±   3.7 ms    [User: 39.9 ms, System: 88.6 ms]
  Range (min … max):   127.2 ms … 145.1 ms    22 runs
 
Benchmark #3: sn p -d0 -j8
  Time (mean ± σ):      42.1 ms ±   2.2 ms    [User: 42.9 ms, System: 108.0 ms]
  Range (min … max):    39.7 ms …  52.5 ms    67 runs
 
Benchmark #4: dust -d0
  Time (mean ± σ):     249.8 ms ±  52.6 ms    [User: 131.6 ms, System: 114.5 ms]
  Range (min … max):   176.4 ms … 348.3 ms    16 runs
 
Summary
  'sn p -d0 -j8' ran
    1.55 ± 0.19 times faster than '~/workspace/diskus/target/release/diskus'
    3.08 ± 0.18 times faster than 'du -sh'
    5.94 ± 1.29 times faster than 'dust -d0'

However, comparisons to the version installed via cargo look comparable:

hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' '~/workspace/diskus/target/release/diskus' 'diskus'
Benchmark #1: ~/workspace/diskus/target/release/diskus
  Time (mean ± σ):     255.0 ms ±  17.1 ms    [User: 164.1 ms, System: 375.7 ms]
  Range (min … max):   232.3 ms … 279.5 ms    10 runs
 
Benchmark #2: diskus
  Time (mean ± σ):     271.3 ms ±  26.4 ms    [User: 178.7 ms, System: 413.6 ms]
  Range (min … max):   240.4 ms … 328.9 ms    10 runs
 
Summary
  '~/workspace/diskus/target/release/diskus' ran
    1.06 ± 0.13 times faster than 'diskus'

And

hyperfine --warmup 5 '~/workspace/diskus/target/release/diskus' 'diskus'
Benchmark #1: ~/workspace/diskus/target/release/diskus
  Time (mean ± σ):      65.6 ms ±   7.6 ms    [User: 84.7 ms, System: 137.3 ms]
  Range (min … max):    55.3 ms …  81.2 ms    36 runs
 
Benchmark #2: diskus
  Time (mean ± σ):      68.3 ms ±   8.1 ms    [User: 83.0 ms, System: 149.0 ms]
  Range (min … max):    57.0 ms …  93.1 ms    43 runs
 
Summary
  '~/workspace/diskus/target/release/diskus' ran
    1.04 ± 0.17 times faster than 'diskus'

@TimOrme (Author) commented Oct 9, 2019

Also, for posterity, here's the script I used to generate the test directory:

#!/bin/bash -e

# Generates a rather naive testing directory.
# Aims for 40,400 files totalling ~15GB, spread over
# 10,100 directories (plus the base directory itself).

BASE_DIR=$1

# 15GB / 40,400 files ~= 370KB per file
FILE_SIZE_IN_KB=370

for i in {1..100}; do
    mkdir "$BASE_DIR/$i"

    for f in {1..4}; do
        head -c ${FILE_SIZE_IN_KB}K </dev/urandom > "$BASE_DIR/$i/$f.txt"
    done

    for j in {1..100}; do
        mkdir "$BASE_DIR/$i/$j"

        for f in {1..4}; do
            head -c ${FILE_SIZE_IN_KB}K </dev/urandom > "$BASE_DIR/$i/$j/$f.txt"
        done
    done
done

@sharkdp (Owner) commented Oct 9, 2019

Which version of sn did you use? I have

> sn -V
The Tin Summer 1.16.4

I'm pretty sure that's the version I used for the benchmark (which I updated recently). I should have documented the exact versions.

It might also be your disk / machine or the test directory. I can try to run the benchmark again with your test directory.

@TimOrme (Author) commented Oct 9, 2019

Ah I have a newer version:

bash$ sn -V
The Tin Summer 1.21.8

I'm not actually able to install 1.16.4 due to some feature restrictions in my Rust version. If you're able to run tests on your end, that'd be great.

@TimOrme (Author) commented Oct 10, 2019

I realized there were binary releases of it. Even with the old version, I still see better performance from sn than what the README lists. Maybe my test folders are different, or something else differs at the system level?

/home/torme/extracts/sn --version
The Tin Summer 1.16.4
hyperfine --prepare 'sync; echo 3 | sudo tee /proc/sys/vm/drop_caches' '~/workspace/diskus/target/release/diskus' 'du -sh' '/home/torme/extracts/sn p -d0 -j8' 'dust -d0'
Benchmark #1: ~/workspace/diskus/target/release/diskus
  Time (mean ± σ):     252.6 ms ±  11.3 ms    [User: 155.4 ms, System: 373.6 ms]
  Range (min … max):   237.0 ms … 272.0 ms    10 runs
 
Benchmark #2: du -sh
  Time (mean ± σ):      1.814 s ±  0.176 s    [User: 116.3 ms, System: 478.4 ms]
  Range (min … max):    1.605 s …  2.065 s    10 runs
 
Benchmark #3: /home/torme/extracts/sn p -d0 -j8
  Time (mean ± σ):     332.9 ms ±  41.9 ms    [User: 98.3 ms, System: 342.0 ms]
  Range (min … max):   303.2 ms … 442.4 ms    10 runs
 
Benchmark #4: dust -d0
  Time (mean ± σ):      1.736 s ±  0.101 s    [User: 191.5 ms, System: 434.2 ms]
  Range (min … max):    1.587 s …  1.915 s    10 runs
 
Summary
  '~/workspace/diskus/target/release/diskus' ran
    1.32 ± 0.18 times faster than '/home/torme/extracts/sn p -d0 -j8'
    6.87 ± 0.51 times faster than 'dust -d0'
    7.18 ± 0.77 times faster than 'du -sh'
hyperfine --warmup 5 '~/workspace/diskus/target/release/diskus' 'du -sh' '/home/torme/extracts/sn p -d0 -j8' 'dust -d0'
Benchmark #1: ~/workspace/diskus/target/release/diskus
  Time (mean ± σ):      65.1 ms ±   6.3 ms    [User: 80.7 ms, System: 141.9 ms]
  Range (min … max):    54.2 ms …  77.7 ms    38 runs
 
Benchmark #2: du -sh
  Time (mean ± σ):     128.0 ms ±   2.8 ms    [User: 45.1 ms, System: 82.0 ms]
  Range (min … max):   125.6 ms … 136.8 ms    23 runs
 
Benchmark #3: /home/torme/extracts/sn p -d0 -j8
  Time (mean ± σ):      51.5 ms ±   5.3 ms    [User: 70.3 ms, System: 115.0 ms]
  Range (min … max):    45.5 ms …  67.2 ms    53 runs
 
Benchmark #4: dust -d0
  Time (mean ± σ):     175.2 ms ±   3.9 ms    [User: 85.0 ms, System: 89.0 ms]
  Range (min … max):   171.6 ms … 187.1 ms    17 runs
 
Summary
  '/home/torme/extracts/sn p -d0 -j8' ran
    1.26 ± 0.18 times faster than '~/workspace/diskus/target/release/diskus'
    2.48 ± 0.26 times faster than 'du -sh'
    3.40 ± 0.36 times faster than 'dust -d0'

@sharkdp (Owner) commented Oct 19, 2019

Today, I had a chance to look into this. There are many things to consider:

  • I installed the latest version of sn. There doesn't seem to be a big difference. I can still reproduce the benchmark from the README on my Dropbox folder (which seems to me like a good real-world use case, but not a very reproducible one, of course).
  • I always used 8 threads for sn because that was the fastest option in my initial benchmarks. However, in other scenarios, other thread counts could work even better. If we optimize this parameter, that's not completely fair, because so far we do not optimize the -j/--threads parameter for diskus. It just uses its default heuristic, which makes a trade-off between good cold-cache and good warm-cache performance.
  • I used your script to create a 15GB test folder and I can - kind of - reproduce your results (I think you mislabeled warm/cold cache):

cold cache:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| diskus | 274.4 ± 8.7 | 263.6 | 292.4 | 1.00 |
| sn p -d0 -j8 | 425.8 ± 3.4 | 420.9 | 431.7 | 1.55 ± 0.05 |

warm cache:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| diskus | 55.2 ± 11.0 | 37.3 | 96.1 | 2.43 ± 0.49 |
| sn p -d0 -j8 | 22.7 ± 0.5 | 21.6 | 24.8 | 1.00 |

If I optimize the number of threads for diskus, I get closer to the performance of sn in the warm-cache benchmark:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| diskus -j7 | 30.8 ± 0.8 | 29.7 | 33.9 | 1.00 |

  • However, it's not quite clear to me why sn is faster in this particular case. Note that I discovered two bugs in sn while playing with it, so the two programs are probably not 100% comparable (see vmchale/tin-summer#24, "tin-summer does not count the size of directories", and vmchale/tin-summer#25, "tin-summer counts hardlinks twice").
  • The performance of both programs is vastly different from that on my Dropbox folder: both programs are much faster on the test folder. The biggest difference I can see is the nesting depth. My Dropbox folder has a depth of ~10 or more, whereas the test folder has a max. depth of 3. But that's just a guess.
  • diskus seems to suffer much more from outside influence, for some reason. Notice the increased standard deviation and the high max runtimes.

In any case, the thing we are really interested in is comparing diskus-master against this feature branch. On the test directory, I get:

cold:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| /tmp/diskus-master | 268.6 ± 15.4 | 250.8 | 299.2 | 1.00 |
| /tmp/diskus-feature | 270.3 ± 10.8 | 257.5 | 295.1 | 1.01 ± 0.07 |

warm:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| /tmp/diskus-master | 51.4 ± 10.1 | 33.3 | 73.2 | 1.00 |
| /tmp/diskus-feature | 52.3 ± 10.7 | 35.0 | 71.6 | 1.02 ± 0.29 |

For the Dropbox folder, I get:

cold:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| /tmp/diskus-master | 1.741 ± 0.020 | 1.713 | 1.777 | 1.01 ± 0.01 |
| /tmp/diskus-feature | 1.728 ± 0.014 | 1.705 | 1.754 | 1.00 |

warm:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---|---|---|---|
| /tmp/diskus-master | 323.6 ± 12.8 | 307.7 | 344.3 | 1.00 |
| /tmp/diskus-feature | 324.9 ± 19.3 | 305.2 | 365.3 | 1.00 ± 0.07 |

In conclusion: there is no statistically significant difference between the two, which is great! In fact, we can even run the recently added t-test script from the hyperfine repo, which tells us in both cases:

The two benchmarks are almost the same (p >= 0.05).

Tim Orme added 2 commits October 21, 2019 15:05
This adds better clarity to the return type instead of returning
a tuple of three unidentified u64 values.
@TimOrme (Author) commented Apr 17, 2020

Digging through my old PRs: are you just waiting on some additional testing for this?

@sharkdp (Owner) commented Apr 17, 2020

> are you just waiting on some additional testing for this?

Yes. I would really like this to be tested a bit more thoroughly before we merge it. If I execute diskus in an empty directory, it currently outputs:

❯ diskus
0 B (0 bytes, 0 files, 1 directories)

It probably counts the current directory (.), but I think that's rather unexpected, especially since the size is reported as 0 B.

Minor: if there is exactly one directory, it would be great if it printed "1 directory". Same for files. (Come to think of it, the same probably applies to the bytes field.)
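A tiny helper along these lines could pick the right noun form. The name `pluralize` and its signature are hypothetical, not part of this PR:

```rust
/// Format a count with its singular or plural noun, e.g. "1 directory"
/// vs. "2 directories". Hypothetical helper, not diskus code.
fn pluralize(count: u64, singular: &str, plural: &str) -> String {
    format!("{} {}", count, if count == 1 { singular } else { plural })
}

fn main() {
    println!("{}", pluralize(1, "directory", "directories")); // prints "1 directory"
    println!("{}", pluralize(3, "file", "files"));            // prints "3 files"
}
```

Taking both forms explicitly sidesteps irregular English plurals like "directory"/"directories", which a naive "append s" approach would get wrong.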

Next, symlinks are completely ignored: they are reported as neither files nor directories. That's technically correct, but probably unexpected as well. The same is true for other file types like sockets or pipes. I think we should probably count everything which is not a directory as a "file-like entry". In the output, we could probably still call it a "file".
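One way to implement that classification is via `std::fs::symlink_metadata`, which inspects the entry itself rather than following symlinks, so symlinks, sockets, and pipes all fall into the non-directory bucket (`is_file_like` is a hypothetical name, not part of the PR):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Treat everything that is not a directory as a "file-like entry".
/// symlink_metadata() does not follow symlinks, so even a dangling
/// symlink classifies cleanly as file-like.
fn is_file_like(path: &Path) -> io::Result<bool> {
    Ok(!fs::symlink_metadata(path)?.file_type().is_dir())
}

fn main() -> io::Result<()> {
    // Demo on a throwaway directory in the system temp dir.
    let dir = std::env::temp_dir().join("diskus_filelike_demo");
    let _ = fs::remove_dir_all(&dir);
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("plain.txt"), b"x")?;
    #[cfg(unix)]
    std::os::unix::fs::symlink(dir.join("plain.txt"), dir.join("link"))?;

    println!("{}", is_file_like(&dir.join("plain.txt"))?); // true
    println!("{}", is_file_like(&dir)?);                   // false
    #[cfg(unix)]
    println!("{}", is_file_like(&dir.join("link"))?);      // true

    fs::remove_dir_all(&dir)
}
```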

@sharkdp closed this Jul 26, 2020