
Add reference to dua-cli in the README as similar tool #14

Merged · 2 commits · May 29, 2021

Conversation

@Byron (Contributor) commented May 29, 2021

As `dua` provides both a CLI mode and an interactive mode via `dua i`, I placed it into both categories.

Disclaimer: I am the author of this tool and have adapted [this paragraph](https://github.com/Byron/dua-cli/blob/60f432417fe2adbbd54de7293f1c3ffcd45365f7/README.md#L168-L181) for my own README.

Edit: Now that I have used pdu a little, I finally appreciate the way the data is presented. Whereas dua gives a high-level overview, pdu dives in to reveal exactly where the main offenders in terms of disk space usage are. It took me a while (I even wrote my own tool to solve this problem), but now I can see the benefits of this kind of visualization.

dust never worked for me as it was too slow and…used too much memory, so pdu truly makes a difference here.

Lastly I encourage you to build a TUI which allows the safe deletion of picked items to support the entire workflow people are usually using pdu for.

@KSXGitHub (Owner) left a comment

I don't like duplication.

If the main and default interface of dua is CLI, add dua (optional TUI) to the CLI list.
If the main and default interface of dua is TUI, add dua (optional CLI) to the TUI list.

@KSXGitHub (Owner) commented

Thanks for your compliment btw 😄. Though I would imagine that pdu uses even more memory than dust (threads and all). Of course, I'm no expert in parallel computing.

@KSXGitHub merged commit 0382691 into KSXGitHub:master on May 29, 2021
@Byron (Contributor, Author) commented May 29, 2021

Thanks for your compliment btw 😄. Though I would imagine that pdu uses even more memory than dust (threads and all). Of course, I'm no expert in parallel computing.

I compared it to dua; here is the result for pdu:

real        10.25
user         2.20
sys         56.06
           146489344  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               13239  page reclaims
                  50  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
              160643  voluntary context switches
              313338  involuntary context switches
        154768495577  instructions retired
        153908010215  cycles elapsed
           146114624  peak memory footprint

And here is the one for dua:

real        10.56
user         5.42
sys         26.00
           185794560  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               15260  page reclaims
                  50  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
              207419  voluntary context switches
              355053  involuntary context switches
        196983443197  instructions retired
         90303457631  cycles elapsed
           179029640  peak memory footprint

So, erm, it uses less than dua, which certainly does less, too 🤦‍♂️. It's a bit surprising to me, as dua doesn't have much state of its own in this mode (it uses jwalk).

@KSXGitHub (Owner) commented

Regarding this line, I see that you intend to push items to aggregates? My advice is: beware of reallocation (Vec resizing). I would attempt to give a size hint (Vec::with_capacity) or switch to another container type.
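A minimal sketch of the size-hint idea, assuming the number of top-level entries is known (or cheaply countable) up front; the names here are invented for illustration:

```rust
// Pre-allocating the Vec avoids repeated reallocation as the aggregate
// list grows; entries.len() serves as the size hint.
fn collect_aggregates(entries: &[&str]) -> Vec<(String, u64)> {
    let mut aggregates = Vec::with_capacity(entries.len());
    for name in entries {
        let size = 0u64; // placeholder for the real size computation
        aggregates.push((name.to_string(), size));
    }
    aggregates
}

fn main() {
    let aggregates = collect_aggregates(&["src", "target", "docs"]);
    println!("{aggregates:?}");
}
```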

Lastly I encourage you to build a TUI which allows the safe deletion of picked items to support the entire workflow people are usually using pdu for.

I'm not sure if lazy me is willing to do that. Even the relatively simple CLI doesn't yet have any integration tests of its own because the complexity of setting up the environment alone is too much for my lazy ass. Testing a TUI would be nigh-impossible. How could one guarantee the stability of the software without tests?

Finally, would you mind if I add dua to the list of benchmarks?

@Byron (Contributor, Author) commented May 29, 2021

Regarding this line, I see that you intend to push items to aggregates?

I believe this is just the top-level that will be listed later - it's not huge and won't show up in any profile.

I'm not sure if lazy me is willing to do that.

Indeed, it would be quite some work. If lines-of-code count is any measure, the 4k lines of pdu would certainly go up quite a bit. dua clocks in at 3.4k LOC for everything Rust, so pdu's approach seems more complex, which probably translates to the TUI as well.
I am lazy too, but once I want something badly enough, I will do it :D. With dua, the problem I was having of clearing disk space is solved, and it isn't clear that a pdu TUI would do so much better to make it worth the effort.

How could one guarantee the stability of the software without tests?

It's actually straightforward. dua itself is an event-driven engine whose events manipulate its state, and that is perfectly testable. The rendering is done with tui, which supports a test backend that can be used with snapshot testing. That way one can unit-test both the application state and the looks. dua doesn't test the looks; that's something I only check visually. Furthermore, it only tests the most important happy paths of typical user journeys. The latter were written after I had something working that was easy enough not to require unit tests solely to protect against regressions.
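For illustration, a minimal sketch of such an event-driven, unit-testable core; the types and event set here are invented and are not dua's actual ones:

```rust
// Events mutate state; rendering is a separate concern, so state
// transitions can be unit-tested without any terminal involved.
#[derive(Debug, PartialEq)]
enum Event {
    CursorDown,
    MarkForDeletion,
}

#[derive(Debug, Default, PartialEq)]
struct AppState {
    cursor: usize,
    marked: Vec<usize>,
}

impl AppState {
    fn handle(&mut self, event: Event) {
        match event {
            Event::CursorDown => self.cursor += 1,
            Event::MarkForDeletion => self.marked.push(self.cursor),
        }
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn marks_entry_under_cursor() {
        let mut state = AppState::default();
        state.handle(Event::CursorDown);
        state.handle(Event::MarkForDeletion);
        assert_eq!(state.marked, vec![1]);
    }
}
```

Snapshot-testing the rendered output would then sit on top of this, feeding the same state into tui's test backend.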

Finally, would you mind if I add dua to the list of benchmarks?

Not at all - it would be nice if you could ping me here as I am curious about the findings. Basically you would be comparing the pdu engine with jwalk which dua relies on.
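For context, a rough sketch of what such a jwalk-based traversal looks like (summing apparent sizes with its walkdir-like parallel iterator; paths and error handling are simplified, and the exact calls are an assumption rather than dua's actual code):

```rust
use jwalk::WalkDir;

fn main() {
    // jwalk walks the tree in parallel, distributing read_dir calls
    // across a thread pool under the hood.
    let total: u64 = WalkDir::new(".")
        .into_iter()
        .filter_map(Result::ok)                    // skip unreadable entries
        .filter_map(|entry| entry.metadata().ok()) // skip entries without metadata
        .map(|metadata| metadata.len())            // apparent size in bytes
        .sum();
    println!("{total} bytes");
}
```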

@Byron (Contributor, Author) commented May 29, 2021

[Screenshot 2021-05-29 at 11:35:31]

So… I couldn't help but imagine how a TUI could work. But let's start with a question: Is there a way to increase the contrast of these percentage lines? It's so hard to make out on certain levels - probably my main sore point with pdu right now.

The reason I keep thinking about workflow here is what I am usually doing with that data: I want to delete some of it. Even though a full-blown TUI with selection and subsequent (potentially parallel) deletion would be great, maybe there is a way to output the list of paths that it displayed before in a format that makes copy-pasting for deletion easier.

In the end, the user needs complete paths, and pdu could provide them. I could imagine running it once for visualization, and another time to get a flat list of paths that can be copied manually for removal.

@Byron (Contributor, Author) commented May 29, 2021

You see, I really like pdu because it's automating a part of what dua i allows doing: 1) find a list of candidates for deletion, and 2) delete them. 1) is done better in pdu, and I have a feeling that with a little tuning it could replace some uses of dua for me.

Right now I would probably run it before dua i to get an idea, and then use dua i to queue the offenders for deletion. The suggestion above of outputting a flat (and parseable) list of items would definitely help automate these tasks.

Maybe something like pdu --list-level 2 | dua i --queue-pdu-list could use the existing TUI of dua to schedule folders on level 2 of the directory tree for deletion. The user would then have to dequeue the ones they don't want to delete.
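As a sketch of the consuming side (purely hypothetical: neither of the flags above exists today), a tool receiving such a flat list could simply read newline-separated paths from stdin and queue them:

```rust
use std::io::{self, BufRead};
use std::path::PathBuf;

fn main() {
    // Hypothetical consumer of a flat path list piped in from another tool:
    // every non-empty stdin line becomes a candidate queued for review/deletion.
    let queue: Vec<PathBuf> = io::stdin()
        .lock()
        .lines()
        .filter_map(Result::ok)
        .filter(|line| !line.trim().is_empty())
        .map(PathBuf::from)
        .collect();

    for path in &queue {
        println!("queued for review: {}", path.display());
    }
}
```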

@KSXGitHub (Owner) commented

it isn't clear that a pdu TUI would do so much better to make it worth the effort

Correction: pdu's UI isn't original, I stole it from dust, which in turn stole it from dutree.

I'm probably not going to implement an interactive TUI for deleting files in the near future. However, parallel-disk-usage is also a library crate; the data structures and algorithms for aggregating and visualizing the directory tree are already there. Anyone who wants this feature badly enough could build a tool on top of this library. I am also interested in seeing how it could be done.

Is there a way to increase the contrast of these percentage lines? It's so hard to make out on certain levels - probably my main sore point with pdu right now.

That's the problem of your terminal and/or your fonts.

I use Tilix (which uses VTE under the hood) as my main terminal, and Hack Nerd Font as my font. Here is the screenshot:

[Tilix screenshot]

I also tested the same command on Alacritty, and it doesn't look as good:

[Alacritty screenshot]

@KSXGitHub (Owner) commented

Right now I would probably run it before dua i to get an idea, and then use dua i to queue the offenders for deletion. The suggestion above of outputting a flat (and parseable) list of items would definitely help automate these tasks.

Maybe something like pdu --list-level 2 | dua i --queue-pdu-list could use the existing TUI of dua to schedule folders on level 2 of the directory tree for deletion. The user would then have to dequeue the ones they don't want to delete.

GNU's du can already create a flat, machine-readable list of items. I also plan to add JSON input/output to pdu in the future (it wouldn't be flat however).
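To make the "wouldn't be flat" part concrete, here is a rough serde sketch of what a nested JSON shape for disk-usage data might look like; the field names are invented for illustration and are not pdu's actual format:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical nested node: each directory carries its aggregated size
// and its children, mirroring the tree rather than a flat list of paths.
#[derive(Serialize, Deserialize, Debug)]
struct Node {
    name: String,
    size: u64,
    #[serde(default)]
    children: Vec<Node>,
}

fn main() -> serde_json::Result<()> {
    let tree = Node {
        name: "tmp.sample".into(),
        size: 3,
        children: vec![
            Node { name: "a".into(), size: 1, children: vec![] },
            Node { name: "b".into(), size: 2, children: vec![] },
        ],
    };
    println!("{}", serde_json::to_string_pretty(&tree)?);
    Ok(())
}
```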

@Byron (Contributor, Author) commented May 29, 2021

That's the problem of your terminal and/or your fonts.

Thanks for the hint, I will see how I can get Alacritty to display this better then, and change the font.

I also plan to add JSON input/output to pdu in the future (it wouldn't be flat however).

Neat, that would work fine as well, as I would implement this specifically to be able to use pdu as part of a processing pipeline. I subscribed to releases to stay informed :).

@KSXGitHub (Owner) commented

I also plan to add JSON input/output to pdu in the future (it wouldn't be flat however).

Neat, that would work fine as well, as I would implement this specifically to be able to use pdu as part of a processing pipeline. I subscribed to releases to stay informed :).

Subscribe to #17 and #18 as well.

@KSXGitHub (Owner) commented May 29, 2021

Finally, would you mind if I add dua to the list of benchmarks?

Not at all - it would be nice if you could ping me here as I am curious about the findings. Basically you would be comparing the pdu engine with jwalk which dua relies on.

I am happy to inform you that the benchmark report is now available as a release artifact.

@Byron (Contributor, Author) commented May 29, 2021

Congratulations, it's amazing to see there is still performance to be gained in this field.

To me the sub-second runs don't matter that much, but for bigger trees this really starts to show and a couple of milliseconds become seconds.
On my test-set with 1.44 million files, dua takes 12.2s whereas pdu takes 10.3s, which is quite significant.

I took a look at what it would mean to use pdu as an engine and noticed this would pull in additional dependencies like clap and thus increase the compile time. To fix this in the current setup, cargo features could be used.

In the meantime I will be waiting for the JSON export feature to land, which would allow me to combine pdu's greater speed at figuring out good candidates for deletion with the actual deletion TUI of dua :D.

@KSXGitHub (Owner) commented

I took a look at what it would mean to use pdu as an engine and noticed this would pull in additional dependencies like clap and thus increase the compile time. To fix this in the current setup, cargo features could be used.

Sound advice. I will be implementing this soon.

@KSXGitHub (Owner) commented May 29, 2021

I took a look at what it would mean to use pdu as an engine and noticed this would pull in additional dependencies like clap and thus increase the compile time. To fix this in the current setup, cargo features could be used.

In version 0.2.0 (which may or may not yet be released), the CLI part of the parallel-disk-usage library and its dependencies (clap, structopt, etc.) can now be disabled by disabling default features.

@KSXGitHub (Owner) commented Jun 3, 2021

Version 0.3.0 has been released. It can now print disk usage data as JSON to stdout as well as visualize input JSON from stdin. A new benchmark report with the latest version of dua has also been produced.

@Byron (Contributor, Author) commented Jun 4, 2021

Thanks a lot! I have added an issue to hopefully one day implement a pdu integration.

On another note, can it be that the picture in the README right below the list of program versions used to create it is out of date? The benchmark has run multiple times by now, yet its last-modified date appears as Sun, 30 May 2021 04:24:53 GMT (produced with xh HEAD https://camo.githubusercontent.com/8a2a9497f22a5d1879128c069cfdb8c1679a7f8b620f5077d602b0170c7b5d11/68747470733a2f2f6b73786769746875622e6769746875622e696f2f706172616c6c656c2d6469736b2d75736167652d302e322e342d62656e63686d61726b732f746d702e62656e63686d61726b2d7265706f72742e636f6d706574696e672e626c6b73697a652e737667).

By now I have reproduced part of the benchmark run and am curious about what's happening on the CI runner.

Note that both pdu and dua are extremely close; in theory, pdu could be a little faster, but it now suffers from the same M1 problem that dua would suffer from had I not compiled in a different default thread count on Apple Silicon. So dua uses 4 threads whereas pdu uses 8.
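For illustration, capping the default thread count on Apple Silicon might look roughly like this with rayon; this is a sketch under the assumption that the traversal runs on a rayon thread pool, and the cfg condition and counts are examples rather than dua's actual build configuration:

```rust
fn main() {
    // Hypothetical: use fewer threads by default on Apple Silicon, where
    // extra I/O threads landing on efficiency cores can hurt wall-clock time.
    let threads = if cfg!(all(target_os = "macos", target_arch = "aarch64")) {
        4
    } else {
        std::thread::available_parallelism().map(|n| n.get()).unwrap_or(8)
    };

    rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .build_global()
        .expect("the global thread pool is only initialized once");

    // ... the actual traversal work would run on this pool ...
}
```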

hyperfine 'dua --apparent-size tmp.sample' 'pdu tmp.sample' 'du tmp.sample'

Benchmark #1: dua --apparent-size tmp.sample
  Time (mean ± σ):      76.4 ms ±   7.1 ms    [User: 105.4 ms, System: 257.2 ms]
  Range (min … max):    73.0 ms … 111.1 ms    26 runs

  Warning: The first benchmarking run for this command was significantly slower than the rest (111.1 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.

Benchmark #2: pdu tmp.sample
  Time (mean ± σ):      83.5 ms ±   3.9 ms    [User: 81.3 ms, System: 515.8 ms]
  Range (min … max):    81.2 ms … 103.0 ms    28 runs

  Warning: The first benchmarking run for this command was significantly slower than the rest (103.0 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.

Benchmark #3: du tmp.sample
  Time (mean ± σ):     152.9 ms ±   1.5 ms    [User: 11.6 ms, System: 140.7 ms]
  Range (min … max):   150.0 ms … 156.1 ms    19 runs

Summary
  'dua --apparent-size tmp.sample' ran
    1.09 ± 0.11 times faster than 'pdu tmp.sample'
    2.00 ± 0.19 times faster than 'du tmp.sample'

It's interesting how fast du is given that it uses way less system resources, making it the definitive winner per Watt :D.

As for the reason the world looks different on CI, the only explanation I could pull out of thin air is that hyperfine is run individually for each program, whereas it could also be invoked once for all of them to produce the output the report is generated from.

@KSXGitHub (Owner) commented

As for the reason the world looks different on CI, the only explanation I could pull out of thin air is that hyperfine is run individually for each program, whereas it could also be invoked once for all of them to produce the output the report is generated from.

I think it's actually about the way you invoke hyperfine:

hyperfine 'dua --apparent-size tmp.sample' 'pdu tmp.sample' 'du tmp.sample'

There are also warnings:

Warning: The first benchmarking run for this command was significantly slower than the rest (111.1 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.
Warning: The first benchmarking run for this command was significantly slower than the rest (103.0 ms). This could be caused by (filesystem) caches that were not filled until after the first run. You should consider using the '--warmup' option to fill those caches before the actual benchmark. Alternatively, use the '--prepare' option to clear the caches before each timing run.

In the GitHub Workflow files, I always add --warmup=3 to every hyperfine command.

If you also want to measure cold start, I suggest rebooting after each benchmark.

@KSXGitHub (Owner) commented

On another note, can it be that the picture in the README right below the list of program versions used to create it is out of date?

Yes, I have yet to update the benchmark section of the README. But it doesn't actually matter, because dua's performance doesn't change much in the benchmark performed by pdu 0.3.0 (direct link to the benchmark reports).

There's also 0.4.0, which I have yet to check out.

@Byron (Contributor, Author) commented Jun 5, 2021

I think it's actually about the way you invoke hyperfine:

It's the lazy way of invoking it, admittedly. Ultimately it's still hyperfine that runs the programs hundreds of times to get comparable values.

If you also want to measure cold start, I suggest rebooting after each benchmark.

With --prepare it's possible to purge the fs cache; on macOS it would be --prepare purge.

[…] because dua's performance doesn't change much […]

And that's the last unresolved riddle here. Thus far the arrival of pdu has already uncovered a lot of interesting knowledge and, as far as I can see, also helped fix a synchronization issue in the pdu progress reporting. From my tests I know that both tools are very similar regarding performance and it comes down to milliseconds. dua being consistently slower than single-threaded programs, however, makes no sense to me and I am sure there is more interesting knowledge to be uncovered here.

Please don't get me wrong, to me it matters not who is 'the fastest', but I want to understand what's going on as the benchmark contradicts both my experience and measurements alike.

@KSXGitHub (Owner) commented

dua being consistently slower than single-threaded programs, however, makes no sense to me and I am sure there is more interesting knowledge to be uncovered here.

I am still in disbelief that these fast programs are actually single-threaded.

Please don't get me wrong, to me it matters not who is 'the fastest', but I want to understand what's going on as the benchmark contradicts both my experience and measurements alike.

I didn't intend to make pdu the fastest either. I only wanted a dust with acceptable performance; the fact that it became the fastest is unintentional.

@Byron (Contributor, Author) commented Jun 5, 2021

I am still in disbelief that these fast programs are actually single-threaded.

That's a good point. Last time I tested them on macOS they were. Maybe that has changed. What matters is the version the CI system is using, and their threadedness should be easy to observe with time.

Just to try one, I downloaded the latest source of ncdu, built it and ran it like this:

➜  ncdu-1.15.1 time ./ncdu  ~/dev
./ncdu ~/dev  0.83s user 17.37s system 56% cpu 32.105 total

It doesn't even saturate a single CPU core, which is quite typical on my system when doing single-threaded filesystem traversals. What matters is the system time (it was stuck in GUI mode for a while longer), so it takes about 18s to traverse what takes dua 10.5s at its best configuration and pdu currently 11.5s due to using 8 threads.

Grepping through the code to look for threading didn't yield results either, so I doubt there is a compile flag to turn that on.
