select "best" UMAP layout for clustering #199

Open
kruus opened this issue Jun 30, 2021 · 3 comments

kruus commented Jun 30, 2021

Suggestion

Current pixplot clustering uses features from ...['variants][0][...
This is often the clustering that looks the worst (often the lowest-n_neighbors embedding),
and in practice it rarely agrees with the clusters I'd like to lasso.

UMAP clustering docs and examples (and experience) suggest a reasonable approach would be to cluster based on

(*) the layout with highest-n-neighbors (and then lowest-min-dist)

So what I do is

  default_hotspots = ""
  umap_vecs = best_umap_clustering_json(layouts=layouts,**kwargs)
  if umap_vecs is not None:
    default_hotspots = get_hotspots(vecs=read_json(umap_vecs, **kwargs), **kwargs)

where best_umap_clustering_json simply performs the search (*) and returns the filename of the matching layout:

https://github.com/kruus/pix-plot/blob/8a1cd231ce20cc075b9cb72c8ebeda97fdfb335c/pixplot/pixplot.py#L1219-L1244
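
The search (*) can be sketched in a few lines. This is a minimal illustration, not the code at the link above: it assumes each layout is a dict carrying `n_neighbors`, `min_dist` and `filename` keys, and both that record shape and the name `pick_clustering_layout` are hypothetical.

```python
from typing import Any, Mapping, Optional, Sequence


def pick_clustering_layout(
    layouts: Sequence[Mapping[str, Any]],
) -> Optional[Mapping[str, Any]]:
    """Return the layout with the highest n_neighbors, breaking ties by
    the lowest min_dist. Returns None if no layout carries both values.

    The dict keys used here are assumptions made for this sketch.
    """
    candidates = [
        layout for layout in layouts
        if "n_neighbors" in layout and "min_dist" in layout
    ]
    if not candidates:
        return None
    # Negating min_dist makes max() prefer the smallest min_dist on ties.
    return max(candidates, key=lambda l: (l["n_neighbors"], -l["min_dist"]))


# Hypothetical layout records; the filenames are made up for illustration.
example_layouts = [
    {"n_neighbors": 5, "min_dist": 0.01, "filename": "umap-5-0.01.json"},
    {"n_neighbors": 50, "min_dist": 0.5, "filename": "umap-50-0.5.json"},
    {"n_neighbors": 50, "min_dist": 0.01, "filename": "umap-50-0.01.json"},
]
best = pick_clustering_layout(example_layouts)
```

Here `best["filename"]` is the `n_neighbors=50`, `min_dist=0.01` entry, which is the file whose vectors would then be passed on to `get_hotspots`.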

Erik


kruus commented Jun 30, 2021

P.S. The hotspot scrollbar was not appearing in cases where it should. But my styles.css hacks to get the scrollbar to reappear nicely are kinda ugly :(

@pleonard212
Owner

Hi Erik, thanks for your thoughtful comment! This is in line with a discussion we've been having internally about where and when to cluster, given the newly-landed optional hyperparameter arrays (n_neighbors and min_dist) that you can pass at analysis time.

One idea we were kicking around was doing the clustering in the original high-dimensional space (2048 dimensions). Each resulting UMAP projection would then show, via the hover-on-mouseover affordance, how well that layout captured the clustering that hdbscan saw in the original space. The user could then decide, via the two hyperparameter sliders, which projection worked best as a basis to start editing and curating.

I have to admit the above is typed without any personal experience of clustering in such a high-dimensional space, so it's possible this would take way too long, or would produce nonsense in any 2D projection, etc. A possible hybrid model would be to run a special UMAP reduction purely for the purposes of clustering, giving hdbscan something like 10 dimensions to work with. The points McInnes makes, in the documentation you point out, about the differing needs of visualization vs. clustering are great ones, and we should really take them into consideration too. I'm sure @duhaime will chime in here shortly!
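
The hybrid model could look roughly like the sketch below: reduce to about 10 dimensions with UMAP, then hand the result to hdbscan. This is an assumption-laden illustration, not pixplot code. The function name `cluster_via_reduction`, the `reduce_fn`/`cluster_fn` hooks and the default parameter values are all hypothetical; the `umap.UMAP` and `hdbscan.HDBSCAN` calls are only used when no hooks are supplied, and `min_dist=0.0` follows the general advice in the UMAP clustering documentation.

```python
def cluster_via_reduction(vectors, n_components=10, n_neighbors=30,
                          min_cluster_size=15,
                          reduce_fn=None, cluster_fn=None):
    """Cluster `vectors` in a reduced space instead of the 2D display layout.

    reduce_fn and cluster_fn are hypothetical hooks so that either step
    can be swapped out; by default they use umap-learn and hdbscan.
    """
    if reduce_fn is None:
        import umap  # umap-learn; imported lazily so the hooks work without it
        reduce_fn = umap.UMAP(
            n_neighbors=n_neighbors,
            min_dist=0.0,  # pack points tightly: better for clustering than for viewing
            n_components=n_components,
        ).fit_transform
    if cluster_fn is None:
        import hdbscan
        cluster_fn = hdbscan.HDBSCAN(
            min_cluster_size=min_cluster_size,
        ).fit_predict
    reduced = reduce_fn(vectors)
    # With hdbscan, the returned labels use -1 for points treated as noise.
    return cluster_fn(reduced)
```

The labels come from the reduced space only, so the same cluster assignment can be overlaid on every 2D layout for comparison, which is close to the "how well did this layout capture the clustering" idea above.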


kruus commented Jul 1, 2021

Clustering in the hi-D space might be a good option to have available, at least for comparisons. So far, using the highest-D UMAP space "works well for me", as I'm actually interested in seeing the connectivity of the hi-D manifold. For novice users, auto-adding a "special umap reduction" when a reasonable one cannot be found seems a nice touch, @pleonard212! My version just emits a warning (and doesn't even check min_dist), which is a less good idea. Related: Issue #36.

I guess a visual consequence of adding a potentially off-grid umap layout would be amalgamating the neighbors and min_dist info into a single layouts slider. The slider would report the 2 values of the (somehow sorted) layout index.
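
As a rough sketch of that single-slider idea, the layouts could be sorted once and each slider position mapped to a label showing both values. The sort order (by n_neighbors, then min_dist), the dict keys and the name `slider_labels` are assumptions made for illustration only.

```python
def slider_labels(layouts):
    """Return one label per slider position for a combined layouts slider.

    Layouts are sorted by (n_neighbors, min_dist), so an off-grid layout
    simply takes its place in the sequence. The dict keys are assumed.
    """
    ordered = sorted(layouts, key=lambda l: (l["n_neighbors"], l["min_dist"]))
    return [
        f"n_neighbors={l['n_neighbors']}, min_dist={l['min_dist']}"
        for l in ordered
    ]


labels = slider_labels([
    {"n_neighbors": 15, "min_dist": 0.1},
    {"n_neighbors": 5, "min_dist": 0.5},
    {"n_neighbors": 5, "min_dist": 0.01},
])
```

Position 0 of the slider would then read `n_neighbors=5, min_dist=0.01`, and the last position `n_neighbors=15, min_dist=0.1`.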
