Handling the same image filename in multiple folders #186

Open
vdet opened this issue Apr 26, 2021 · 9 comments

Comments

@vdet

vdet commented Apr 26, 2021

Hi again,

I tried to display datasets up to ~600.000 images (pre-computed UMAP). Here is some feedback:

  • pixplot.py ran without a glitch.
  • At first the data failed to load into the http server. I gathered from the GitHub project page that a smaller cell_size could help, and it did. Any guidance about how to set --cell_size as a function of the number and size of input images?
  • For the 600.000 images (224 x 224 px), once --cell_size was set small enough, the data were loaded by the server. But only a subset of the input images, possibly half of them, showed up in PixPlot. Is that expected? If it is, perhaps the images to be displayed could be selected at random.
  • The max cell size limit of 1.2 was too small in the geographic view (maybe it could be scaled with respect to the map size; this also applies to other hardcoded point_size parameters).

That being said, I could already get insight from the subset of images I could view. Thanks!

Vincent

@duhaime
Contributor

duhaime commented Apr 26, 2021

Many thanks for this @vdet! Now this is interesting. We haven't tested with plots this large in a little while!

We don't have much guidance on the cell_size argument yet. One of the relevant factors in that consideration would be how long you can wait for the plots to load--if you don't mind waiting a bit, you could afford to use a larger cell size, but you may end up downloading a few hundred MB while fetching the atlas files, aka the large images that each contain many small images, stored in ./output/data/atlases/{{ plot id }}/atlas-{{ atlas index }}.jpg. If you want others to access your plots, I'd aim for something smaller than the default. Possibly 8 or 10 px? The original Google Arts & Culture project used 16px of width as the constraining dimension, and they squeezed ~400,000 images into their viewer!
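As a rough illustration of how cell size drives the number of atlas files, here is a back-of-the-envelope sketch. It assumes square thumbnails packed on a simple grid into 2048 x 2048 px atlases (the atlas size mentioned further down in this thread); the function name `atlases_needed` is hypothetical and PixPlot's actual packing may differ.

```python
import math

ATLAS_SIZE = 2048  # atlas edge in px, as stated elsewhere in this thread


def atlases_needed(n_images, cell_size, atlas_size=ATLAS_SIZE):
    """Estimate the atlas count for square thumbnails on a simple grid.

    Assumption: every thumbnail is cell_size x cell_size and atlases are
    filled completely, row by row. Real packing may be less dense.
    """
    per_row = atlas_size // cell_size
    per_atlas = per_row * per_row
    return math.ceil(n_images / per_atlas)


for cell_size in (8, 10, 16):
    print(cell_size, atlases_needed(600_000, cell_size))
# 8 10
# 10 15
# 16 37
```

Under these assumptions, halving the cell size from 16 to 8 cuts the number of atlases to fetch by roughly a factor of four.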

I'm curious as to the details of your images. All of the images should have been plotted in the scene. But some images may have been filtered out during the data prep stage. As you may know, when the user provides their list of images to be processed, we iterate through them and resize them into little images that have height cell_size and the width required to keep the image's aspect ratio.

If an image has 0 width after being resized (or the image has a height or width that's larger than the atlas size, which is 2048px by 2048px) it won't be retained in the atlases or the plot. I have a hunch that filter is what's causing the missing images. You can check the full list of images that were processed in ./output/data/imagelists/imagelist-{{ plot_id }}.json
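The filter described above could be sketched roughly as follows. This is not PixPlot's code: the function names are hypothetical, the rounding (integer truncation) is an assumption, and the size check is applied to the resized thumbnail, which is one reading of the description.

```python
ATLAS_SIZE = 2048  # atlas edge in px, per the comment above


def thumb_size(width, height, cell_size):
    """Resize to height == cell_size, keeping the aspect ratio.

    Integer arithmetic is used so the result does not depend on
    floating-point rounding (an assumption about the real behaviour).
    """
    return width * cell_size // height, cell_size


def is_retained(width, height, cell_size, atlas_size=ATLAS_SIZE):
    """Return True if the resized image would survive the filter."""
    w, h = thumb_size(width, height, cell_size)
    if w == 0:
        return False  # too narrow: collapses to zero width
    if w > atlas_size or h > atlas_size:
        return False  # does not fit in a single atlas
    return True


print(is_retained(224, 224, 16))     # square image: True
print(is_retained(10, 2000, 16))     # very tall, thin image: False
print(is_retained(300000, 100, 16))  # extremely wide image: False
```

Under this reading, uniformly sized 224 x 224 px images would never be dropped by the filter, whatever the cell size.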

The idea of using non-integer cell sizes is interesting! I'm actually not sure what would happen in that case. But I would probably use a cell size of 6 or 8 at a minimum...

@duhaime
Contributor

duhaime commented Apr 26, 2021

Just thinking a little more about this, @vdet: there are some other considerations that might be at play. If the number of images in the output image list is greater than the number of displayed images, it could be because of limitations in the GPU card of the host that's running the visualization.

You can get some information on your system's GPU card here. If the atlases are too big, one can bump up against the Max Texture Image Units value. Or possibly we are packing more into a single draw call than your GPU can handle (there are a few GL parameters that influence the max draw call size).

To alleviate concern over any of these lower-level issues, I'd start by counting the images in the generated image list (alternatively, you could check data.cells.length from your browser console). If that number is reduced, then it's case closed. Else we may need a little more information to figure out why some images are not appearing!
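Counting the entries in the generated image list could look something like the sketch below. The JSON layout is an assumption: the helper (a hypothetical name) accepts either a top-level list or a dict holding the list under a key, here guessed to be "images"; inspect the file to confirm before relying on it.

```python
import json


def count_listed_images(imagelist_path, key="images"):
    """Count entries in an image list JSON file.

    Assumption: the file is either a JSON list of images, or a JSON
    object whose `key` field holds that list.
    """
    with open(imagelist_path) as f:
        data = json.load(f)
    entries = data[key] if isinstance(data, dict) else data
    return len(entries)
```

The result can then be compared with the number of files matched by the input glob.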

@vdet
Author

vdet commented Apr 27, 2021

The images all have the same size: 224 x 224 px. I actually tried setting --cell_size to 8 when I saw that images were discarded with 16, but it made no difference as far as I could tell from visual inspection. The file ./output/data/imagelists/imagelist-{{ plot_id }}.json lists all the images (592106) I provided. Now, I am no GPU expert. The WebGL report indicates:

Max Texture Size: 8192
Max Cube Map Texture Size: 8192
Max Combined Texture Image Units: 80
Max Anisotropy: 16

The GPU is an AMD Radeon R9 M395 in a late 2015 27" retina iMac.

All the best,
Vincent

@vdet
Author

vdet commented May 19, 2021

Hello,
The issue of images not displayed was caused by a basic name conflict problem: if my images are organized like this

dir_A/img.png
dir_B/img.png #not the same image as dir_A/img.png

and pixplot.py is invoked with --image "*/img.png", only one img.png, not both, will end up in the relevant data subdirectories. These subdirs have flat structures, so the different img.png files overwrite one another. This would be avoided if pixplot's image storage dir structure mirrored the one provided by the user.
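A quick way to check an input set for this kind of conflict before running pixplot.py is to group the paths by basename. This is a minimal sketch; `find_basename_collisions` is a hypothetical helper, not part of PixPlot.

```python
import os
from collections import defaultdict


def find_basename_collisions(paths):
    """Map each basename shared by several paths to those paths."""
    groups = defaultdict(list)
    for p in paths:
        groups[os.path.basename(p)].append(p)
    return {name: ps for name, ps in groups.items() if len(ps) > 1}


paths = ["dir_A/img.png", "dir_B/img.png", "dir_C/other.png"]
print(find_basename_collisions(paths))
# {'img.png': ['dir_A/img.png', 'dir_B/img.png']}
```

In practice, `paths` could come from `glob.glob("*/img.png")`; an empty result means no two input files share a filename.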

All the best,
Vincent

@duhaime
Contributor

duhaime commented May 20, 2021

@vdet many thanks for your note. Yes this is something we've thought about. The only challenge is that the filename serves as the foreign key that lets us create the connections between images and metadata rows. This means that if we process the full or relative image path when processing images, the user would have to include that path in their metadata, which could be quite challenging.

What if we displayed a little warning indicating that there are duplicate filenames in the input dataset--would that be sufficient? I'm open to other ideas instead!

@vdet
Author

vdet commented May 20, 2021

Hi Douglas,

> This means that if we process the full or relative image path when processing images, the user would have to include that path in their metadata, which could be quite challenging.

The user does provide a path already: to be able to use an image glob of the form --image "*/img.png", I specified the full path to each image (dir_A/img.png, dir_B/img.png, etc.) in the filename column of my metadata file. Otherwise, there would be no way for pixplot to match the images and the metadata in the first place. Any user who wishes to use a non-flat dir structure must of course be able to establish that mapping. At some point pixplot.py removes the paths in the metadata's filename column. Why not leave that filename column intact and mirror the user's dir hierarchy for the image-specific files?

Note that non-flat dir structures naturally occur in many contexts, when

  • combining several image collections
  • combining, to compare them, several versions of the same images

Well, I of course don't understand the ins and outs of your code, and I can find an upstream fix for my application. Anyhow, a warning would sure help. In my case the effect was so massive that I could not miss it. In other cases a few images will be missing while others will have the wrong metadata and land in the wrong spot in the geographic view (which is what helped me pinpoint the problem).

Thanks,

Vincent

@duhaime
Contributor

duhaime commented May 20, 2021

@vdet aha! Your note is very helpful. Right now, we process just the "basename" of each image and map that to the basename of the image specified in the metadata inputs (source).

The motivation for using the file basenames was to allow users to create a static representation of their metadata that doesn't include relative or fully-qualified paths, as either of the latter would be cumbersome if one were to move the data.

Perhaps we should check to see whether the images attribute in the metadata contains path demarcators and if so, join on the paths specified? How does that sound?
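The proposal above might be sketched like this: if any metadata filename contains a path separator, join images and metadata on the normalized path; otherwise fall back to the basename. All function names here are hypothetical and this is only one possible reading of the idea, not PixPlot's implementation.

```python
import posixpath


def _normalize(path):
    """Use forward slashes and drop redundant parts such as a leading './'."""
    return posixpath.normpath(path.replace("\\", "/"))


def metadata_uses_paths(filenames):
    """True if any metadata filename contains a path separator."""
    return any("/" in f or "\\" in f for f in filenames)


def join_key(filename, use_paths):
    """Key used to match an image with its metadata row."""
    norm = _normalize(filename)
    return norm if use_paths else posixpath.basename(norm)


def match(image_paths, metadata_filenames):
    """Map each image path to its metadata filename (or None)."""
    use_paths = metadata_uses_paths(metadata_filenames)
    meta = {join_key(f, use_paths): f for f in metadata_filenames}
    return {p: meta.get(join_key(p, use_paths)) for p in image_paths}


print(match(["dir_A/img.png", "dir_B/img.png"],
            ["dir_A/img.png", "dir_B/img.png"]))
```

With path-qualified metadata, the two img.png files stay distinct; with bare filenames in the metadata, the existing basename behaviour is preserved.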

@vdet
Author

vdet commented May 21, 2021

Hi Douglas,

> The motivation for using the file basenames was to allow users to create a static representation of their metadata that doesn't include relative or fully-qualified paths, as either of the latter would be cumbersome if one were to move the data.

They can put all their dirs in one big dir if they need to move all images at once. This is actually what I did in my real life application with a root image dir containing sub dirs.

> Perhaps we should check to see whether the images attribute in the metadata contains path demarcators and if so, join on the paths specified? How does that sound?

Not sure I understand what you mean. I'd suggest leaving the metadata filename column intact, using it as the image path, and mirroring the user's dir structure in data/thumbs, etc.
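To make the suggestion concrete, here is a small sketch of how an output path could mirror the input hierarchy. The helper `thumb_path` and the `input_root` parameter are hypothetical, and this is not how PixPlot currently stores files.

```python
from pathlib import PurePosixPath


def thumb_path(image_path, input_root, thumbs_root="data/thumbs"):
    """Build an output path that keeps the image's subdirectory.

    image_path must be located under input_root; otherwise
    PurePosixPath.relative_to raises ValueError.
    """
    rel = PurePosixPath(image_path).relative_to(input_root)
    return str(PurePosixPath(thumbs_root) / rel)


print(thumb_path("images/dir_A/img.png", "images"))
# data/thumbs/dir_A/img.png
print(thumb_path("images/dir_B/img.png", "images"))
# data/thumbs/dir_B/img.png
```

Because the subdirectory is kept, the two img.png files no longer overwrite each other.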

All the best,

Vincent

duhaime changed the title from "Attempt to display 600,000 images" to "Handling the same image filename in multiple folders" on Jul 19, 2021
@duhaime
Contributor

duhaime commented Jul 19, 2021

I changed the title of this issue to better reflect the open task...
