Reduction in /metadata API calls for multiple datasets #309
Comments
For the record - we discussed this a while back (2 years ago), in a Slack thread, though (I just sent you the link). Let us make sure we share the arguments / conclusions here, for posterity. Btw, this is somewhat related to in terms of providing optimized interfaces for bulk-retrieving data to populate UI elements for dataset display / selection. The gist of the discussion 2 years ago was
Since then, STAC has seen more uptake and could be an alternative route to explore for exposing metadata-heavy EO data collections in a programmatic way. |
PRs welcome. I would probably accept the design proposed above if implemented cleanly:
|
During a conversation with Jonas, I proposed introducing a
I'd love to hear both of your opinions on this approach – and which one you think we should go with. I can submit a PR for either of them depending on what you decide :) Both of them work for me, and they both solve my issue, so I'm happy either way :) Thanks a lot for participating in this discussion and for sharing your thoughts. |
I think I like that. I'm not 100% happy with giving up the strict separation of endpoints (every endpoint only does one thing), but it sure is the simplest way to implement this feature. I'm not sure if this feature should be opt-in, though. I see some potential for DoS exploits by requesting metadata for an excessive number of datasets. Then again, if we cap the pagination limit it shouldn't be much more expensive than requesting tiles, so it might not matter. If we decide to make it opt-in, it would be cleaner to go with the POST proposal because we could then conditionally allow / disallow the entire endpoint. I don't think opting into specific query parameters makes for a clean API. Also, POST would be more flexible in that you can first search for lots of datasets, then only get metadata for some of them. |
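To make the "cap the pagination limit" idea concrete, here is a minimal sketch of how a server might clamp a client-supplied limit. The name `clamp_limit` and the cap of 100 are assumptions for illustration only; nothing in this thread fixes either.

```python
# Hypothetical upper bound on datasets per bulk metadata request (assumed value).
MAX_LIMIT = 100


def clamp_limit(requested: int, max_limit: int = MAX_LIMIT) -> int:
    """Return the page size to actually use for a request.

    Rejects non-positive values and silently caps anything above max_limit,
    so a single request can never ask for an unbounded number of datasets.
    """
    if requested < 1:
        raise ValueError("limit must be a positive integer")
    return min(requested, max_limit)
```

With a cap like this, the cost of one bulk request stays bounded no matter what the client sends, which is the property the DoS concern above is about.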
Same here. What would we call a new bulk metadata retrieval endpoint, though? Just want to see whether we can semantically expand the existing endpoints in a meaningful way...
Both for dataset selection and metadata property selection, POST seems cleaner to me. With GET, you would need to include the column names in the URL, and I guess the list could get long. Btw, how can a user get the names of available metadata properties (if they only have the endpoint)? By retrieving the metadata for one item and then getting the property names from the response? Another design question: Would we allow for a POST request that gives you data for all datasets (with pagination), i.e. without specifying datasets to retrieve metadata for? |
That was my suggestion. So you'd either have
I guess so. Not perfect but anything else seems like overkill.
I see no legitimate use case for this. |
Thanks for the comments! Just to make sure I'm understanding correctly - and we're all on the same page - the PR should include the following:
If I'm missing or misunderstanding anything, please let me know. I'll get started on a PR as soon as we agree on the final details – thank you both again :) |
I would just use this payload to start with:
{
  "keys": [
    ["key1a", "key2a", "key3a"],
    ["key1b", "key2b", "key3b"],
    ...
  ]
}
There's no "filtering" involved, the user is expected to call
As for the desired columns, we can just add these as a query parameter to the
This should use the same pagination as
I'm not sure. This fetches much more data than |
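As a rough sketch of the client side of the `keys` payload proposed above: the helper below turns a list of dataset records into that structure. The function name `build_keys_payload`, the key names, and the shape of the dataset records (one dict per dataset, mapping key name to key value) are assumptions for illustration, not part of the proposal.

```python
import json
from typing import Dict, List, Sequence


def build_keys_payload(
    datasets: Sequence[Dict[str, str]], key_names: Sequence[str]
) -> Dict[str, List[List[str]]]:
    """Build {"keys": [[...], ...]} with key values in the order of key_names."""
    return {"keys": [[dataset[name] for name in key_names] for dataset in datasets]}


# Hypothetical dataset records, e.g. as selected from an earlier dataset listing.
datasets = [
    {"key1": "key1a", "key2": "key2a", "key3": "key3a"},
    {"key1": "key1b", "key2": "key2b", "key3": "key3b"},
]

payload = build_keys_payload(datasets, ["key1", "key2", "key3"])
body = json.dumps(payload)  # JSON string to send as the POST request body
```

Keeping each entry as an ordered list of key values, rather than a dict, matches the payload shown above and avoids repeating the key names for every dataset.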
For a variety of projects, we've had to query /datasets and, for each dataset, query the corresponding metadata using the /metadata endpoint. This often results in a lot of requests (sometimes 200+), which causes a slight delay and some extra load that could potentially be avoided.

We save our own data in the metadata column in the metadata table. This is primarily the data we've been interested in, but I believe there have been cases where we've been interested in the other columns (excluding keys).

Would there be a solution where these requests could be grouped into one? I'm happy to elaborate and submit a PR based on any approach that is agreed upon. This would be a beneficial addition to a multitude of internal projects.
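For readers unfamiliar with the pattern, the request flow described above might look roughly like the following sketch. The `get_json` helper, the assumed response shape (`{"datasets": [...]}`), and the assumed path layout (`/metadata/<key1>/<key2>/...`) are illustrative assumptions, not details confirmed in this issue.

```python
from typing import Any, Callable, Dict, List

# Hypothetical helper type: takes a URL path and returns the decoded JSON body.
GetJson = Callable[[str], Dict[str, Any]]


def fetch_metadata_one_by_one(
    get_json: GetJson, key_names: List[str]
) -> List[Dict[str, Any]]:
    """Fetch metadata for every dataset, issuing one request per dataset."""
    # Assumed response shape: {"datasets": [{"<key name>": "<key value>", ...}, ...]}
    datasets = get_json("/datasets")["datasets"]
    results = []
    for dataset in datasets:
        # Assumed path layout: /metadata/<key1>/<key2>/...
        path = "/metadata/" + "/".join(str(dataset[name]) for name in key_names)
        results.append(get_json(path))
    return results
```

For N datasets this issues N + 1 requests (one listing call plus one metadata call each), which is where the 200+ requests mentioned above would come from.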