
Reduction in /metadata API calls for multiple datasets #309

Closed
DenizYil opened this issue Jul 28, 2023 · 8 comments · Fixed by #311
Labels
enhancement New feature or request

@DenizYil
Contributor

For a variety of projects, we've had to query /datasets and then, for each dataset, query the corresponding metadata via the /metadata endpoint. This often results in a lot of requests (sometimes 200+), which causes a noticeable delay and extra load that could potentially be avoided.

We save our own data in the metadata column in the metadata table. This is primarily the data we've been interested in, but I believe there have been cases where we've been interested in the other columns (excluding keys).

Would there be a solution where these requests could be grouped into one? I'm happy to elaborate and submit a PR based on any approach that is agreed upon. This would be a beneficial addition to a multitude of internal projects.
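To illustrate, the N+1 request pattern described above looks roughly like this (a sketch; the base URL, response shape, and key names are illustrative assumptions, not exact Terracotta output):

```python
# Sketch of the current client-side pattern: one /datasets call, then one
# /metadata call per dataset. URL and response shapes are assumptions.

def metadata_urls(base_url, datasets_response, key_names):
    """Build one /metadata URL per dataset returned by /datasets."""
    urls = []
    for ds in datasets_response["datasets"]:
        # Terracotta keys are positional segments in the URL path
        path = "/".join(ds[k] for k in key_names)
        urls.append(f"{base_url}/metadata/{path}")
    return urls

response = {"datasets": [{"type": "S2", "date": "20230728"},
                         {"type": "S2", "date": "20230729"}]}
urls = metadata_urls("https://example.com", response, ["type", "date"])
# One request per dataset -- 200+ datasets means 200+ round trips.
```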

@DenizYil DenizYil added the enhancement New feature or request label Jul 28, 2023
@j08lue
Collaborator

j08lue commented Jul 31, 2023

For the record: we discussed this a while back (2 years ago) in a Slack thread (I just sent you the link). Let us make sure we share the arguments and conclusions here, for posterity.

Btw, this is somewhat related to

in terms of providing optimized interfaces for bulk-retrieving data to populate UI elements for dataset display / selection.

The gist of the discussion 2 years ago was:

  1. You may want to do lazy loading in the client instead of retrieving all the data, to avoid long initial loading times?
  2. However, if you have to do this server-side, perhaps you could extend Terracotta for your use case (we discussed a plugin, although we do not have any guidance on how to develop plugins)
  3. This sparse and use-case-specific metadata retrieval sounds like a nice case for GraphQL, but maybe that is way too big a cannon...
  4. @dionhaefner recommended:

    From the top of my head, I would maybe make this endpoint accept a POST request with all requested keys as the request body. Perhaps also restrict which entries you want; retrieving >1k geometry footprints when all you need is cloudcover would blow up your response size.

Since then, STAC has seen more uptake and could be an alternative route to explore for exposing metadata-heavy EO data collections in a programmatic way.

@dionhaefner
Collaborator

PRs welcome. I would probably accept the design proposed above if implemented cleanly:

make this endpoint accept a POST request with all requested keys as the request body. Perhaps also restrict which entries you want; retrieving >1k geometry footprints when all you need is cloudcover would blow up your response size.

@DenizYil
Contributor Author

DenizYil commented Aug 1, 2023

During a conversation with Jonas, I proposed introducing a ?metadata=true flag to the /datasets endpoint. These are my key reasons for it:

  • This flag would not require users to make two separate API calls. With the POST option, users would essentially have to pass the response from one endpoint to another to retrieve additional data. If the response body is large (which is often the case), repeatedly sending such large payloads seems inefficient.
  • This approach ensures we don't introduce a new request method, keeping the API's design uniform.
  • From my understanding, implementing ?metadata=true seems simple: we would only need to extend the query with a JOIN when metadata is true. That seems cleaner than adding an entirely new endpoint.
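The JOIN idea can be sketched with an in-memory SQLite database (table and column names here are simplified assumptions, not Terracotta's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE datasets (type TEXT, date TEXT, path TEXT);
    CREATE TABLE metadata (type TEXT, date TEXT, metadata TEXT);
    INSERT INTO datasets VALUES ('S2', '20230728', '/rasters/a.tif');
    INSERT INTO metadata VALUES ('S2', '20230728', '{"cloudcover": 12}');
""")

# With ?metadata=true, the /datasets query would be extended with a JOIN
# so keys and metadata come back in a single round trip.
rows = conn.execute("""
    SELECT d.type, d.date, m.metadata
    FROM datasets d
    JOIN metadata m ON d.type = m.type AND d.date = m.date
""").fetchall()
```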

I'd love to hear both of your opinions on this approach – and which one you think we should go with. I can submit a PR for either of them depending on what you decide :)

Both of them work for me, and they both solve my issue, so I'm happy either way :)

Thanks a lot for participating in this discussion and for sharing your thoughts.

@dionhaefner
Collaborator

I proposed introducing a ?metadata=true flag to the /datasets endpoint

I think I like that. I'm not 100% happy with giving up the strict separation of endpoints (every endpoint only does one thing), but it sure is the simplest way to implement this feature.

I'm not sure if this feature should be opt-in, though. I see some potential for DoS exploits by requesting metadata for an excessive number of datasets. Then again, if we cap the pagination limit, it shouldn't be much more expensive than requesting tiles, so it might not matter.

If we decide to make it opt-in, it would be cleaner to go with the POST proposal because we could then conditionally allow / disallow the entire endpoint. I don't think opting into specific query parameters makes for a clean API.

Also, POST would be more flexible in that you can first search for lots of datasets, then only get metadata for some of them.

@j08lue
Collaborator

j08lue commented Aug 2, 2023

I'm not 100% happy with giving up the strict separation of endpoints (every endpoint only does one thing)

Same here. What would we call a new bulk metadata retrieval endpoint, though? Just want to see whether we can semantically expand the existing endpoints in a meaningful way...

/more-metadata? /metadata-multi? /metadatas? /metadata-reduce? Or just branch on whether /metadata receives a GET or a POST request?

Also, POST would be more flexible in that you can first search for lots of datasets, then only get metadata for some of them.

Both for dataset selection and metadata property selection, POST seems cleaner to me. With GET, you would need to include the column names in the URL, and I guess the list could get long.

Btw, how can a user get the names of available metadata properties (if all they have is the endpoint)? By retrieving the metadata for one item and then getting the property names from the response?

Another design question: Would we allow for a POST request that gives you data for all datasets (with pagination), i.e. without specifying datasets to retrieve metadata for?

@dionhaefner
Collaborator

Or just branch on whether /metadata receives a GET or a POST request?

That was my suggestion. So you'd either have GET example.com/metadata/key1/key2/key3 or POST example.com/metadata. The distinction is unambiguous because in the latter case you're not passing any keys, so they must be in the request body.
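The GET/POST distinction could be sketched with a hypothetical client helper (not Terracotta code; URL shapes follow the examples above):

```python
def metadata_request(base_url, keys=None, bodies=None):
    """Return (method, url, body) for the two proposed forms of /metadata.

    keys:   positional key values for a single dataset (GET form)
    bodies: list of key-value lists for many datasets (POST form)
    """
    if keys is not None:
        # GET: keys are path segments, no request body
        return ("GET", f"{base_url}/metadata/" + "/".join(keys), None)
    # POST: no keys in the URL, so they must be in the request body
    return ("POST", f"{base_url}/metadata", {"keys": bodies or []})

single = metadata_request("https://example.com",
                          keys=["key1", "key2", "key3"])
bulk = metadata_request("https://example.com",
                        bodies=[["key1a", "key2a"], ["key1b", "key2b"]])
```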

Btw, how can a user get the names of available metadata properties (if they only have the endpoint)? By retrieving the metadata for one item and then get the property names from the response?

I guess so. Not perfect but anything else seems like overkill.

Another design question: Would we allow for a POST request that gives you data for all datasets (with pagination), i.e. without specifying datasets to retrieve metadata for?

I see no legitimate use case for this.

@DenizYil
Contributor Author

DenizYil commented Aug 3, 2023

Thanks for the comments!

Just to make sure I'm understanding correctly - and we're all on the same page - the PR should include the following:

  • A POST example.com/metadata endpoint, where the payload is an array of objects with key/value pairs used for filtering, plus another array of keys for the response.
    • Example:
    {
      "filter": [
        { "key1": "val1" },
        { "key2": "val2", "key2-1": "val2-1" }
      ],
      "response": ["key1", "key2", "key3"]
    }
    • Please let me know if you think the payload should be different.
    • Are we adding limits to this? Either a hard-coded limit or one the user can specify?
  • This should be opt-in only.
    • Is this a requirement for the PR or something we should look into at a later date? Would there be a major performance difference between this endpoint and the /datasets endpoint, in terms of potential for DoS?
    • If this is desired for the PR, how should users be able to opt in/out?
  • Empty arrays should not be allowed.
    • I can see a potential use case for this if the number of datasets is small and all the information should be displayed in a viewer. It's probably a niche use case, though, and users will likely try to work around the restriction if they need it.

If I'm missing or misunderstanding anything, please let me know. I'll get started on a PR as soon as we agree on the final details – thank you both again :)

@dionhaefner
Collaborator

I would just use this payload to start with:

{
    "keys": [
        ["key1a", "key2a", "key3a"],
        ["key1b", "key2b", "key3b"],
        ...
    ]
}

There's no "filtering" involved; the user is expected to call /datasets, then use the response to populate the request body.
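That flow — call /datasets, then feed its response into the POST body — might look like this (a sketch; the /datasets response shape is an assumption):

```python
def build_metadata_payload(datasets_response, key_names):
    """Turn a /datasets response into the proposed POST /metadata body."""
    return {"keys": [[ds[k] for k in key_names]
                     for ds in datasets_response["datasets"]]}

response = {"datasets": [{"type": "S2", "date": "20230728"},
                         {"type": "S2", "date": "20230729"}]}
payload = build_metadata_payload(response, ["type", "date"])
# payload == {"keys": [["S2", "20230728"], ["S2", "20230729"]]}
```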

As for the desired columns, we can just add these as a query parameter to the /metadata endpoint.

Are we adding limits to this? Either a hard-coded limit or one the user can specify?

This should use the same pagination as /datasets.

Is this a requirement for the PR or something that we should look into doing at a later date? Would there be a major performance difference for this endpoint vs. the /datasets endpoint, in terms of potential for DOS?

I'm not sure. This fetches much more data than /datasets, so the queries may get expensive. Then again, pagination would probably alleviate that. We can start implementing the feature and think about limits later (maybe after some benchmarking).
