dataverse_json export contains storageIdentifier, which exposes internal configuration properties #11153

Open
johannes-darms opened this issue Jan 14, 2025 · 2 comments
Labels
Type: Bug a defect

@johannes-darms
Contributor

  • What steps does it take to reproduce the issue?

Export a dataset as dataverse_json. For example: https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/EJLMJH

storageIdentifier: "s3://10.7910/DVN/EJLMJH"
datasetVersion: {
  storageIdentifier: "s3://10.7910/DVN/EJLMJH"
}
files: [
  {
    dataFile: {
      storageIdentifier: "s3://dvn-cloud:193d764e846-2fc4cee84513"
    }
  }
]
  • To whom does it occur (all users, curators, superusers)?
    all users

  • What did you expect to happen?
    That the storageIdentifier is not part of the dataverse_json-export

Which version of Dataverse are you using?
6.5

Any related open or closed issues to this bug report?
I found none.

Are you thinking about creating a pull request for this issue?
Yes.

@johannes-darms johannes-darms added the Type: Bug a defect label Jan 14, 2025
@qqmyers
Member

qqmyers commented Jan 14, 2025

I agree that this isn't something all users need to see, but this info isn't secret and, along with info like internal database ids, can be useful for tools (and admins) that have out-of-band access to the store or db. You could probably achieve what you want with a custom JSON exporter, but there are other places where this info is available - direct upload and download involve URLs that indicate exactly where data is stored and our JSON format is used in API calls.
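As a rough illustration of what filtering this field out could look like, here is a minimal Python sketch that post-processes an already exported JSON document and drops every `storageIdentifier` key at any depth. It is not a Dataverse exporter and does not use any Dataverse API; the function name `strip_keys` and the trimmed-down sample document are assumptions made for the example, with the values taken from the report above.

```python
import json


def strip_keys(node, blocked=frozenset({"storageIdentifier"})):
    """Return a copy of `node` with every key in `blocked` removed, at any depth."""
    if isinstance(node, dict):
        return {k: strip_keys(v, blocked) for k, v in node.items() if k not in blocked}
    if isinstance(node, list):
        return [strip_keys(item, blocked) for item in node]
    # Scalars (strings, numbers, booleans, None) are returned unchanged.
    return node


# Hypothetical, heavily trimmed stand-in for a dataverse_json export.
export = {
    "storageIdentifier": "s3://10.7910/DVN/EJLMJH",
    "datasetVersion": {"storageIdentifier": "s3://10.7910/DVN/EJLMJH"},
    "files": [
        {"dataFile": {"storageIdentifier": "s3://dvn-cloud:193d764e846-2fc4cee84513"}}
    ],
}

cleaned = strip_keys(export)
print(json.dumps(cleaned, indent=2))
```

The input document is left untouched, so the unfiltered form stays available for tools and admins that do want the storage information.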

FWIW: I basically consider the JSON and OAI_ORE exports as intended for machines versus the DDI, DataCite exports that are intended more for humans.

@johannes-darms
Contributor Author

johannes-darms commented Jan 17, 2025

Yes, it's clear that no secret is being revealed here, but it's also not a path that a non-administrator could use directly to access the object or debug storage problems. However, the dataset and dataset version paths exclude the bucket_name, while the data file path includes the bucket_name but omits the dataset prefix. As a result, I need to combine the two to locate the object in question. I also need to reference the configuration to determine the correct storage URL. (This hasn't been tested, but I suspect this issue could become more complex if multiple storage providers are configured, as it would be necessary to guess which one is being used for the dataset).
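The combination step described above can be sketched in Python as follows. This assumes the layout seen in this report: the dataset identifier carries the dataset prefix without a bucket, and the file identifier has the form `bucket:object-id`. The function names are hypothetical, the resulting key layout is an assumption that has not been checked against a real store, and the storage endpoint URL still has to come from the instance configuration.

```python
def parse_storage_identifier(identifier):
    """Split '<label>://<rest>' into (label, rest)."""
    label, sep, rest = identifier.partition("://")
    if not sep:
        raise ValueError(f"no '://' separator in {identifier!r}")
    return label, rest


def locate_object(dataset_identifier, file_identifier):
    """Combine a dataset and a file storageIdentifier into (label, bucket, key).

    Assumption: the object key is '<dataset prefix>/<object id>'.
    """
    ds_label, prefix = parse_storage_identifier(dataset_identifier)
    file_label, rest = parse_storage_identifier(file_identifier)
    if ds_label != file_label:
        raise ValueError("dataset and file identifiers use different store labels")
    bucket, sep, object_id = rest.partition(":")
    if not sep:
        raise ValueError(f"no bucket in file identifier {file_identifier!r}")
    return file_label, bucket, f"{prefix}/{object_id}"


label, bucket, key = locate_object(
    "s3://10.7910/DVN/EJLMJH",
    "s3://dvn-cloud:193d764e846-2fc4cee84513",
)
print(label, bucket, key)
```

Neither identifier is sufficient on its own: the bucket only appears in the file identifier and the dataset prefix only in the dataset identifier.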

My main concern is to avoid situations where users retrieve this information via the interface and ask how to use it, only to be told "you can't use that information". The exception would be if the instance supports download or upload redirects (dataverse.files..download-redirect or dataverse.files..upload-redirect), which are disabled by default. (But even then, these links may not be valid.)

If the current implementation remains unchanged, I'd like to update the documentation to ensure that admins are aware that the bucket_name is visible to all API users. This would allow them to choose an appropriate bucket name accordingly.

@pdurbin pdurbin changed the title datavers_json export contains storageIdentifier, which exposes internal configuration properties dataverse_json export contains storageIdentifier, which exposes internal configuration properties Jan 17, 2025