Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement opportunistic mirror replicator #2510

Closed
wants to merge 1 commit into from

Conversation

masih
Copy link
Member

@masih masih commented Feb 12, 2024

Implement a file store that opportunistically replicates data.

Implement a file store that opportunistically replicates data.
@masih masih requested review from willscott and gammazero February 12, 2024 17:23
Copy link
Collaborator

@gammazero gammazero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, the mirror was designed to talk a separate read and write Filestore to facilitate replication. We just needed a way to configure that. The idea was that once the indexer caught up, then the read and the write store would become the same Filestore.

This replicator sets up another mirror as a alternat source if data if the data is not already present in the destination. The only case where data to be indexed would be available in the destination is if a new indexer is re-indexing.

@@ -13,6 +13,8 @@ type Config struct {
Local LocalConfig
// S3 configures storing files in S3.
S3 S3Config
// Replicator configures opportunistic mirror replication from a source onto a destination.
Replicator *ReplicatorConfig
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the Replicator should be at a higher level in the config hierarchy, as a peer of filestore.Config, and that instead of a ReplicaConfig type, the Source and Destination should each be of type filestore.Config.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had that implemented originally, and settled on a dedicated config to avoid the opportunity to define recursive replica source/description.

e <- dee
}
return
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary else following return. Same below.

}
return
} else {
break DestinationList
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Break not needed, since that is where loop iteration continues naturally.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intend of the break is to not continue the iteration, and fall back on listing from source. No?

//
// If Destination returns an error first, then fallback on listing from Source.

{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary brace?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only there to separate the scope of working with destination vs source.

Comment on lines +95 to +104
if ok {
c <- dcc
listedAtLeastOneFile = true
} else {
if listedAtLeastOneFile {
return
} else {
break DestinationList
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ok {
c <- dcc
listedAtLeastOneFile = true
} else {
if listedAtLeastOneFile {
return
} else {
break DestinationList
}
}
if ok {
c <- dcc
listedAtLeastOneFile = true
} else if listedAtLeastOneFile {
return
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the suggested commit results in listing from destination to continue, which is not what I intend to do. Unless I have missed something?

// Get Attempts to get from Replicator.Destination first, and if path does nto exist falls back onto Replicator.Source.
// If the path is found at source, it is replicated onto destination first and then the replica is returned.
func (r *Replicator) Get(ctx context.Context, path string) (*File, io.ReadCloser, error) {
dFile, dRc, dErr := r.Destination.Get(ctx, path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filestore is only used during ingestion of new index data, as a faster alternative to reading from the publisher. The only case where data would be available in the destination is if a new indexer is re-indexing. In which case, wouldn't it be better to support multiple sources (that are not the destination) instead of trying to read from the destination as one of the sources?

@masih masih marked this pull request as draft February 12, 2024 18:28
@masih
Copy link
Member Author

masih commented Feb 13, 2024

Further to 1:1 discussion, ingest engine calls Put (and Head by extension) on every processed ad and its entries regardless of where the original ad came from.

This means for the purposes of populating the bare metal mirror we simply need to separate the config for read and write.

Future work will revisit the file store and ingest engine for a more optimal approach where the context of where the ad came from is taken into consideration and therefore we can avoid unnecessary calls to Put.

We should also consider encapsulating much of the complexity here into a simple sync API, such that the user is given an option to configure a list of mirrors with some priority ad the sync-er simply gets the ads back as that is the only thing that the ad sync mechanism is concerned with.

@gammazero
Copy link
Collaborator

Future work will revisit the file store and ingest engine for a more optimal approach where the context of where the ad came from is taken into consideration and therefore we can avoid unnecessary calls to Put.

This has now been implemented.

sync API, such that the user is given an option to configure a list of mirrors with some priority ad the sync-er simply gets the ads back as that is the only thing that the ad sync mechanism is concerned with.

Requested by issue #2613

Closing, since there is nothing more to do in this PR.

@gammazero gammazero closed this May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants