Implement tunable batching of queries/inserts #40
Labels
enhancement
New feature or request
Comments
chadlwilson added a commit that referenced this issue on Nov 16, 2021
chadlwilson added a commit that referenced this issue on Nov 16, 2021
chadlwilson added a commit that referenced this issue on Nov 16, 2021:
Micronaut Data seems to have a lot of limitations when dealing with EmbeddedId and Embedded, so let's see how far we can get without them.
chadlwilson added a commit that referenced this issue on Nov 16, 2021:
- Divides target rows into batches
- Batch-searches for whether they already exist from the source
- If they do, updates them one-by-one (no easy way with Micronaut Data to do these updates in bulk just yet)
- Any remaining rows which do not already exist are batch inserted

This still doesn't seem quite right:
- Large batch sizes cause things to lock up and halt - some kind of exhaustion?
- The slowest part is now the updates to existing rows, which are done one-by-one
- If all rows in the target are matched to the source, the target metadata is missing
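For illustration, a minimal sketch of the flow described above using Project Reactor; the `TargetRecord` and `TargetRecordStore` types and their method names are hypothetical stand-ins, not the project's actual code:

```kotlin
import reactor.core.publisher.Flux
import reactor.core.publisher.Mono

// Hypothetical shapes for illustration only; not the project's actual types.
data class TargetRecord(val migrationKey: String, val data: String)

interface TargetRecordStore {
    fun findExistingKeys(keys: List<String>): Flux<String>       // bulk existence check per batch
    fun update(record: TargetRecord): Mono<TargetRecord>         // one-by-one update
    fun saveAll(records: List<TargetRecord>): Flux<TargetRecord> // batch insert
}

fun saveTargetRows(
    rows: Flux<TargetRecord>,
    store: TargetRecordStore,
    batchSize: Int
): Flux<TargetRecord> =
    rows.buffer(batchSize) // divide the incoming rows into batches
        .concatMap { batch ->
            // batch-search for rows in this batch that already exist from the source
            store.findExistingKeys(batch.map { it.migrationKey })
                .collectList()
                .flatMapMany { existing ->
                    val (toUpdate, toInsert) = batch.partition { it.migrationKey in existing }
                    Flux.concat(
                        // existing rows updated one-by-one (no bulk update yet)
                        Flux.fromIterable(toUpdate).concatMap(store::update),
                        // remaining rows batch inserted
                        store.saveAll(toInsert)
                    )
                }
        }
```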
chadlwilson added a commit that referenced this issue on Nov 16, 2021
chadlwilson added a commit that referenced this issue on Nov 16, 2021:
…the concurrency of `flatMap`

Previously thousands of elements were streaming out of the batches and causing parallel attempts to get DB connections. When combined with updating existing rows, this was causing connection exhaustion and even deadlocks; I think because different batches were each taking a connection to bulk-check for existing rows, and then updating those rows one-by-one, which could deadlock if no connection was available for them to proceed. I suspect we still need to do more here (limit concurrency on the update-rows `flatMap`?), otherwise it's still possible for different rec runs in parallel to deadlock each other due to connection starvation.
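As a sketch of the mitigation being discussed here, Reactor's `flatMap` overload that takes a concurrency argument can bound how many batches (and therefore DB connections) are in flight at once; the names and the concurrency value below are illustrative assumptions, not the project's code:

```kotlin
import reactor.core.publisher.Flux
import reactor.core.publisher.Mono

// Illustrative only: cap how many batches are processed concurrently so that
// parallel inner subscribers cannot exhaust the R2DBC connection pool and
// end up deadlocking each other while waiting for connections.
const val BATCH_CONCURRENCY = 4 // hypothetical tuning value

fun <T, R> processBatches(
    batches: Flux<List<T>>,
    handleBatch: (List<T>) -> Mono<R>
): Flux<R> =
    // The second argument to flatMap bounds the number of in-flight inner
    // publishers, instead of letting thousands of elements race for connections.
    batches.flatMap({ batch -> handleBatch(batch) }, BATCH_CONCURRENCY)
```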
chadlwilson added a commit that referenced this issue on Nov 17, 2021:
Micronaut Data seems to have a lot of limitations dealing with `EmbeddedId` and `Embedded` types in general when using JPA-style repository methods that are auto-implemented by Micronaut Data, so let's see how far we can get without them. For example, when trying to do bulk finds by embedded ID, it was not able to generate the queries correctly. While it's probably a bug, it doesn't seem a major focus right now (micronaut-projects/micronaut-data#594 and micronaut-projects/micronaut-data#768). The downside of adding a surrogate key is extra data to store and manage, and an extra index to update when bulk updating, which is why I'd tried to avoid it originally. Short of this change, the other option would be to switch to Micronaut Data with JPA, with Hibernate underneath, rather than using Micronaut Data directly.
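A rough sketch of what the change looks like at the entity level, assuming Micronaut Data's `@MappedEntity` mapping: the composite `@EmbeddedId` gives way to a generated surrogate key, with the former key columns kept as plain fields. The entity, table and column names are hypothetical, not the project's actual schema:

```kotlin
import io.micronaut.data.annotation.GeneratedValue
import io.micronaut.data.annotation.Id
import io.micronaut.data.annotation.MappedEntity

// Hypothetical entity shape, not the project's actual schema: the composite
// @EmbeddedId is replaced by a generated surrogate key, and the columns that
// used to form the embedded id become ordinary (indexed) fields.
@MappedEntity("reconciliation_record")
data class ReconciliationRecord(
    @field:Id
    @field:GeneratedValue
    val id: Long? = null,          // surrogate key: extra data + an extra index to maintain
    val recordSetId: Long,         // formerly part of the embedded id
    val migrationKey: String,      // formerly part of the embedded id
    val sourceData: String? = null,
    val targetData: String? = null
)
```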
chadlwilson added a commit that referenced this issue on Nov 17, 2021:
- Divides target rows into batches
- Batch-searches for whether they already exist from the source
- If they do, updates them one-by-one (haven't done bulk updates with Micronaut Data just yet)
- Any remaining rows which were not found and updated are batch inserted

Still to resolve:
- Large batch sizes cause things to lock up and halt - some kind of exhaustion, probably of connections
- The slowest part is still the updates to existing rows, which are done one-by-one
chadlwilson added a commit that referenced this issue on Nov 17, 2021
chadlwilson added a commit that referenced this issue on Nov 17, 2021:
This does seem to improve performance a lot, albeit perhaps not as much as if we had an easy way to do bulk updates on only the necessary columns, rather than using `updateAll`. I believe this would require writing a custom query, or possibly custom SQL, with Micronaut Data.
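For context, a hedged sketch contrasting the two options: the inherited `updateAll`, which writes whole entities, versus a hand-written `@Query` that touches only the needed columns. The repository reuses the hypothetical `ReconciliationRecord` entity sketched above, and the custom query is an assumption about how this could be written, not the project's code:

```kotlin
import io.micronaut.data.annotation.Query
import io.micronaut.data.model.query.builder.sql.Dialect
import io.micronaut.data.r2dbc.annotation.R2dbcRepository
import io.micronaut.data.repository.reactive.ReactorCrudRepository
import reactor.core.publisher.Mono

// Illustrative repository reusing the hypothetical ReconciliationRecord entity above.
@R2dbcRepository(dialect = Dialect.POSTGRES)
interface ReconciliationRecordRepository : ReactorCrudRepository<ReconciliationRecord, Long> {

    // Current approach: the inherited updateAll(entities) rewrites every mapped
    // column of each entity, even the ones that have not changed.

    // Possible alternative (an assumption, not the project's code): a custom
    // query that updates only the column that actually needs to change,
    // returning the number of rows affected.
    @Query("UPDATE reconciliation_record SET target_data = :targetData WHERE id = :id")
    fun updateTargetData(id: Long, targetData: String): Mono<Long>
}
```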
Performance with batching implemented

Environment, configuration and 75th percentile results were recorded here (tables not captured).

Not done, to be looked at later.
Context / Goal

At time of writing there are no optimizations performed on …

After #26 it is likely that we will be in a position to experiment with different strategies from a performance perspective.

Expected Outcome

- `fetchSize`, which can be overridden at config level or dataset level
- `batchSize`, and implementation in the stream processing to allow this via R2DBC (a sketch of such settings follows under "Additional context / implementation notes" below)

Out of Scope

Additional context / implementation notes
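As a purely hypothetical illustration of the expected outcome, the tunable settings could be modelled as Micronaut configuration properties with defaults that a dataset definition overrides; none of the property names, prefixes or defaults below come from the project:

```kotlin
import io.micronaut.context.annotation.ConfigurationProperties

// Purely hypothetical shape for the tunable settings described above: global
// defaults (e.g. under a `reconciliation.batching` prefix in application.yml)
// that individual dataset definitions could override.
@ConfigurationProperties("reconciliation.batching")
class BatchingConfiguration {
    // Number of rows requested from the database per fetch over R2DBC.
    var fetchSize: Int = 1000

    // Number of rows grouped per bulk existence check / insert when saving results.
    var batchSize: Int = 1000
}
```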