
Implement tunable batching of queries/inserts #40

Closed
chadlwilson opened this issue Oct 25, 2021 · 1 comment · Fixed by #63
Assignees
Labels
enhancement New feature or request

Comments

@chadlwilson
Collaborator

Context / Goal

At the time of writing, no optimizations are performed on:

  • fetch size for dataset queries off the source/target
  • insert batching into the Recce DB for source rows
  • update batching into the Recce DB for target rows

After #26 it is likely that we will be in a position to experiment with different strategies from a performance perspective.

Expected Outcome

  • If advantageous, add a default query fetchSize which can be overridden at config level or dataset level
  • If advantageous, add a default insert/update batchSize and an implementation in the stream processing to allow this via R2DBC (see the configuration sketch below)
    • transactional handling may be required here
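As a rough illustration of the tunables above, a Micronaut configuration-properties class along these lines could hold the defaults. The prefix, property names and default values here are assumptions for illustration, not Recce's actual configuration.

```kotlin
// Hypothetical sketch only: the prefix, names and defaults are illustrative,
// not the actual Recce configuration keys.
import io.micronaut.context.annotation.ConfigurationProperties

@ConfigurationProperties("reconciliation.defaults")
class BatchingConfiguration {
    // Rows requested per round trip when streaming datasets off the source/target DBs.
    var fetchSize: Int = 1000

    // Rows buffered per insert/update batch against the Recce DB.
    var batchSize: Int = 1000

    // Maximum batches processed concurrently, to avoid exhausting the R2DBC connection pool.
    var batchConcurrency: Int = 5
}
```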

Out of Scope

Additional context / implementation notes

chadlwilson added the "enhancement" label Oct 25, 2021
chadlwilson self-assigned this Nov 12, 2021
chadlwilson added a commit that referenced this issue Nov 16, 2021
chadlwilson added a commit that referenced this issue Nov 16, 2021
Micronaut Data seems to have a lot of limitations trying to deal with EmbeddedId and Embedded, so let's see how far we can get without it.
chadlwilson added a commit that referenced this issue Nov 16, 2021
- divides each target row into batches
- batch searches for whether they already exist from source
- if they do, updates them one-by-one (no easy way with Micronaut Data to do these updates in bulk just yet)
- any remaining rows which do not are batch inserted

This still doesn't seem quite right:
- large batch sizes cause things to lock up and halt - some kind of exhaustion?
- the slowest part is now the updates to existing rows, which are done one-by-one
- if all rows in target are matched to source, the target metadata is missing
chadlwilson added a commit that referenced this issue Nov 16, 2021
…the concurrency of `flatMap`

Previously thousands of elements were streaming out of the batches and causing parallel attempts to get DB connections. When combined with updating existing rows, this was causing connection exhaustion and actual deadlocks; I think because different batches were each taking a connection to bulk-look for existing rows and then updating those rows one-by-one, which could lead to deadlock if there was no connection available for them to proceed.

Suspect we still need to do more here (limit concurrency on the update-rows flatMap?); otherwise it's still possible for different rec runs executing in parallel to deadlock on each other due to connection starvation.
chadlwilson added a commit that referenced this issue Nov 17, 2021
Micronaut Data seems to have a lot of limitations dealing with `EmbeddedId` and `Embedded` types in general when using JPA-style repository methods auto-implemented by Micronaut Data, so let's see how far we can get without them. For example, when trying to do bulk finds by embedded ID, it was not able to generate the queries correctly. While it's probably a bug, it doesn't seem a major focus right now (micronaut-projects/micronaut-data#594 and micronaut-projects/micronaut-data#768)

The downside of adding a surrogate key is extra data to store and manage, and an extra index to update when bulk updating, which is why I'd tried to avoid it originally. Short of this change, the other option would be to switch to Micronaut Data with JPA (Hibernate underneath) rather than using Micronaut Data directly.
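For illustration, the surrogate-key shape discussed here might look roughly like the following. The entity and column names are assumptions for the sketch, not Recce's actual schema.

```kotlin
// Hypothetical illustration of the surrogate-key approach, replacing an @EmbeddedId
// composite key; entity and column names are assumptions, not Recce's schema.
import io.micronaut.data.annotation.GeneratedValue
import io.micronaut.data.annotation.Id
import io.micronaut.data.annotation.MappedEntity

@MappedEntity("reconciliation_record")
data class ReconciliationRecord(
    // DB-generated surrogate primary key, instead of an embedded composite of
    // (reconciliationRunId, migrationKey) that Micronaut Data struggles to query in bulk.
    @field:Id @field:GeneratedValue val id: Long? = null,
    val reconciliationRunId: Long,
    val migrationKey: String,
    val sourceData: String? = null,
    val targetData: String? = null,
)
```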
chadlwilson added a commit that referenced this issue Nov 17, 2021
- divides each target row into batches
- batch searches for whether they already exist from source
- if they do, updates them one-by-one (haven't done bulk updates with Micronaut Data just yet)
- any remaining rows which were not found and updated are batch inserted

Still to resolve:
- large batch sizes cause things to lock up and halt - some kind of exhaustion, probably of connections
- the slowest part remains the updates to existing rows, which are done one-by-one
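A minimal sketch of the flow described in this commit message, assuming a Reactor pipeline; the `Row` and `RowRepository` types here are hypothetical stand-ins, not Recce's actual classes.

```kotlin
import reactor.core.publisher.Flux
import reactor.core.publisher.Mono

// Hypothetical stand-ins for the actual row type and repository.
data class Row(val key: String, val data: String)

interface RowRepository {
    fun findExistingKeys(keys: List<String>): Flux<String>
    fun update(row: Row): Mono<Row>
    fun insertAll(rows: List<Row>): Flux<Row>
}

// Sketch of the batching flow: batch, check for rows already loaded from source,
// update the matches, batch-insert the remainder.
fun saveTargetRows(rows: Flux<Row>, repo: RowRepository, batchSize: Int = 1000): Flux<Row> =
    rows.buffer(batchSize)                                    // divide target rows into batches
        .concatMap { batch ->
            repo.findExistingKeys(batch.map { it.key })       // batch search: which keys exist from source?
                .collectList()
                .flatMapMany { existing ->
                    val (toUpdate, toInsert) = batch.partition { it.key in existing }
                    Flux.concat(
                        Flux.fromIterable(toUpdate).concatMap { repo.update(it) }, // update one-by-one
                        repo.insertAll(toInsert),                                  // batch insert the rest
                    )
                }
        }
```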
chadlwilson added a commit that referenced this issue Nov 17, 2021
…the concurrency of `flatMap`

Previously thousands of elements were streaming out of the batches and causing parallel attempts to get DB connections. When combined with updating existing rows, this was causing connection exhaustion and actual deadlocks; I think because different batches were each taking a connection to bulk-look for existing rows and then updating those rows one-by-one, which could lead to deadlock if there was no connection available for them to proceed.

Suspect we still need to do more here (limit concurrency on the update-rows flatMap?); otherwise it's still possible for different rec runs executing in parallel to deadlock on each other due to connection starvation.
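As a rough sketch of the idea: Reactor's `flatMap` overload that takes a concurrency argument caps how many batches compete for connections at once. The function and parameter names below are illustrative rather than Recce's actual code.

```kotlin
import reactor.core.publisher.Flux

// Minimal sketch of bounding batch concurrency, assuming a batchConcurrency setting of 5;
// names are illustrative, not Recce's actual code.
fun <T, R> processBatches(
    batches: Flux<List<T>>,
    batchConcurrency: Int = 5,
    handleBatch: (List<T>) -> Flux<R>,
): Flux<R> =
    // The second argument limits how many inner publishers are subscribed concurrently,
    // so at most `batchConcurrency` batches hold R2DBC connections at the same time.
    batches.flatMap({ batch -> handleBatch(batch) }, batchConcurrency)
```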
chadlwilson added a commit that referenced this issue Nov 17, 2021
This does seem to improve performance a lot, albeit perhaps not by as much as if we had an easy way to do bulk updates on only the necessary columns, rather than using `updateAll`. I believe this would require writing a custom query, or possibly custom SQL, with Micronaut Data.
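A hedged sketch of what that "custom query" alternative might look like with Micronaut Data's `@Query` annotation, updating only the column that changes rather than every column via `updateAll`. The entity repeats the hypothetical sketch earlier in this thread; the table, column and method names are assumptions, not Recce's actual schema or repository.

```kotlin
import io.micronaut.data.annotation.GeneratedValue
import io.micronaut.data.annotation.Id
import io.micronaut.data.annotation.MappedEntity
import io.micronaut.data.annotation.Query
import io.micronaut.data.model.query.builder.sql.Dialect
import io.micronaut.data.r2dbc.annotation.R2dbcRepository
import io.micronaut.data.repository.GenericRepository
import reactor.core.publisher.Mono

// Same hypothetical entity as sketched above.
@MappedEntity("reconciliation_record")
data class ReconciliationRecord(
    @field:Id @field:GeneratedValue val id: Long? = null,
    val migrationKey: String,
    val targetData: String? = null,
)

@R2dbcRepository(dialect = Dialect.POSTGRES)
interface ReconciliationRecordRepository : GenericRepository<ReconciliationRecord, Long> {

    // Update only the target-side column for one row; returns the number of rows affected.
    @Query("UPDATE reconciliation_record SET target_data = :targetData WHERE id = :id")
    fun updateTargetData(id: Long, targetData: String?): Mono<Long>
}
```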
@chadlwilson
Collaborator Author

chadlwilson commented Nov 17, 2021

Performance with batching implemented

Environment

  • on dev machine
  • OOTB Postgres containers with no customisation
  • Docker has 10 cores available
  • all DBs inside Docker, Recce inside Docker (local Docker image build, no Gradle)

Configuration

batchSize = 1000
batchConcurrency = 5 

"source-only" rows = 50000
"target-only" rows = 50000
"both" mismatched rows = 50000
"all" rows = 150000 (50,000 in source only, 50,0000 in target only, 50,000 in both)

Modifying the datasetId in the command below for the big-table-postgres scenario:
hey -c 1 -n 10 -t 3600 -m POST -d '{ "datasetId": "all" }' -T 'application/json' http://localhost:8080/runs

Reporting 75th percentile results:

Dataset | Before batching | After batching
--- | --- | ---
source rows only | 75% in 42s | 75% in 8.3s
target rows only | 75% in 78s | 75% in 16s
both | 75% in 112s | 75% in 45.5s
all | 75% in 192s | 75% in 61s

Not done, to be looked at later

  • fetchSize from the source/target DBs seemed to make no effective difference to timings, so it was not added at this stage.
    • It is possible that the relevant R2DBC drivers have a different way of interacting with the underlying DB, so perhaps this is not so relevant in an R2DBC world where the streaming can offset any round-trip latency?
    • Alternatively, it is possible this is more relevant with a DB remote to Recce, where latency is more of a factor. This should be looked at later when testing in a more representative environment.
  • batchSize was not specifically tuned to find an optimal size, since the testing was all done locally. Again, such tuning is better done against a real deployment of a DB in a given cloud environment, or similar. 1000 rows was just a guess; it was a bit faster than 100 rows, but more tuning is necessary for a given "real" environment.
  • batchConcurrency was mainly tuned to avoid R2DBC Recce DB connection exhaustion. More testing can/should be done:
    • Locally, this tends to push load to the DB during the "target update" stage, with a concurrency of 5 leading to 500% CPU usage on the DB container.
    • with parallel recs running, to ensure connection exhaustion can be managed (possibly there need to be limits on how many recs run in parallel?)
  • The slowest part is the "find rows already loaded from source, then update based on target" step. Since in a "well reconciled" source/target combination this should constitute 100% of the rows, it seems focusing on performance here is important. It might be that the Recce DB reconciliation_record design should be re-examined to avoid actually having to find+update existing rows. Will summarise that idea in another card.
