
Implement tunable batching of queries/inserts #40

Closed
chadlwilson opened this issue Oct 25, 2021 · 1 comment · Fixed by #63
Assignees
Labels
enhancement New feature or request

Comments

@chadlwilson
Collaborator

Context / Goal

At the time of writing, no optimizations are performed on:

  • fetch size for dataset queries off the source/target
  • insert batching into the Recce DB for source rows
  • update batching into the Recce DB for target rows

After #26 it is likely that we will be in a position to experiment with different strategies from a performance perspective.

Expected Outcome

  • If advantageous, add a default query fetchSize which can be overridden at config level or dataset level
  • If advantageous, add a default insert/update batchSize and an implementation in the stream processing to allow this via R2DBC (see the configuration sketch below)
    • transactional handling may be required here
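As a rough illustration of the tunables above, a Micronaut configuration-properties class along these lines could hold the defaults. The prefix, property names and default values here are assumptions for illustration, not Recce's actual configuration.

```kotlin
// Hypothetical sketch only: the prefix, names and defaults are illustrative,
// not the actual Recce configuration keys.
import io.micronaut.context.annotation.ConfigurationProperties

@ConfigurationProperties("reconciliation.defaults")
class BatchingConfiguration {
    // Rows requested per round trip when streaming datasets off the source/target DBs.
    var fetchSize: Int = 1000

    // Rows buffered per insert/update batch against the Recce DB.
    var batchSize: Int = 1000

    // Maximum batches processed concurrently, to avoid exhausting the R2DBC connection pool.
    var batchConcurrency: Int = 5
}
```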

Out of Scope

Additional context / implementation notes

chadlwilson added the "enhancement" label Oct 25, 2021
chadlwilson self-assigned this Nov 12, 2021
chadlwilson added a commit that referenced this issue Nov 16, 2021
chadlwilson added a commit that referenced this issue Nov 16, 2021
Micronaut Data seems to have a lot of limitations trying to deal with EmbeddedId and Embedded, so let's see how far we can get without it.
chadlwilson added a commit that referenced this issue Nov 16, 2021
- divides each target row into batches
- batch searches for whether they already exist from source
- if they do, updates them one-by-one (no easy way with Micronaut Data to do these updates in bulk just yet)
- any remaining rows which do not are batch inserted

This still doesn't seem quite right:
- large batch sizes cause things to lock up and halt - some kind of exhaustion?
- the slowest part is now the updates to existing rows, which are done one-by-one
- if all rows in target are matched to source, the target metadata is missing
chadlwilson added a commit that referenced this issue Nov 16, 2021
…the concurrency of `flatMap`

Previously thousands of elements were streaming out of the batches and causing parallel attempts to get DB connections. When combined with updating existing rows, this was causing connection exhaustion and actual deadlocks; I think because different batches were each taking a connection to bulk-look for existing rows and then updating those rows one-by-one, which could lead to deadlock if there was no connection available for them to proceed.

Suspect we still need to do more here (limit concurrency on the update-rows flatMap?); otherwise it's still possible for different rec runs executing in parallel to deadlock on each other due to connection starvation.
chadlwilson added a commit that referenced this issue Nov 17, 2021
Micronaut Data seems to have a lot of limitations dealing with `EmbeddedId` and `Embedded` types in general when using JPA-style repository methods auto-implemented by Micronaut Data, so let's see how far we can get without them. For example, when trying to do bulk finds by embedded ID, it was not able to generate the queries correctly. While it's probably a bug, it doesn't seem a major focus right now (micronaut-projects/micronaut-data#594 and micronaut-projects/micronaut-data#768)

The downside of adding a surrogate key is extra data to store and manage, and an extra index to update when bulk updating, which is why I'd tried to avoid it originally. Short of this change, the other option would be to switch to Micronaut Data with JPA (Hibernate underneath) rather than using Micronaut Data directly.
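For illustration, the surrogate-key shape discussed here might look roughly like the following. The entity and column names are assumptions for the sketch, not Recce's actual schema.

```kotlin
// Hypothetical illustration of the surrogate-key approach, replacing an @EmbeddedId
// composite key; entity and column names are assumptions, not Recce's schema.
import io.micronaut.data.annotation.GeneratedValue
import io.micronaut.data.annotation.Id
import io.micronaut.data.annotation.MappedEntity

@MappedEntity("reconciliation_record")
data class ReconciliationRecord(
    // DB-generated surrogate primary key, instead of an embedded composite of
    // (reconciliationRunId, migrationKey) that Micronaut Data struggles to query in bulk.
    @field:Id @field:GeneratedValue val id: Long? = null,
    val reconciliationRunId: Long,
    val migrationKey: String,
    val sourceData: String? = null,
    val targetData: String? = null,
)
```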
chadlwilson added a commit that referenced this issue Nov 17, 2021
- divides each target row into batches
- batch searches for whether they already exist from source
- if they do, updates them one-by-one (haven't done bulk updates with Micronaut Data just yet)
- any remaining rows which were not found and updated are batch inserted

Still to resolve:
- large batch sizes cause things to lock up and halt - some kind of exhaustion, probably of connections
- the slowest part remains the updates to existing rows, which are done one-by-one
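A minimal sketch of the flow described in this commit message, assuming a Reactor pipeline; the `Row` and `RowRepository` types here are hypothetical stand-ins, not Recce's actual classes.

```kotlin
import reactor.core.publisher.Flux
import reactor.core.publisher.Mono

// Hypothetical stand-ins for the actual row type and repository.
data class Row(val key: String, val data: String)

interface RowRepository {
    fun findExistingKeys(keys: List<String>): Flux<String>
    fun update(row: Row): Mono<Row>
    fun insertAll(rows: List<Row>): Flux<Row>
}

// Sketch of the batching flow: batch, check for rows already loaded from source,
// update the matches, batch-insert the remainder.
fun saveTargetRows(rows: Flux<Row>, repo: RowRepository, batchSize: Int = 1000): Flux<Row> =
    rows.buffer(batchSize)                                    // divide target rows into batches
        .concatMap { batch ->
            repo.findExistingKeys(batch.map { it.key })       // batch search: which keys exist from source?
                .collectList()
                .flatMapMany { existing ->
                    val (toUpdate, toInsert) = batch.partition { it.key in existing }
                    Flux.concat(
                        Flux.fromIterable(toUpdate).concatMap { repo.update(it) }, // update one-by-one
                        repo.insertAll(toInsert),                                  // batch insert the rest
                    )
                }
        }
```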
chadlwilson added a commit that referenced this issue Nov 17, 2021
…the concurrency of `flatMap`

Previously thousands of elements were streaming out of the batches and causing parallel attempts to get DB connections. When combined with updating existing rows, this was causing connection exhaustion and actual deadlocks; I think because different batches were each taking a connection to bulk-look for existing rows and then updating those rows one-by-one, which could lead to deadlock if there was no connection available for them to proceed.

Suspect we still need to do more here (limit concurrency on the update-rows flatMap?); otherwise it's still possible for different rec runs executing in parallel to deadlock on each other due to connection starvation.
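As a rough sketch of the idea: Reactor's `flatMap` overload that takes a concurrency argument caps how many batches compete for connections at once. The function and parameter names below are illustrative rather than Recce's actual code.

```kotlin
import reactor.core.publisher.Flux

// Minimal sketch of bounding batch concurrency, assuming a batchConcurrency setting of 5;
// names are illustrative, not Recce's actual code.
fun <T, R> processBatches(
    batches: Flux<List<T>>,
    batchConcurrency: Int = 5,
    handleBatch: (List<T>) -> Flux<R>,
): Flux<R> =
    // The second argument limits how many inner publishers are subscribed concurrently,
    // so at most `batchConcurrency` batches hold R2DBC connections at the same time.
    batches.flatMap({ batch -> handleBatch(batch) }, batchConcurrency)
```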
chadlwilson added a commit that referenced this issue Nov 17, 2021
This does seem to improve performance a lot, albeit perhaps not by as much as if we had an easy way to do bulk updates on only the necessary columns, rather than using `updateAll`. I believe this would require writing a custom query, or possibly custom SQL, with Micronaut Data.
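A hedged sketch of what that "custom query" alternative might look like with Micronaut Data's `@Query` annotation, updating only the column that changes rather than every column via `updateAll`. The entity repeats the hypothetical sketch earlier in this thread; the table, column and method names are assumptions, not Recce's actual schema or repository.

```kotlin
import io.micronaut.data.annotation.GeneratedValue
import io.micronaut.data.annotation.Id
import io.micronaut.data.annotation.MappedEntity
import io.micronaut.data.annotation.Query
import io.micronaut.data.model.query.builder.sql.Dialect
import io.micronaut.data.r2dbc.annotation.R2dbcRepository
import io.micronaut.data.repository.GenericRepository
import reactor.core.publisher.Mono

// Same hypothetical entity as sketched above.
@MappedEntity("reconciliation_record")
data class ReconciliationRecord(
    @field:Id @field:GeneratedValue val id: Long? = null,
    val migrationKey: String,
    val targetData: String? = null,
)

@R2dbcRepository(dialect = Dialect.POSTGRES)
interface ReconciliationRecordRepository : GenericRepository<ReconciliationRecord, Long> {

    // Update only the target-side column for one row; returns the number of rows affected.
    @Query("UPDATE reconciliation_record SET target_data = :targetData WHERE id = :id")
    fun updateTargetData(id: Long, targetData: String?): Mono<Long>
}
```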
@chadlwilson
Collaborator Author

chadlwilson commented Nov 17, 2021

Performance with batching implemented

Environment

  • on dev machine
  • OOTB Postgres containers with no customisation
  • Docker has 10 cores available
  • all DBs inside Docker, Recce inside Docker (local Docker image build, no Gradle)

Configuration

batchSize = 1000
batchConcurrency = 5 

"source-only" rows = 50000
"target-only" rows = 50000
"both" mismatched rows = 50000
"all" rows = 150000 (50,000 in source only, 50,0000 in target only, 50,000 in both)

Modifying the datasetId in the command below for the big-table-postgres scenario:
hey -c 1 -n 10 -t 3600 -m POST -d '{ "datasetId": "all" }' -T 'application/json' http://localhost:8080/runs

Reporting 75th percentile results:

Dataset | Before batching | After batching
--- | --- | ---
source rows only | 75% in 42s | 75% in 8.3s
target rows only | 75% in 78s | 75% in 16s
both | 75% in 112s | 75% in 45.5s
all | 75% in 192s | 75% in 61s

Not done, to be looked at later

  • fetchSize from the source/target DBs seemed to make no effective difference to timings, so it was not added at this stage.
    • It is possible that the relevant R2DBC drivers have a different way of interacting with the underlying DB, so perhaps this is not so relevant in an R2DBC world where the streaming can offset any round-trip latency?
    • Alternatively, it is possible this is more relevant with a DB remote to Recce, where latency is more of a factor. This should be looked at later when testing in a more representative environment.
  • batchSize was not specifically tuned to find an optimal size, since the testing was all done locally. Again, such tuning is better done against a real deployment of a DB in a given cloud environment, or similar. 1000 rows was just a guess; it was a bit faster than 100 rows, but more tuning is necessary for a given "real" environment.
  • batchConcurrency was mainly tuned to avoid R2DBC Recce DB connection exhaustion. More testing can/should be done:
    • Locally, this tends to push load to the DB during the "target update" stage, with a concurrency of 5 leading to 500% CPU usage on the DB container.
    • with parallel recs running, to ensure connection exhaustion can be managed (possibly there need to be limits on how many recs run in parallel?)
  • The slowest part is the "find rows already loaded from source, then update based on target" step. Since in a "well reconciled" source/target combination this should constitute 100% of the rows, it seems focusing on performance here is important. It might be that the Recce DB reconciliation_record design should be re-examined to avoid actually having to find+update existing rows. Will summarise that idea in another card.
