Use statistics in Faker CTAS #24585

Open, 4 commits to be merged into master

@nineinchnick (Member) commented Dec 26, 2024

Description

Use statistics when using CREATE TABLE AS SELECT in the Faker connector to:

  • set the default_limit table property to the estimated number of rows from the source table
  • set the min and max column properties based on the statistics
  • detect high-cardinality integer columns and use sequences for them
  • detect low-cardinality columns and generate dictionaries to select values from
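
A minimal sketch of how the decisions listed above could be driven by statistics. It is not the code from this PR: the `Strategy` enum, the `ColumnStats` record, the `choose` method, and the 0.98 distinct-ratio threshold are all hypothetical names and values made up for illustration. Only the dictionary size limit of 1000 comes from the discussion, where it is described as a configurable default.

```java
import java.util.OptionalDouble;

final class FakerCtasSketch {
    // Hypothetical strategy names, not taken from the connector.
    enum Strategy { SEQUENCE, DICTIONARY, RANGE, RANDOM }

    // Hypothetical, simplified view of per-column statistics.
    // distinctValues may be NaN when the source table has no estimate.
    record ColumnStats(boolean integerType, double distinctValues, OptionalDouble min, OptionalDouble max) {}

    static Strategy choose(ColumnStats stats, double rowCount, long maxDictionarySize, double sequenceMinDistinctRatio) {
        // High-cardinality integer column: nearly every row has its own value.
        if (rowCount > 0 && stats.integerType()
                && stats.distinctValues() / rowCount >= sequenceMinDistinctRatio) {
            return Strategy.SEQUENCE;
        }
        // Low-cardinality column: few enough values to pick from a dictionary.
        if (stats.distinctValues() > 0 && stats.distinctValues() <= maxDictionarySize) {
            return Strategy.DICTIONARY;
        }
        // Otherwise fall back to min/max range constraints when both are known.
        if (stats.min().isPresent() && stats.max().isPresent()) {
            return Strategy.RANGE;
        }
        // No usable statistics (NaN comparisons above are all false).
        return Strategy.RANDOM;
    }
}
```

With these assumed thresholds, an integer key whose distinct count matches the row count maps to `SEQUENCE`, a five-value status column maps to `DICTIONARY`, and a column with unknown statistics falls through to `RANDOM`.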

Additional context and related issues

Previous attempt #24098 was abandoned after #24147 was reported. This time we only use views for sequence columns, and if this is not very useful, we can avoid creating the views automatically. Or this could be yet another column property.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Faker
* Use statistics when using `CREATE TABLE AS SELECT` in the Faker connector. ({issue}`issuenumber`)

@nineinchnick
Member Author

@raunaqmorarka this is the last one, I promise :-)

@nineinchnick
Member Author

@raunaqmorarka and @losipiuk this is ready for a review. It's the last one about Faker, I don't have anything else planned for it.

@nineinchnick
Member Author

@raunaqmorarka @losipiuk a gentle reminder

Comment on lines +506 to +507
properties.put(ALLOWED_VALUES_PROPERTY, columnValues.get(column.name()).stream()
.map(value -> Literal.format(column.type(), value))
.collect(toImmutableList()));
Member

what is property size limit?

Member Author

I don't know, but the maximum number of values is configurable, with a reasonable default of 1000. This might throw an exception for large values, but I don't think we have to prevent that.

When creating a table in the Faker connector from an existing table,
gather column statistics to determine range constraints and set them as
column properties.

When creating a table in the Faker connector from an existing table,
use column statistics to detect low-cardinality columns and generate
their values from a randomly generated set.
@nineinchnick nineinchnick force-pushed the faker-range-constraint-views branch from c40b9fa to c2f5400 Compare January 18, 2025 08:57