fix(duckdb): return null typed pyarrow arrays and disable creating tables with all null columns in duckdb #9810
ACTION NEEDED: Ibis follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
8f42f56
to
3784885
@cpcloud The flink backend is failing, it doesn't like the null type column. I got it to xfail by adding something similar to what you added in In
    def execute(self, expr: ir.Expr, **kwargs: Any) -> Any:
        """Execute an expression."""
        self._register_udfs(expr)
        table_expr = expr.as_table()
        if null_columns := table_expr.op().schema.null_fields:
            raise exc.IbisTypeError(
                f"{self.name} cannot yet reliably handle `null` typed columns; "
                f"got null typed columns: {null_columns}")
        sql = self.compile(table_expr, **kwargs)
        df = self._table_env.sql_query(sql).to_pandas()
@cpcloud I'm motivated to help get this one across the finish line. If I fix the flink tests, is this good to merge? Any thoughts regarding my inline comments?
A few thoughts, but thank you very much for this fix! I would love to help push this across the finish line.
- If you point me to what you think needs to happen with flink, I can help
- If you give a thumbs up to my suggestions, I can implement them
@@ -67,6 +67,10 @@ def types(self):
    def geospatial(self) -> tuple[str, ...]:
        return tuple(name for name, typ in self.fields.items() if typ.is_geospatial())

    @attribute
    def null_fields(self) -> tuple[str, ...]:
I think for the purpose of this PR we need to recurse into nested types, e.g. if a struct field is null-typed we need to deal with that too.
This should be done in a follow-up.
In the follow-up, the semantics of this are going to need to change. Since this is internal-only, I think it is fine for this churn to happen, but I'm writing this down for when I/we implement it.
We use this for two things:
- To check whether we need to do the pyarrow fixup for duckdb, e.g. if any field or sub-field is null.
- To provide nice error messages when trying to create_table() a table with null types, e.g.
    if null_columns := schema.null_fields:
        raise com.IbisTypeError(
            f"{self.name} cannot yet reliably handle `null` typed columns; "
            f"got null typed columns: {null_columns}"
        )
OK, for these two use cases, consider:
    ibis.Schema(
        {
            "s": "struct<n: null>",
            "a": "array<null>",
            "m1": "map<string: null>",
            "m2": "map<null: string>",
            "n": "null"
        }
    ).null_fields
What should this be? It could return something ibis-internal that isn't really machine-interpretable, used only for this error message, in the style of jq's DSL, like ("s.n", "a<items>", "m1<values>", "m2<keys>", "n"). I think that would suffice for our two needs.
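For concreteness, here is a rough sketch of how the recursive version could produce those path strings. It is self-contained and deliberately does not use ibis: the `Null`, `String`, `Array`, `Map`, and `Struct` classes and the `null_paths` and `null_fields` helpers are hypothetical stand-ins, not the real ibis datatypes or the actual `Schema.null_fields` implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator


# Hypothetical stand-ins for ibis datatypes (NOT the real ibis classes).
@dataclass
class Null:
    pass


@dataclass
class String:
    pass


@dataclass
class Array:
    value_type: object


@dataclass
class Map:
    key_type: object
    value_type: object


@dataclass
class Struct:
    fields: dict


def null_paths(dtype: object, prefix: str) -> Iterator[str]:
    """Yield a jq-like path for every null type found at or below `dtype`."""
    if isinstance(dtype, Null):
        yield prefix
    elif isinstance(dtype, Struct):
        for name, typ in dtype.fields.items():
            yield from null_paths(typ, f"{prefix}.{name}")
    elif isinstance(dtype, Array):
        yield from null_paths(dtype.value_type, f"{prefix}<items>")
    elif isinstance(dtype, Map):
        yield from null_paths(dtype.key_type, f"{prefix}<keys>")
        yield from null_paths(dtype.value_type, f"{prefix}<values>")
    # Any other type has no nested types here, so it yields nothing.


def null_fields(schema: dict) -> tuple[str, ...]:
    """Collect null paths across all columns, in column order."""
    return tuple(
        path for name, typ in schema.items() for path in null_paths(typ, name)
    )


schema = {
    "s": Struct({"n": Null()}),
    "a": Array(Null()),
    "m1": Map(String(), Null()),
    "m2": Map(Null(), String()),
    "n": Null(),
}
print(null_fields(schema))
```

A non-empty result would cover both uses: it signals that the pyarrow fixup is needed, and the paths can be dropped straight into the error message.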
    reason="unable to handle null typed columns as input",
)
def test_all_null_column(con):
    t = ibis.memtable({"a": [None]})
Can we add a struct<x: null> and an array column to check for these nested types as well?
In a follow up.
@NickCrews Can you give this another review pass and/or approve?
@@ -27,6 +28,7 @@

pd = pytest.importorskip("pandas")
pa = pytest.importorskip("pyarrow")
pat = pytest.importorskip("pyarrow.types")
It would be safe to just do import pyarrow.types as pat, right? If that gets us better IDE/typing support, I would advocate for that, even if it doesn't match the pa = pytest.importorskip("pyarrow") lines above.
but that's a nit, also fine keeping it as is
@cpcloud I tacked on two commits that I think are improvements. If those pass CI and you are happy, this looks great to me to merge. I can then, in a follow-up PR, tackle adding the nested null support that I describe in this comment. Thanks!
Actually, hold on, we aren't converting null-typed scalars correctly. Pushing up a fixup commit soon.
OK, with that I think I'm happy.
@@ -420,3 +421,24 @@ def test_memtable_doesnt_leak(con, monkeypatch):
    df = ibis.memtable({"a": [1, 2, 3]}, name=name).execute()
    assert name not in con.list_tables()
    assert len(df) == 3


def test_create_table_with_nulls(con):
Can you create null-typed columns in other backends?
Address all-NULL column handling in the DuckDB backend. null pyarrow Arrays are returned (previously int32), and it is now an error on the Ibis side to create all-null columns with create_table in the DuckDB backend. Fixes #9669.