Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(duckdb): return null typed pyarrow arrays and disable creating tables with all null columns in duckdb #9810

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Aug 9, 2024

Address all-NULL column handling in the DuckDB backend. null pyarrow Arrays are returned (previously int32), and it is now an error on the Ibis side to create all null columns with create_table in the DuckDB backend. Fixes #9669.

@cpcloud cpcloud added bug Incorrect behavior inside of ibis datatypes Issues relating to ibis's datatypes (under `ibis.expr.datatypes`) duckdb The DuckDB backend labels Aug 9, 2024
Copy link
Contributor

github-actions bot commented Aug 9, 2024

ACTION NEEDED

Ibis follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message.

Please update your PR title and description to match the specification.

@cpcloud cpcloud changed the title fix(duckdb): return null typed pyarrow arrays and disable creating tables with all null columns in duckdb. fix(duckdb): return null typed pyarrow arrays and disable creating tables with all null columns in duckdb Aug 9, 2024
@cpcloud cpcloud requested review from gforsyth and jcrist August 9, 2024 13:30
@cpcloud cpcloud force-pushed the all-nulls branch 2 times, most recently from 8f42f56 to 3784885 Compare August 9, 2024 16:56
@ncclementi
Copy link
Contributor

@cpcloud The flink backend is failing, it doesn't like the null type column. I got it to xfail by adding something similar to what you added in _register_in_memory_table() for other backends, in the execute() for flink, although maybe it could go higher up in the to_pyarrow() method. Looks like for flink we don't have the _register_in_memory_table() method.

In backends/flink/__init__.py :

    def execute(self, expr: ir.Expr, **kwargs: Any) -> Any:
        """Execute an expression."""
        self._register_udfs(expr)

        table_expr = expr.as_table()

        if null_columns := table_expr.op().schema.null_fields:
            raise exc.IbisTypeError(
                f"{self.name} cannot yet reliably handle `null` typed columns; "
                f"got null typed columns: {null_columns}")

        sql = self.compile(table_expr, **kwargs)
        df = self._table_env.sql_query(sql).to_pandas()

@NickCrews
Copy link
Contributor

@cpcloud im motivated to help this one across the finish line. If I fix the flink tests is this good to merge? Any thoughts regarding my incline comments?

@cpcloud cpcloud added this to the 10.0 milestone Sep 23, 2024
@github-actions github-actions bot added tests Issues or PRs related to tests impala The Apache Impala backend postgres The PostgreSQL backend clickhouse The ClickHouse backend mysql The MySQL backend mssql The Microsoft SQL Server backend trino The Trino backend oracle The Oracle backend flink Issues or PRs related to Flink exasol Issues related to the exasol backend risingwave The RisingWave backend labels Sep 23, 2024
Copy link
Contributor

@NickCrews NickCrews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few thoughts, but thank you very much for this fix! I would love to help push this across the finish line.

  • If you point me to what you think needs to happen with flink, I can help
  • if you give a thumbs up to my suggestions, I can implement them

@@ -67,6 +67,10 @@ def types(self):
def geospatial(self) -> tuple[str, ...]:
return tuple(name for name, typ in self.fields.items() if typ.is_geospatial())

@attribute
def null_fields(self) -> tuple[str, ...]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for the purpose of this PR we need to recurse into nested types, eg if a structure field is Nulltype we need to deal with that too...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done in a follow-up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the followup, the semantics of this are going to need to change. Since this is internal-only, I think it is fine for this churn to happen, but just writing this down for when I/we implement it.

We use this for two things:

  1. To check whether we need to do the pyarrow fixup for duckdb, eg if any field or sub-field is null.
  2. To provide nice error messages when trying to create_table() a table-with-null-types, eg
        if null_columns := schema.null_fields:
            raise com.IbisTypeError(
                f"{self.name} cannot yet reliably handle `null` typed columns; "
                f"got null typed columns: {null_columns}"

OK, for these two use cases, consider:

ibis.Schema(
	{
		"s": "struct<n: null>",
		"a": "array<null>",
		"m1": "map<string: null>",
		"m2": "map<null: string>",
		"n": "null"
	}
).null_fields

What should this be? Could return something ibis-internal that isn't really machine-interpretable, and only for this error message, that is like jq's DSL, like ("s.n", "a<items>", "m1<values>", "m2<keys>", "n")? I think that would suffice for our two needs.

reason="unable to handle null typed columns as input",
)
def test_all_null_column(con):
t = ibis.memtable({"a": [None]})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a struct<x: null> and an array column to check for these nested types as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a follow up.

ibis/backends/duckdb/converter.py Show resolved Hide resolved
@cpcloud
Copy link
Member Author

cpcloud commented Oct 22, 2024

@NickCrews Can you give this another review pass and/or approve?

@@ -27,6 +28,7 @@

pd = pytest.importorskip("pandas")
pa = pytest.importorskip("pyarrow")
pat = pytest.importorskip("pyarrow.types")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be safe to just do import pyarrow.types as pat, right? If that gets us better IDE/typing support, I would advocate for that, even if it doesn't match the pa = pytest.importorskip("pyarrow") lines above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that's a nit, also fine keeping it as is

@NickCrews
Copy link
Contributor

NickCrews commented Oct 22, 2024

@cpcloud I tacked on two commits that I think are improvements, if those pass CI and you are happy, looks great to me to merge. I can then, in a followup PR, tackle adding in the nested null support that I describe in this comment. Thanks!

@NickCrews
Copy link
Contributor

Actually, hold on, we aren't converting null-typed scalars correctly. Pushing up a fixup commit soon.

@NickCrews
Copy link
Contributor

OK, with that I think I'm happy.

@@ -420,3 +421,24 @@ def test_memtable_doesnt_leak(con, monkeypatch):
df = ibis.memtable({"a": [1, 2, 3]}, name=name).execute()
assert name not in con.list_tables()
assert len(df) == 3


def test_create_table_with_nulls(con):
Copy link
Contributor

@NickCrews NickCrews Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create null-typed columns in other backends?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis clickhouse The ClickHouse backend datatypes Issues relating to ibis's datatypes (under `ibis.expr.datatypes`) duckdb The DuckDB backend exasol Issues related to the exasol backend flink Issues or PRs related to Flink impala The Apache Impala backend mssql The Microsoft SQL Server backend mysql The MySQL backend oracle The Oracle backend postgres The PostgreSQL backend risingwave The RisingWave backend tests Issues or PRs related to tests trino The Trino backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: Errors with NULL columns
3 participants