Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unasyncify Codegen #99

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
236 changes: 236 additions & 0 deletions draft/unasyncify-codegen.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,236 @@
======================
DEP XXXX: Unasyncify Codegen
======================

:DEP: XXXX
:Author: Raphael Gaschignard
:Implementation Team: Raphael Gaschignard
:Shepherd: TBD
:Status: Draft
:Type: Feature
:Created: 2024-10-26
:Last-Modified: 2014-10-26

.. contents:: Table of Contents
:depth: 3
:local:

Abstract
========

This DEP proposes to add code generation tooling (Unasyncify codegen) to Django as a way to implement DEP 0009 without duplicating code.

Unasyncify codegen takes an async function and transforms it into a synchronous variant. This involves "eraising" ``await`` s, making sure to call sync versions of async functions found in the function body, and other syntactic-level transformations as needed.

The code generation lets us have one canonical implmentation (the async one) for APIs using this code generation, while still being able to easily add code that is only used in the async or sync variants.

The codegen would be run by Django developers and checked into the codebase. This would allow for complete visibility into what the tool does, and ensure that users of Django are not presented with obfuscated code when looking at internal calls.

Motivation
==========

In order to implement DEP 0009, we want to implement asynchronous APIs for various synchronous APIs. Generally speaking, the color problem means that implementing async APIs and sync APIs involve writing extremely similar-looking code twice.

Code duplication is error-prone, because one has to be very confident that there is no implementation drift between each variant. Having a single canonical version would be helpful!

But if the sync version of ORM APIs are the canonical ones, we cannot take full advantage of what ``async`` offers. Inversely, if the async version of ORM APIs were considered canonical, then sync ORM APIs would likely have to pay a serious performance cost (through something like ``async_to_sync``).

Psycopg is another project facing a similar difficulty for implementing both synchronous and asynchronous APIs without having either side pay a penalty for the other existing.
`Their strategy <https://www.psycopg.org/articles/2024/09/23/async-to-sync/>`_ was to treat the asynchronous version as canonical, and to generate the code for the synchronous version from the async source.
There is no runtime cost for the sync code relative to the async code, but there is only one implementation that needs to be maintained.

Progress on DEP 0009 has meant that async variants of APIs have a certain naming convention. This naming convention, combined with the straightforward nature of async-to-sync generation, makes Django's code base amenable to a purely syntactic transformation.

Specification
=============
A new module is added, ``django.utils.codegen``. Inside it are several decorators:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this to be a public API that third party modules can use? That would be an interesting extension in the future, but perhaps initially we could move the generate_ functions into a ._unasyncify sub-module to mark them as private, and keep the from_codegen decorator inside the .codegen module?

Moreover, there may be future codegen-related things we add, so moving it to a specific sub-module can keep things cleaner?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • My intention is for this module to be for Django internal usage, and did not mean to imply this was meant to be consumed by third parties
  • I am comfortable with generate_ functions being in a separate ._unasyncify module. I have no strong opinion with having from_codegen being in .codegen or in ._unasyncify
  • I am also comfortable with django.utils._codegen being the namespace here in general

(To further elaborate my feelings regarding third party consumption: I think third party availability should occur after at least two Django internals people work on a third party lib and think "I really want the async codegen from Django". Don't want to couple these internals to publicly facing behavior until we are very confident in this being right)


* ``generate_unasynced(*, async_unsafe=False)``, a decorator which marks functions to be transformed
* ``generate_unasynced_test()``, a decorator which marks test functions to be transformed
* ``from_codegen``, a decorator that marks functions that were generated through unasync codegen.

This module also includes the following:

* ``ASYNC_TRUTH_MARKER``
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is a bit of a mouthful and isn’t super descriptive. Maybe ASYNC_CODEGEN would be better?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am comfortable with ASYNC_CODEGENas a name (though I feel like it's a bit odd... IN_TYPING is clearly about being in the typing context, ASYNC_CODEGEN doesn't feel like that to me).

Maybe ASYNC_CONTEXT? Though context has become a loaded term. IN_ASYNC_VARIANT would be another proposal that feels right to me (I have been using "variant" a lot to describe this problem space)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASYNC_CONTEXT is pretty good - context is overloaded, but I think it would be pretty obvious from the context (😅).

In the vein of IS_TYPING: why not just IS_ASYNC? IN_ASYNC_VARIANT and ASYNC_CODEGEN may be more literally correct, but I think what we’re looking for here is not necessarily the most dictionary-correct term but a term that has a good “developer experience”: something as short as possible that contains the keyword ASYNC, allowing it to “fade into the background” and not cause too much noise.

This is ``True``. But during codegen this gets replaced with ``False``. This allows us to easily mix async and sync "branches" within the same implementation for the times we need it.
This is a bit similar to ``typing.TYPE_CHECKING``.

A new script is added, ``scripts/run_codegen.sh``, that scans Django's source tree for functions that need unasync codegen applied.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this primarily Python module/command rather than a shell script? This would be more extensible and might allow us to automatically run it when executing the test suite during development to have a better iteration loop?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be quite straightforward to make this either a Django management command (though then it would show up in management command lists....) or just a plain-old argparse-y script.

Given libCST itself is Python I'm sure I can just call directly to the library's own tooling rather than any subprocess shenanigans


A function annotated with ``@generate_unasynced()`` is marked to have the unasync codegen applied to it.

* The annotated function must be an ``async def``'d function
* Its name must follow Django's "async variant" naming convention, either beginning with ``a`` or, in the case of internal functions, ``_a``

Unasync codegen will add a new function to the same block, above the existing function implementation.


The generated function has the following changes:

* This function is named as the "sync variant", removing the ``a``
``aconnect`` becomes ``connect``
``_ainternal_method`` becomes ``_internal_method``
* The ``@generate_unasynced`` decorator is removed
* A ``@from_codegen`` decorator is added. This both indicates to the reader of code that this is from code generation, and lets code generation be idempotent
* The generated function is synchronous (``def foo``)
* Certain names are replaced in the body, whether they are used as identifiers or attributes.
Currently the mapping is as follows:

* ``aconnection`` -> ``connection``
(This allows for simple remapping of ``self.aconnection`` to ``self.connection``)
* ``ASYNC_TRUTH_MARKER`` -> ``False``
(``ASYNC_TRUTH_MARKER is True``, but we can use this marker to have code know if it is in the async or sync variant of a function)
* Other names might be added based on needs we discover working on this.
* ``if ASYNC_TRUTH_MARKER`` blocks are flattened in the sync variant
Concreteley, this means that the following::

if ASYNC_TRUTH_MARKER:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make it an error if the conditional is anything except checking this attribute (I.e. if … and something_else), or are there any use cases for this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe it's worth worrying about trying to figure out "odd" statements (especially given that when you start doing things like generating if False and something_else you'll get into lint-capturing territory), unless we identify things that are both clearly "wrong" and also things that we would be susceptible to accidentally do.

do_thing_a()
else:
do_thing_b()

Gets flattened to just::

do_thing_b()


* ``await``'ed expressions are replaced with non-await versions:

* Within an ``await`` 'ed expression, function calls are examined to see if their name starts with `a`. If so, we replace this with function calls without the `a`

Concretely, this means that ``await objects.aget(foo=bar)`` will get transformed to ``objects.get(foo=bar)``.
This transformation only happens inside of expressions within an ``await``, so something like ``my_dict.add(foo=bar)`` *will not* be transformed.

This transformation also only looks at function calls, and not attributes. This is why we also have a separate transformation to handle rewriting ``aconnection`` to ``connection``.

This does mean that ``await objects.aget(foo=obj.afunc())`` would get transformed to ``objects.get(foo=obj.func())`` (note the change from ``afunc`` to ``func``).
One can avoid this by extracting the call::

result = obj.afunc()
await objects.get(foo=result)

Or one can also use something like ``getattr``::

await objects.get(foo=getattr(obj, 'afunc')())

This sort of workaround is sufficient to avoid having to have any more complicated "opt out of function renaming" issues. See the Rationale section for a note on this function renaming choice.

* ``async for`` loops are replaced with ``for`` loops
* ``async with`` blocks are replaced with ``with`` blocks
* If, inside the ``generate_unasynced`` decorator, we have specified ``async_unsafe=True``, then the generated function will have ``@async_unsafe`` applied to it as well.


What follows is a concrete example of what the transformation generates.

Given the following::

@generate_unasynced(async_unsafe=True)
async def aconnect(self):
"""Connect to the database. Assume that the connection is closed."""
# Check for invalid configurations.
self._pre_connect()
if ASYNC_TRUTH_MARKER:
# Establish the connection
conn_params = self.get_connection_params(for_async=True)
else:
# Establish the connection
conn_params = self.get_connection_params()
self.aconnection = await self.aget_new_connection(conn_params)
await self.aset_autocommit(self.settings_dict["AUTOCOMMIT"])
await self.ainit_connection_state()
connection_created.send(sender=self.__class__, connection=self)

self.run_on_commit = []

The following is added *above the ``aconnect``* definition::

@from_codegen
@async_unsafe
def connect(self):
"""Connect to the database. Assume that the connection is closed."""
# Check for invalid configurations.
self._pre_connect()
# Establish the connection
conn_params = self.get_connection_params()
self.connection = self.get_new_connection(conn_params)
self.set_autocommit(self.settings_dict["AUTOCOMMIT"])
self.init_connection_state()
connection_created.send(sender=self.__class__, connection=self)

self.run_on_commit = []


By running the ``scripts/run_codegen.sh`` script, Django's source tree is scanned for functions with the ``generate_unasync`` decorator, and will rewrite files with that decorator applied according to the above rules.

Developer Flow Changes
======================

With this change, async functions annotated with ``@generate_unasyncify`` will be considered the "canonical" versions, wheras generally (at the time of this writing) the synchronous version has been the canonical version.

Because of this, developers will need to make sure they make changes to the asynchronous versions of functions annotated with ``@generate_unasyncify``, and not make manual changes to functions with ``@from_codegen`` applied.

An added step in CI will make sure that unasyncify codegen is applied. This also will help capture whether manual changes to the synchronous versions are unintentionally committed.

Developers working on annotated code will need to run ``scripts/run_codegen.sh`` and commit changes from this codegen. This has the added benefit of reviewing the result of the codegen, and supervising that the transformation matches what we want.

Transformation Of Test Code
===========================

Tests can be annotated with ``generate_unasynced_test``. The only difference between this and ``@generate_unasynced`` is the naming conventions.

A test function uses the async variant name as the "canonical" name, and adds the ``_sync`` suffix to the sync variant. So ``test_property`` will be transformed to ``test_property_sync``.

An example::

@from_codegen
async def test_thing_sync(self):
self.assertEqual(Model.objects.count(), 0)

@generate_unasynced_test()
async def test_thing(self):
self.assertEqual(await Model.objects.acount(), 0)


Rationale
=========

Factoring out everything but the code flow in a way to minimize code duplication, while doable in a case-by-case way, ultimately means that code would need to be concerned more with async/sync compatibility than with readability as a whole. Factoring out small fragments of code for the sake of async/sync compatibility will make it harder to spot other issues. And even beyond that, the simple act of trying to keep function signatures in sync could lead to issues.

Run-time trickery to try and have a single implementation for both variants bring up the performance question. Load-time transformation of a single implementation would be costly.

Code generation is, fundamentally, legible. Though developers aren't directly writing the generated functions, the results will show up in code review, will be diffed against existing implementations, and won't be obfuscated when looked at by users of Django.

For the specific choice of function call renaming by looking at the name: Using ``a``-prefixed names as a proxy for "async variants of sync APIs" works unreasonably well based on Django's code base. It prevents having to generate a whole list of functions and lets the code transformation remain purely syntactic.

But importantly, this transformation (that happens *only in ``await`` expressions* and *only on names that are being called*) is legible. One can see the transformation happen (because it is checked in), and if someone identifies this issue, they can apply a workaround. Unlike any runtime routing, issues downstream of this rewrite will be visible immediately.

This codegen is aimed at supporting Django's efforts at maintaining these APIs, so we can rely on Django's specific naming conventions. It is not aimed at supporting other project's efforts at maintaining async and sync variants.

Backwards Compatibility
=======================

Because the annotation and transformations associated to it are opt-in, there are no backwards compatibility concerns. Discussion of handling backwards compatibility related to implementing DEP 0009 are out of scope, in the author's opinion, though very important.

Reference Implementation
========================

`This pull request <https://github.com/fcurella/django/pull/4>`_ includes an implementaiton of code generation to move from having sync and async implementations of functions handling database cursors, to a single async implementation (with the sync implementation being derived through code generation).

It does not include the test function generation.

This code generation uses `libCST <https://libcst.readthedocs.io/en/latest/index.html>`_, which allows for code transformations that in particular preserve comments and whitespace layouts.
This implementation was done in a couple of hours, almost entirely thanks to the existence of ``libCST``. The simplicity of the implementation should be an indicator of the feasibility.

Usage of libCST
===============

Usage of ``ast`` directly does not preserve comments (although it does preserve docstrings). ``libCST`` does this, as well as offering "management command"-like functionality to syntax transformations, sidestepping design questions involved in how an AST transformation tool should be designed.

While one could get away with not having line comments inside of functions who are targets for transformations, simply passing in an entire file to ``ast`` for transformation would remove comments from the rest of the file as well! There would need to be some targetted processing of functions to be able to apply only the changes we are interested in.

If preserving comments within functions that are targetted for code generation is a goal, then this strategybecomes unfeasible without ``libCST``.

if ``libCST`` is used, it would only be a requirement for people looking to make changes to annotated functions, in the same way that `gettext` is required only for people looking to make changes to text for translation.

.. rubric:: Footnotes
.. [#color-problem] shortly: I can call sync functions from async functions but not async functions from sync ones. Idea originating from `This blog post <https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/>`_.