Skip to content

Remove global from task_runner supervisor-comms#59876

Open
jscheffl wants to merge 11 commits intoapache:mainfrom
jscheffl:bugfix/remove-global-from-task-runner-supervisor-comms
Open

Remove global from task_runner supervisor-comms#59876
jscheffl wants to merge 11 commits intoapache:mainfrom
jscheffl:bugfix/remove-global-from-task-runner-supervisor-comms

Conversation

@jscheffl
Copy link
Contributor

@jscheffl jscheffl commented Dec 28, 2025

Another small (in this case rather medium complex) increment to remove global statements for PR #58116

This removes 2 global statements from task_runner.py where explicitly a global wariable was used as shared SUPERVISOR_COMMS by intent. Proposing to change with via a static class and accessor-methods to prevent usage of global variables.

global is evil.

For this PR to merge seeking for explicit approval from one of the Task SDK creators @ashb, @kaxil and/or @amoghrajesh

@jscheffl jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from d1fdaaa to 4bb5081 Compare December 29, 2025 08:44
@jscheffl jscheffl added full tests needed We need to run full set of tests for this PR to merge all versions If set, the CI build will be forced to use all versions of Python/K8S/DBs labels Dec 29, 2025
@jscheffl jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from 4bb5081 to 780ce42 Compare December 29, 2025 12:58
@jscheffl jscheffl marked this pull request as ready for review December 29, 2025 12:58
Copy link
Member

@kaxil kaxil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo globals can be bad, but this specific use case is one of the legitimate ones, in fact better than the approach without globals.

def supervisor_comms() -> CommsDecoder[ToTask, ToSupervisor]:
    return _SupervisorCommsHolder.comms

But then violates it in supervisor.py:1669:

task_runner._SupervisorCommsHolder.comms = temp_comms  # 

and every call site becomes verbose:

Before:

SUPERVISOR_COMMS.send(msg)

After (unnecessary function call)

supervisor_comms().send(msg)

Every access now requires a function call + None check instead of direct variable access.

My 2 cents :)

# If this is set it means are in some kind of execution context (Task, Dag Parse or Triggerer perhaps)
# and should use the Task SDK API server path
if hasattr(sys.modules.get("airflow.sdk.execution_time.task_runner"), "SUPERVISOR_COMMS"):
from airflow.sdk.execution_time.task_runner import is_supervisor_comms_initialized
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this used hasattr on sys so we don't need to have airflow.sdk installed on the server components

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I think we should solve it differently, I would say here just checking if "task-sdk" is installed should be a better check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or smth else - but this should be revised after we complete the isolation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, have reverted/adjusted to previous state - as it was repeated Copy&Paste have moved it into a utility in https://github.com/apache/airflow/pull/59876/changes#diff-7694d13e2f87c84d20b0b8b44797bf96d754ae270204217e518082decc74649bR104

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would leave other improvements of detection to another PR.

@potiuk
Copy link
Member

potiuk commented Dec 29, 2025

Every access now requires a function call + None check instead of direct variable access.

I think if we add @cache to def supervisor_comms(), this will be faster and will only require a hash lookup - and then, I think performance is not a concern any more.

I like this pattarn that Jens introduced a lot more to be honest, it is more unit-test friendly IMHO.

Also that removes the None-check for every call. When None is detected in a first call - RuntimeException is thrown and interpreter will exit - so None will be checked only at first use when cache is set.

Copy link
Contributor

@amoghrajesh amoghrajesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be OK with the current pattern we have for the sake of ease of understanding but if we have to change it, I do not think there is a better solution than this one. Would require some getting used to to use the new pattern

@jscheffl
Copy link
Contributor Author

jscheffl commented Dec 30, 2025

and every call site becomes verbose:

Before:

SUPERVISOR_COMMS.send(msg)

After (unnecessary function call)

supervisor_comms().send(msg)

Every access now requires a function call + None check instead of direct variable access.

@kaxil @amoghrajesh fair points and I am also not 100% convinced the solution is perfect. Te usage of the SUPERVISOR_COMMS seems still to be a case where we mess a lot with a global variable and encapsulating the state is still beneficial.
Had another sleep about this and understand that supervisor_comms().send() is also not cool, how about if we change it to:

supervisor_send(data)

...as a shortcut? Would it make it better to handle the sending in the function and have better readable code w/o global?

By the way with @cache I assume execution is slower as the has need to be built and then a lookup must be made which is more expensive than a function pointer lookup and jump. Therefore would not like a @cache solution.

@potiuk
Copy link
Member

potiuk commented Dec 30, 2025

By the way with @cache I assume execution is slower as the has need to be built and then a lookup must be made which is more expensive than a function pointer lookup and jump. Therefore would not like a @cache solution.

I don't think so. Calling a function in Python is generally very slow operation. It's not a classic pointer jump. Functions calls in Python are done by interpreter, they are not using jumps as C programs do. Interpreter has to look-up the method to call, create a new frame, push it on "Python stack" and clean the frame after it returns. This is all done "in the interpreter" - it's not even using the processor stack. "Python stack" for method frames is actually stored in heap memory, not in processor stack - so any stack manipulation (calling and returning from function) is kinda slow.

I did some basic micro-benchmarks:

import time
from functools import lru_cache

class MethodBenchmark:
    def __init__(self):
        self.call_count = 0

    def empty_method(self):
        """Empty method without caching"""
        self.call_count += 1
        return None

    @lru_cache(maxsize=128)
    def cached_empty_method(self):
        """Empty method with caching"""
        return None


def benchmark():
    obj = MethodBenchmark()
    iterations = 1_000_000

    # Benchmark non-cached method
    start = time.perf_counter()
    for _ in range(iterations):
        obj.empty_method()
    non_cached_time = time.perf_counter() - start

    # Benchmark cached method
    start = time.perf_counter()
    for _ in range(iterations):
        obj.cached_empty_method()
    cached_time = time.perf_counter() - start

    # Results
    print(f"Non-cached method: {non_cached_time:.6f} seconds")
    print(f"Cached method: {cached_time:.6f} seconds")
    print(f"Speedup: {non_cached_time / cached_time:.2f}x")
    print(f"Non-cached call count: {obj.call_count}")


if __name__ == "__main__":
    benchmark()

Result with Python 3.10

/Users/jarekpotiuk/code/airflow/.venv/bin/python /Users/jarekpotiuk/Library/Application Support/JetBrains/IntelliJIdea2025.3/scratches/scratch_5.py 
Non-cached method: 0.026673 seconds
Cached method: 0.026702 seconds
Speedup: 1.00x
Non-cached call count: 1000000

Process finished with exit code 0

This means that when you put @cache -> it never runs slower than method call, and additionaly you save on all the executed code inside.

I modified the code of both methods to do single if:

        if self.call_count == 0:
            self.call_count += 1
        else:
            self.call_count += 1

And there are the results:

Non-cached method: 0.032580 seconds
Cached method: 0.026062 seconds
Speedup: 1.25x
Non-cached call count: 1000001

Process finished with exit code 0

(100001 is because cached call increased it by 1)

There are plenty of optimisations in Python 3.11 - 3.14 that might skew this simple example (specializing adaptive interpreter changes and JIT) - so I run it with Python 3.10

Similar discussion: https://stackoverflow.com/questions/14648374/python-function-calls-are-really-slow

@potiuk
Copy link
Member

potiuk commented Dec 30, 2025

supervisor_send(data)

I think that's a good idea, it's likely better to expose "comms actions" than "comm" itself.

@potiuk
Copy link
Member

potiuk commented Dec 30, 2025

BTW. To be perfectly honest, In this case I think performance is not as important (at least until we will start doing the communication very frequently - for example following @dabla optimisation / async task reporting back status of individual async coroutines back to scheduler). The comms overhead for inter-process communication and serialization of data involved (every such call needs to serialize data sent across the wire - even if shared memory is used to communicate between processes) is already likely order of magnitude slower than single method call in Python, so this should not be too much of a concern.

In this case I think we should optimise for readability, I also do not like the extra () needed in the original proposal - that's why supervisor_send(data) is probably best approach.

@amoghrajesh
Copy link
Contributor

In this case I think we should optimise for readability, I also do not like the extra () needed in the original proposal - that's why supervisor_send(data) is probably best approach.

I agree with this. If we have to go down the way of removing globals here, I would prioritise using supervisor_send or supervisor_comms_send

@jscheffl jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from 780ce42 to df71bea Compare December 31, 2025 15:48
@jscheffl
Copy link
Contributor Author

jscheffl commented Dec 31, 2025

Thanks for the feedback, have it now adjusted to use supervisor_send() which is a bit shorter.

Re-review appreciated :-D

(Am surprised actually that I wanted to remove two global statements but actually am slimming down the codebase by 40 LoC now :-D)

@jscheffl jscheffl requested review from ashb and dabla January 11, 2026 19:38
@jscheffl jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from 2df8bfe to 18f93b3 Compare January 15, 2026 22:44
@jscheffl
Copy link
Contributor Author

Up by another 109 commits - can I have another round of feedback @ashb / @amoghrajesh / @kaxil ?

@jscheffl jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from 18f93b3 to c103ddf Compare January 18, 2026 10:41
@jscheffl
Copy link
Contributor Author

Up by another 59 commits

@amoghrajesh
Copy link
Contributor

I'd leave @ashb / @kaxil to be a judge of this one

@dabla
Copy link
Contributor

dabla commented Jan 19, 2026

Reworked PR with the review feedback, especially changed to a Singleton implementation to prevent a static class with static member. Better like this?

I like that refactor.

@jscheffl jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from c103ddf to a5aeb48 Compare January 27, 2026 22:52
@jscheffl
Copy link
Contributor Author

up by 178 commits

@jscheffl jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from a5aeb48 to 66b9a1a Compare January 31, 2026 19:34
@jscheffl jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from 66b9a1a to 9ec7b07 Compare February 7, 2026 18:10
@jscheffl
Copy link
Contributor Author

jscheffl commented Feb 7, 2026

Up by 200 commits

Copy link
Contributor

@dabla dabla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like this refactoring, DRY-principle was applied, even more typing, looks really clean to me.

@ashb
Copy link
Member

ashb commented Feb 13, 2026

Much as @kaxil said in his very first comment, I'm still -1 on this, You've changed one kind of global for another.

before the global was SUPERVISOR_COMMS on the task_runner module.

task_runner.SUPERVISOR_COMMS = comms_decoder

After the global is SupervisorComms._instance.

    SupervisorComms().set_comms(comms_decoder)

This is a change for no difference to my mind. Honestly I don't see the point of the change.

@dabla
Copy link
Contributor

dabla commented Feb 13, 2026

Much as @kaxil said in his very first comment, I'm still -1 on this, You've changed one kind of global for another.

before the global was SUPERVISOR_COMMS on the task_runner module.

task_runner.SUPERVISOR_COMMS = comms_decoder

After the global is SupervisorComms._instance.

    SupervisorComms().set_comms(comms_decoder)

This is a change for no difference to my mind. Honestly I don't see the point of the change.

I'm not that opinionated on that point, so I'm neutral here but I like the other refactorings that have been done:

  • lots of duplicated code lines has been replaced with a re-useable method named is_client_process_context (DRY)
  • even more typing was applied which is always good

So I think this PR has added value, but that's my personal opinion.

@jscheffl
Copy link
Contributor Author

This is a change for no difference to my mind. Honestly I don't see the point of the change.

I tried to follow your request to adjust. I had a static class instance before and went into a singleton pattern. global is evil and we should follow best practices and good standards. I would also follow any further constructive comment on it.

Also please see that compared to SUPERVISOR_COMMS.send() now we have supervisor_send() as call which rads much cleaner. The upper-casing directly jumps into my eyes as something "evil".

Yes, technically a singleton is also a global but with the construct it allows a controlled access and makes clear that access is validated. All calls can trust that the method will fail with a clear error. If the codebase grows we will lose control over time where the global is spread in the context. I saw multiple times when testing changes here that a pytest failed on the global set to None and needed to search where some other test messed-up the global.

I put effort in this because I'd prefer to have a codebase where we are clean and can be proud of, local optimizations with global... yes might be working but leave a smell of not good practice and even if the change is in your view technically neutral it is an improvement to remove two usages of global.

@dabla
Copy link
Contributor

dabla commented Feb 14, 2026

This is a change for no difference to my mind. Honestly I don't see the point of the change.

I tried to follow your request to adjust. I had a static class instance before and went into a singleton pattern. global is evil and we should follow best practices and good standards. I would also follow any further constructive comment on it.

Also please see that compared to SUPERVISOR_COMMS.send() now we have supervisor_send() as call which rads much cleaner. The upper-casing directly jumps into my eyes as something "evil".

Yes, technically a singleton is also a global but with the construct it allows a controlled access and makes clear that access is validated. All calls can trust that the method will fail with a clear error. If the codebase grows we will lose control over time where the global is spread in the context. I saw multiple times when testing changes here that a pytest failed on the global set to None and needed to search where some other test messed-up the global.

I put effort in this because I'd prefer to have a codebase where we are clean and can be proud of, local optimizations with global... yes might be working but leave a smell of not good practice and even if the change is in your view technically neutral it is an improvement to remove two usages of global.

It’s a bit unfortunate that Python doesn’t have a built-in Dependency Injection mechanism comparable to what frameworks like Spring provide in the Java ecosystem. Then components like the CommsSupervisor would be a singleton by default and lifecycle/visibility are explicitly controlled via configuration or annotations in Java which would translate to decorators in Python.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

all versions If set, the CI build will be forced to use all versions of Python/K8S/DBs area:DAG-processing area:task-sdk area:Triggerer full tests needed We need to run full set of tests for this PR to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants