Fix DAG processor crash on MySQL connection failure during import error recording #59167
AmosG wants to merge 1 commit into apache:main
…or recording

The DAG processor was crashing when MySQL connection failures occurred while recording DAG import errors to the database. The root cause was missing session.rollback() calls after caught exceptions, leaving the SQLAlchemy session in an invalid state. When session.flush() was subsequently called, it would raise a new exception that wasn't caught, causing the DAG processor to crash and enter restart loops.

This issue was observed in production environments where the DAG processor would restart 1,259 times in 4 days (~13 restarts/hour), leading to:
- Connection pool exhaustion
- Cascading failures across Airflow components
- Import errors not being recorded in the UI
- System instability

Changes:
- Add session.rollback() after caught exceptions in _update_import_errors()
- Add session.rollback() after caught exceptions in _update_dag_warnings()
- Wrap session.flush() in try-except with session.rollback() on failure
- Add comprehensive unit tests for all failure scenarios
- Update comments to clarify error handling behavior

The fix ensures the DAG processor gracefully handles database connection failures and continues processing other DAGs instead of crashing.

Thanks. Nice one.

From https://docs.sqlalchemy.org/en/20/orm/session_basics.html#flushing

Maybe the root cause of the MySQL connection failure is #56879.

Totally agree, @wjddn279.

Yeah. Worth fixing it with gc freezing, I think.
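
For readers unfamiliar with the term, here is a minimal sketch of what "gc freezing" could look like, assuming the comment refers to Python's standard `gc.freeze()` call made in a parent process before it forks workers. The helper names `prepare_parent_for_fork` and `restore_gc_in_child` are hypothetical and are not taken from Airflow or from #56879; the sketch does not fork, it only shows the calls involved.

```python
import gc


def prepare_parent_for_fork() -> int:
    """Hypothetical helper: call in the parent right before forking.

    gc.freeze() moves every object currently tracked by the collector into
    a permanent generation, so later collections (including those in forked
    children) ignore those objects.
    """
    gc.disable()  # avoid a collection between freezing and forking
    gc.freeze()
    return gc.get_freeze_count()  # number of objects now frozen


def restore_gc_in_child() -> None:
    """Hypothetical helper: call early in the child to resume collection."""
    gc.enable()
```

The ordering (disable, freeze, fork, enable in the child) follows the usage note in the Python `gc` module documentation. Whether it addresses the connection failures discussed here is for the linked issue to establish.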

Do we know if the

Fix from #60505 merged. @john-rodriguez-mgni, you seem to be very eager to find out when things are released. I hope you will equally eagerly apply the test to your case and test the RCs when they are out (subscribe to the devlist to find out).

So are we expecting this to be solved in 3.1.7? I think I am also seeing this issue on the api-server, where it causes an endless loop and I am unable to log in.

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

Fix DAG processor crash on MySQL connection failure during import error recording
Fixes #59166
The DAG processor was crashing when MySQL connection failures occurred while
recording DAG import errors to the database. The root cause was missing
session.rollback() calls after caught exceptions, leaving the SQLAlchemy
session in an invalid state. When session.flush() was subsequently called,
it would raise a new exception that wasn't caught, causing the DAG processor
to crash and enter restart loops.
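
The failure mode described above can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the Airflow code: `FakeSession` is a hypothetical class that only imitates the relevant behaviour (a session that refuses further work until `rollback()` is called after a failed operation), and both `record_*` functions are illustrative names.

```python
class FakeSession:
    """Hypothetical stand-in for a database session (not the SQLAlchemy API)."""

    def __init__(self):
        self.fail_next = False       # simulate a dropped connection on next add()
        self.needs_rollback = False  # True while the session is in an invalid state
        self.rollbacks = 0
        self.rows = []

    def add(self, row):
        if self.fail_next:
            self.fail_next = False
            self.needs_rollback = True
            raise ConnectionError("simulated lost MySQL connection")
        self.rows.append(row)

    def flush(self):
        if self.needs_rollback:
            raise RuntimeError("session must be rolled back before reuse")

    def rollback(self):
        self.needs_rollback = False
        self.rollbacks += 1


def record_without_rollback(session, row):
    """Buggy shape: the error is caught, but the session is left invalid."""
    try:
        session.add(row)
    except ConnectionError:
        pass
    session.flush()  # raises a second, uncaught exception


def record_with_rollback(session, row):
    """Fixed shape: roll back after every caught failure, including flush."""
    try:
        session.add(row)
    except ConnectionError:
        session.rollback()
    try:
        session.flush()
    except Exception:
        session.rollback()
```

With `fail_next` set, `record_without_rollback` lets the second exception escape to the caller, which corresponds to the crash; `record_with_rollback` returns normally and leaves the session usable.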
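This issue was observed in production environments where the DAG processor
restarted 1,259 times in 4 days (~13 restarts/hour), leading to:
- Connection pool exhaustion
- Cascading failures across Airflow components
- Import errors not being recorded in the UI
- System instability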
Changes
- Add session.rollback() after caught exceptions in _update_import_errors()
- Add session.rollback() after caught exceptions in _update_dag_warnings()
- Wrap session.flush() in try-except with session.rollback() on failure

Testing
Added 5 new unit tests in the TestDagProcessorCrashFix class:
- test_update_dag_parsing_results_handles_db_failure_gracefully
- test_update_dag_parsing_results_handles_dag_warnings_db_failure_gracefully
- test_update_dag_parsing_results_handles_session_flush_failure_gracefully
- test_session_rollback_called_on_import_errors_failure
- test_session_rollback_called_on_dag_warnings_failure

All tests pass and verify that:
- session.rollback() is called correctly on failures

Impact
The fix ensures the DAG processor gracefully handles database connection
failures and continues processing other DAGs instead of crashing, preventing
production outages from restart loops.
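
As a rough illustration of how rollback-on-failure tests like those listed under Testing might be written, here is a sketch using only `unittest.mock`. The function `update_import_errors` is a hypothetical simplification, not the `_update_import_errors()` from the PR, and the test names are illustrative rather than copied from `TestDagProcessorCrashFix`.

```python
from unittest import mock


def update_import_errors(session, import_errors):
    """Hypothetical simplification: record each error, rolling back on failure."""
    recorded = []
    for filename, stacktrace in import_errors.items():
        try:
            session.add((filename, stacktrace))
            recorded.append(filename)
        except Exception:
            session.rollback()  # reset the session, then move on to the next file
    try:
        session.flush()
    except Exception:
        session.rollback()
    return recorded


def test_rollback_called_and_processing_continues():
    session = mock.MagicMock()
    # The first add() fails as if the connection dropped; the second succeeds.
    session.add.side_effect = [ConnectionError("lost connection"), None]
    recorded = update_import_errors(session, {"a.py": "tb a", "b.py": "tb b"})
    session.rollback.assert_called_once()
    assert recorded == ["b.py"]


def test_flush_failure_is_rolled_back():
    session = mock.MagicMock()
    session.flush.side_effect = ConnectionError("lost connection")
    update_import_errors(session, {"a.py": "tb a"})  # must not raise
    session.rollback.assert_called_once()
```

Mocking the session keeps the tests independent of a live MySQL server; the trade-off is that they check the call pattern only, not real driver behaviour.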