fix: remove pool-nuking reconnect_db() from OperationalError retry paths
On OperationalError, sql_update and run_sql_select were calling sql_connect() → reconnect_db() which disposes the entire connection pool mid-flight, killing other in-flight connections under concurrency. Removed the sql_connect() calls; the existing retry blocks already open a fresh engine.connect() context manager, and pool_pre_ping=True handles stale connection detection. Also drops the now-unused sql_connect import. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -11,7 +11,7 @@ from sqlalchemy.exc import IntegrityError, OperationalError, ProgrammingError
|
|||||||
from app.log import log, logger_reset
|
from app.log import log, logger_reset
|
||||||
# CRITICAL: Import the core module to access current global state
|
# CRITICAL: Import the core module to access current global state
|
||||||
from app import lib_sql_core
|
from app import lib_sql_core
|
||||||
from app.lib_sql_core import sql_connect, set_last_sql_error
|
from app.lib_sql_core import set_last_sql_error
|
||||||
|
|
||||||
# log.setLevel(logging.DEBUG) # DEBUG, INFO, WARNING, ERROR, EXCEPTION, CRITICAL
|
# log.setLevel(logging.DEBUG) # DEBUG, INFO, WARNING, ERROR, EXCEPTION, CRITICAL
|
||||||
|
|
||||||
@@ -138,7 +138,6 @@ def sql_update(
|
|||||||
except OperationalError:
|
except OperationalError:
|
||||||
if trans: trans.rollback()
|
if trans: trans.rollback()
|
||||||
log.error('Operational error (gone away?). Retrying once...')
|
log.error('Operational error (gone away?). Retrying once...')
|
||||||
sql_connect()
|
|
||||||
try:
|
try:
|
||||||
with lib_sql_core.engine.connect() as conn:
|
with lib_sql_core.engine.connect() as conn:
|
||||||
trans = conn.begin()
|
trans = conn.begin()
|
||||||
@@ -343,7 +342,6 @@ def run_sql_select(
|
|||||||
return conn.execute(sql, data)
|
return conn.execute(sql, data)
|
||||||
except (OperationalError, ProgrammingError) as e:
|
except (OperationalError, ProgrammingError) as e:
|
||||||
log.error(f'DB Error: {e}. Retrying once...')
|
log.error(f'DB Error: {e}. Retrying once...')
|
||||||
sql_connect()
|
|
||||||
try:
|
try:
|
||||||
with lib_sql_core.engine.connect() as conn:
|
with lib_sql_core.engine.connect() as conn:
|
||||||
return conn.execute(sql, data)
|
return conn.execute(sql, data)
|
||||||
|
|||||||
@@ -15,8 +15,8 @@
|
|||||||
## 🔌 DB Connection Hardening (April 2026 Audit)
|
## 🔌 DB Connection Hardening (April 2026 Audit)
|
||||||
> Identified during pre-show review. Issues 1 and 2 likely explain observed random connection lags.
|
> Identified during pre-show review. Issues 1 and 2 likely explain observed random connection lags.
|
||||||
|
|
||||||
- [ ] **[P1] Remove zombie `db_connection.py` import** — `app/routers/api.py` imports `db` from `app/db_connection.py`, creating a parasitic second SQLAlchemy engine at startup that is never updated by `reconnect_db()` after bootstrap. The imported `db` is only used in a commented-out line (`api.py:268`). Fix: remove the import; delete or archive `db_connection.py`.
|
- [x] **[P1] Remove zombie `db_connection.py` import** — `app/routers/api.py` imports `db` from `app/db_connection.py`, creating a parasitic second SQLAlchemy engine at startup that is never updated by `reconnect_db()` after bootstrap. The imported `db` is only used in a commented-out line (`api.py:268`). Fix: remove the import; delete or archive `db_connection.py`.
|
||||||
- [ ] **[P1] Fix retry mechanism in `sql_update` / `run_sql_select`** — On `OperationalError`, both call `sql_connect()` → `reconnect_db()` which calls `engine.dispose()`, nuking the entire connection pool mid-flight. Under concurrent requests this kills other in-flight connections. Fix: remove the `sql_connect()` retry call; SQLAlchemy's `pool_pre_ping=True` already handles stale connections — just open a fresh `engine.connect()` for the retry without disposing the pool.
|
- [x] **[P1] Fix retry mechanism in `sql_update` / `run_sql_select`** — On `OperationalError`, both call `sql_connect()` → `reconnect_db()` which calls `engine.dispose()`, nuking the entire connection pool mid-flight. Under concurrent requests this kills other in-flight connections. Fix: remove the `sql_connect()` retry call; SQLAlchemy's `pool_pre_ping=True` already handles stale connections — just open a fresh `engine.connect()` for the retry without disposing the pool.
|
||||||
- [ ] **[P2] Add retry logic to `sql_insert` and `sql_select`** — Both are missing the `OperationalError` retry that `sql_update` and `run_sql_select` have. A DB blip during an INSERT (scan record, badge log, etc.) fails silently and returns `False` with no recovery attempt.
|
- [ ] **[P2] Add retry logic to `sql_insert` and `sql_select`** — Both are missing the `OperationalError` retry that `sql_update` and `run_sql_select` have. A DB blip during an INSERT (scan record, badge log, etc.) fails silently and returns `False` with no recovery attempt.
|
||||||
- [ ] **[P3] Guard `db = engine.connect()` in `lib_sql_core.py` with try/except** — Line 48 is a bare connect at module load time with no error handling. If MariaDB isn't ready (Docker race), this throws unhandled and crashes the worker. Wrap in try/except like `db_connection.py` already does.
|
- [ ] **[P3] Guard `db = engine.connect()` in `lib_sql_core.py` with try/except** — Line 48 is a bare connect at module load time with no error handling. If MariaDB isn't ready (Docker race), this throws unhandled and crashes the worker. Wrap in try/except like `db_connection.py` already does.
|
||||||
- [ ] **[P4] Expose `pool_size` / `max_overflow` as env vars** — `create_ae_engine()` calls `settings.DB.get('pool_size', 10)` but `settings.DB` property doesn't include those keys, so they're always hardcoded 10/20. Add `AE_DB_POOL_SIZE` / `AE_DB_POOL_MAX_OVERFLOW` to `config.py`.
|
- [ ] **[P4] Expose `pool_size` / `max_overflow` as env vars** — `create_ae_engine()` calls `settings.DB.get('pool_size', 10)` but `settings.DB` property doesn't include those keys, so they're always hardcoded 10/20. Add `AE_DB_POOL_SIZE` / `AE_DB_POOL_MAX_OVERFLOW` to `config.py`.
|
||||||
|
|||||||
Reference in New Issue
Block a user