Saving stress testing

fix(P2): add OperationalError retry to sql_insert, sql_select, sql_insert_or_update
All three were missing the transient-connection retry that sql_update and run_sql_select already had. On OperationalError (stale/dropped connection), each now retries once with a fresh engine.connect() without disposing the pool.

IntegrityError (duplicate key, FK violation, NOT NULL) continues to return None without retrying: the same data would fail again, and None signals a data conflict to callers, distinct from False (error) or an int (success).

sql_insert_or_update retry is safe because ON DUPLICATE KEY UPDATE is idempotent. sql_insert retry is safe because OperationalError means MariaDB rolled back.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
7 changed files with 252 additions and 14 deletions
--- a/app/config.py
+++ b/app/config.py
@@ -25,8 +25,10 @@ class Settings(BaseSettings):
    DB_PASS:   str = Field('',           env='AE_DB_PASSWORD')

    # Connection tuning
-    DB_CONNECT_TIMEOUT: int = Field(20,   env='AE_DB_CONNECTION_TIMEOUT')
-    DB_POOL_RECYCLE:    int = Field(1800, env='AE_DB_POOL_RECYCLE')
+    DB_CONNECT_TIMEOUT:  int = Field(20,   env='AE_DB_CONNECTION_TIMEOUT')
+    DB_POOL_RECYCLE:     int = Field(1800, env='AE_DB_POOL_RECYCLE')
+    DB_POOL_SIZE:        int = Field(10,   env='AE_DB_POOL_SIZE')
+    DB_POOL_MAX_OVERFLOW: int = Field(20,  env='AE_DB_POOL_MAX_OVERFLOW')

    # --- Logging ---
    LOG_PATH_APP: str = Field('/logs/aether_api.log', env='AE_API_LOG_PATH')
@@ -73,8 +75,10 @@ class Settings(BaseSettings):
            'name':            self.DB_NAME,
            'username':        self.DB_USER,
            'password':        self.DB_PASS,
-            'connect_timeout': self.DB_CONNECT_TIMEOUT,
-            'pool_recycle':    self.DB_POOL_RECYCLE,
+            'connect_timeout':  self.DB_CONNECT_TIMEOUT,
+            'pool_recycle':     self.DB_POOL_RECYCLE,
+            'pool_size':        self.DB_POOL_SIZE,
+            'max_overflow':     self.DB_POOL_MAX_OVERFLOW,
        }

    @property
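
The two new settings bound how many connections one engine may hold open. With SQLAlchemy's default QueuePool, an engine keeps up to `pool_size` persistent connections and may open up to `max_overflow` more under load, so the ceiling per engine is their sum. The following is a minimal sketch of that arithmetic, assuming each worker process builds its own engine. `pool_limits` and the `workers` count are hypothetical; only the two environment variable names and their defaults (10 and 20) come from the diff.

```python
import os
from typing import Mapping, Optional


def pool_limits(env: Optional[Mapping[str, str]] = None, workers: int = 1) -> dict:
    """Read the pool settings with the same defaults as the new config fields."""
    env = os.environ if env is None else env
    pool_size = int(env.get("AE_DB_POOL_SIZE", 10))
    max_overflow = int(env.get("AE_DB_POOL_MAX_OVERFLOW", 20))
    per_engine = pool_size + max_overflow  # QueuePool ceiling for one engine
    return {
        "pool_size": pool_size,
        "max_overflow": max_overflow,
        "max_connections_per_engine": per_engine,
        # Assumption: one engine per worker process.
        "max_connections_total": per_engine * workers,
    }


defaults = pool_limits(env={}, workers=4)
tuned = pool_limits(
    env={"AE_DB_POOL_SIZE": "5", "AE_DB_POOL_MAX_OVERFLOW": "0"}, workers=4
)
print(defaults["max_connections_total"], tuned["max_connections_total"])
```

Comparing the total against MariaDB's `max_connections` setting is one way to judge whether a given worker count and pool size can exhaust the server.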
--- a/app/lib_sql_core.py
+++ b/app/lib_sql_core.py
@@ -43,9 +43,15 @@ def create_ae_engine(uri: str):

 engine = create_ae_engine(db_uri)

-# DEPRECATED: Global shared 'db' connection. Use engine.connect() in context managers instead.
-# Keeping for legacy compatibility but will phase out usage in crud lib.
-db = engine.connect()
+# DEPRECATED: Global shared 'db' connection. Still used by lib_schema_v3.py and lib_api_crud_v3.py.
+# TODO (P3 full fix): migrate those two call sites to engine.connect() context managers, then remove this.
+# Bare connect guarded so a Docker startup race (MariaDB not yet ready) doesn't crash the worker.
+# If this fails, db=None — callers that hit it before reconnect_db() runs will raise AttributeError.
+try:
+    db = engine.connect()
+except Exception:
+    log.warning("DB SQL Core: Initial db connection failed at startup (MariaDB not ready?). Will retry via reconnect_db().")
+    db = None

 log.info('DB SQL Core: Initializing engine...')

--- a/app/lib_sql_crud.py
+++ b/app/lib_sql_crud.py
@@ -11,7 +11,7 @@ from sqlalchemy.exc import IntegrityError, OperationalError, ProgrammingError
 from app.log import log, logger_reset
 # CRITICAL: Import the core module to access current global state
 from app import lib_sql_core
-from app.lib_sql_core import sql_connect, set_last_sql_error
+from app.lib_sql_core import set_last_sql_error

 # log.setLevel(logging.DEBUG) # DEBUG, INFO, WARNING, ERROR, EXCEPTION, CRITICAL

@@ -63,11 +63,29 @@ def sql_insert(
                return result_insert.lastrowid
            return False
    except IntegrityError as e:
+        # Data constraint violation (duplicate key, FK mismatch, NOT NULL) — do NOT retry;
+        # the same data would fail again. Return None so callers can distinguish from errors.
        if trans: trans.rollback()
        log.error('Integrity error (likely duplicate). Returning None')
        log.debug(e)
        set_last_sql_error(e)
        return None
+    except OperationalError:
+        # Transient connection failure. The broken connection rolls back on MariaDB's side,
+        # so retrying with a fresh connection is safe.
+        if trans: trans.rollback()
+        log.warning('Operational error in sql_insert. Retrying once with fresh connection...')
+        try:
+            with lib_sql_core.engine.connect() as conn:
+                trans = conn.begin()
+                result_insert = conn.execute(sql_insert_stmt, data)
+                trans.commit()
+                if result_insert.rowcount == 1 and result_insert.lastrowid > 0:
+                    return result_insert.lastrowid
+                return False
+        except Exception as e:
+            set_last_sql_error(e)
+            return False
    except Exception as e:
        if trans: trans.rollback()
        log.error('Unknown exception in sql_insert. Returning False')
@@ -138,7 +156,6 @@ def sql_update(
    except OperationalError:
        if trans: trans.rollback()
        log.error('Operational error (gone away?). Retrying once...')
-        sql_connect()
        try:
            with lib_sql_core.engine.connect() as conn:
                trans = conn.begin()
@@ -199,6 +216,19 @@ def sql_insert_or_update(
            res = conn.execute(stmt, data)
            trans.commit()
            return res.lastrowid if res.lastrowid > 0 else True
+    except OperationalError:
+        # ON DUPLICATE KEY UPDATE is idempotent — safe to retry.
+        if trans: trans.rollback()
+        log.warning('Operational error in sql_insert_or_update. Retrying once...')
+        try:
+            with lib_sql_core.engine.connect() as conn:
+                trans = conn.begin()
+                res = conn.execute(stmt, data)
+                trans.commit()
+                return res.lastrowid if res.lastrowid > 0 else True
+        except Exception as e:
+            set_last_sql_error(e)
+            return False
    except Exception as e:
        if trans: trans.rollback()
        log.exception(e)
@@ -309,6 +339,21 @@ def sql_select(
                return [] if as_list else None

            rows = result.all()
+    except OperationalError:
+        # Transient connection failure — reads are always safe to retry.
+        log.error('Operational error in sql_select. Retrying once with fresh connection...')
+        try:
+            with lib_sql_core.engine.connect() as conn:
+                result = conn.execute(stmt, data)
+                if not result:
+                    return [] if as_list else None
+                if hasattr(result, 'returns_rows') and not result.returns_rows:
+                    return [] if as_list else None
+                rows = result.all()
+        except Exception as e:
+            log.error(f"SQL Fetch Error on retry: {e}")
+            set_last_sql_error(e)
+            return False
    except Exception as e:
        log.error(f"SQL Fetch Error: {e}")
        set_last_sql_error(e)
@@ -343,7 +388,6 @@ def run_sql_select(
            return conn.execute(sql, data)
    except (OperationalError, ProgrammingError) as e:
        log.error(f'DB Error: {e}. Retrying once...')
-        sql_connect()
        try:
            with lib_sql_core.engine.connect() as conn:
                return conn.execute(sql, data)
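
The idempotency argument for retrying `sql_insert_or_update` can be checked on a small scale. This sketch uses SQLite's `ON CONFLICT ... DO UPDATE` as a stand-in for MariaDB's `ON DUPLICATE KEY UPDATE`; the syntax differs, and SQLite 3.24 or newer is assumed. The `kv` table and the `upsert` and `snapshot` helpers are hypothetical.

```python
import sqlite3

# SQLite spelling of an upsert; MariaDB would use ON DUPLICATE KEY UPDATE instead.
UPSERT = (
    "INSERT INTO kv (k, v) VALUES (?, ?) "
    "ON CONFLICT(k) DO UPDATE SET v = excluded.v"
)


def upsert(conn: sqlite3.Connection, key: str, value: str) -> None:
    with conn:  # commit on success, roll back on error
        conn.execute(UPSERT, (key, value))


def snapshot(conn: sqlite3.Connection) -> list:
    return conn.execute("SELECT k, v FROM kv ORDER BY k").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

upsert(conn, "theme", "dark")
after_first = snapshot(conn)

upsert(conn, "theme", "dark")  # simulated retry of the same statement and data
after_retry = snapshot(conn)

print(after_first == after_retry, after_retry)
```

The property holds only when the update clause assigns values taken from the supplied row. A clause such as `SET hits = hits + 1` changes state on every replay, so retrying that statement would not be safe in the same way.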
--- a/app/routers/api.py
+++ b/app/routers/api.py
@@ -4,15 +4,15 @@ from typing import Dict, List, Optional, Set, Union
 from sqlalchemy import text
 import json
 import time
-import secrets
+# import secrets
 import jwt as pyjwt # Avoid conflict with app.lib_jwt

-from app.db_connection import db
+# from app.db_connection import db
 from app.lib_general import sign_jwt, decode_jwt, log, logging
 from app.config import settings
 from app.db_sql import sql_insert, sql_update, sql_select, redis_lookup_id_random, get_id_random

-from app.routers.api_crud import delete_obj_template, get_obj_template, get_obj_li_template, patch_obj_template, post_obj_template
+# from app.routers.api_crud import delete_obj_template, get_obj_template, get_obj_li_template, patch_obj_template, post_obj_template
 from app.routers.dependencies_v3 import DeprecationParams
 from app.models.api_models import Api_Base
 from app.models.response_models import Resp_Body_Base, mk_resp
--- a/documentation/TODO__Agents.md
+++ b/documentation/TODO__Agents.md
@@ -12,6 +12,16 @@
 - [x] **Config Refactor:** Switch `app/config.py` to `pydantic-settings` to use direct Env Vars (Stop mounting config files).
 - [x] **Locking:** Generate a `requirements.lock` for bit-identical builds.

+## 🔌 DB Connection Hardening (April 2026 Audit)
+> Identified during pre-show review. Issues 1 and 2 likely explain observed random connection lags.
+
+- [x] **[P1] Remove zombie `db_connection.py` import** — `app/routers/api.py` imports `db` from `app/db_connection.py`, creating a parasitic second SQLAlchemy engine at startup that is never updated by `reconnect_db()` after bootstrap. The imported `db` is only used in a commented-out line (`api.py:268`). Fix: remove the import; delete or archive `db_connection.py`.
+- [x] **[P1] Fix retry mechanism in `sql_update` / `run_sql_select`** — On `OperationalError`, both call `sql_connect()` → `reconnect_db()` which calls `engine.dispose()`, nuking the entire connection pool mid-flight. Under concurrent requests this kills other in-flight connections. Fix: remove the `sql_connect()` retry call; SQLAlchemy's `pool_pre_ping=True` already handles stale connections — just open a fresh `engine.connect()` for the retry without disposing the pool.
+- [x] **[P2] Add retry logic to `sql_insert` and `sql_select`** — Added `OperationalError` retry (single fresh connection attempt) to `sql_insert`, `sql_select`, and `sql_insert_or_update`. `IntegrityError` (duplicate key, FK violation) correctly bypasses retry and returns `None` — retrying the same data would fail again.
+- [x] **[P3] Guard `db = engine.connect()` in `lib_sql_core.py` with try/except** — Wrapped in try/except; sets `db = None` on failure so Docker startup race no longer crashes the worker.
+    - [ ] **[P3 full]** Migrate `lib_schema_v3.py:39` and `lib_api_crud_v3.py:166` off the global `db` to `engine.connect()` context managers, then remove the global `db` entirely.
+- [x] **[P4] Expose `pool_size` / `max_overflow` as env vars** — `create_ae_engine()` calls `settings.DB.get('pool_size', 10)` but `settings.DB` property doesn't include those keys, so they're always hardcoded 10/20. Add `AE_DB_POOL_SIZE` / `AE_DB_POOL_MAX_OVERFLOW` to `config.py`.
+
 ## 📋 Feature Tasks
 - [x] **Core Isolation:** Harden `apply_forced_account_filter` to Fail-Closed.
 - [x] **IDAA Baseline:** Remove `public_read` from Event, CMS, and Archive objects.
--- a/tests/README.md
+++ b/tests/README.md
@@ -7,7 +7,7 @@ This directory contains the automated and manual test scripts for the Aether Fas
 - **`unit/`**: Isolated logic tests. These use heavy mocking to bypass database and network requirements. Fast and safe to run in any environment.
 - **`integration/`**: Local environment tests. These verify component interactions, often requiring a connection to the local MariaDB/Redis instance.
 - **`e2e/` (End-to-End)**: Network-based API tests. These use the `requests` library to call the live API endpoints at `https://dev-api.oneskyit.com`.
-- **`tools/`**: Utility scripts for administrative tasks like registry generation or Docker exploration.
+- **`tools/`**: Utility scripts for administrative tasks like registry generation, Docker exploration, and performance stress testing.
 - **`archive/`**: Legacy or deprecated scripts kept for historical reference.

 ## 📜 Standardized E2E Suite (`tests/e2e/`)
@@ -38,6 +38,28 @@ These consolidated scripts are the primary verification tool for the V3 API.

 ---

+## 🔧 Tools (`tests/tools/`)
+
+| Script | Description |
+| :--- | :--- |
+| `stress_list_queries.py` | **Read-only concurrency stress test.** Fires N worker threads making R sequential requests across all V3 list endpoints. Reports per-endpoint p50/p95/max latency and error counts. CLI: `--workers` (default 10), `--requests` (default 5), `--limit` (default 20), `--base-url` (default dev API). Exit code 1 on any error. |
+| `tool_generate_registry.py` | Generates the object type registry from source definitions. |
+| `tool_mcp_docker_explorer.py` | Explores running Docker containers via the MCP bridge. |
+
+**Stress test quick reference:**
+```bash
+# Baseline (10 workers, 5 rounds, 400 total requests)
+./environment/bin/python3 tests/tools/stress_list_queries.py
+
+# Heavy load (35 workers, 5 rounds, 1400 total requests)
+./environment/bin/python3 tests/tools/stress_list_queries.py --workers 35 --requests 5
+
+# Target a different environment
+./environment/bin/python3 tests/tools/stress_list_queries.py --base-url https://api.oneskyit.com --workers 5
+```
+
+---
+
 ## 🛠️ Shared Helpers

 - **`mock_config_helper.py`**: A critical utility that mocks `app.config.settings` before other modules are imported. Use this in unit tests.
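
The request totals quoted in the README quick reference follow from workers × rounds × endpoints, the same product the stress script computes as `total_requests`. A small sketch of that arithmetic; `expected_total` is a hypothetical helper, and the endpoint count of 8 is taken from the `ENDPOINTS` list in `stress_list_queries.py`.

```python
ENDPOINT_COUNT = 8  # number of entries in ENDPOINTS in stress_list_queries.py


def expected_total(workers: int, rounds: int, endpoints: int = ENDPOINT_COUNT) -> int:
    """Each worker runs `rounds` passes, and each pass hits every endpoint once."""
    return workers * rounds * endpoints


baseline = expected_total(workers=10, rounds=5)    # script defaults
heavy = expected_total(workers=35, rounds=5)       # --workers 35 --requests 5
small = expected_total(workers=5, rounds=5)        # --workers 5, default rounds
print(baseline, heavy, small)
```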
--- a/tests/tools/stress_list_queries.py
+++ b/tests/tools/stress_list_queries.py
@@ -0,0 +1,152 @@
+"""
+Read-only concurrent stress test against V3 list endpoints.
+Fires N workers each making R sequential requests across a set of
+list endpoints, then prints per-endpoint latency stats and an
+overall error summary.
+
+Usage (from project root):
+    ./environment/bin/python3 tests/tools/stress_list_queries.py
+    ./environment/bin/python3 tests/tools/stress_list_queries.py --workers 20 --requests 10
+    ./environment/bin/python3 tests/tools/stress_list_queries.py --base-url https://api.oneskyit.com --workers 5
+"""
+import argparse
+import math
+import statistics
+import sys
+import time
+from concurrent.futures import ThreadPoolExecutor, as_completed
+
+import requests
+
+DEFAULT_BASE_URL = "https://test-api.oneskyit.com"
+API_KEY          = "nT0jPeiCfxSifkiDZur9jA"
+ACCOUNT_ID       = "_XY7DXtc9MY"  # One Sky IT Demo
+
+HEADERS = {
+    "x-aether-api-key": API_KEY,
+    "x-account-id":     ACCOUNT_ID,
+}
+
+# Read-only list endpoints to hammer. Each is a (label, path) tuple.
+ENDPOINTS = [
+    ("event list",              "/v3/crud/event/"),
+    ("event_session list",      "/v3/crud/event_session/"),
+    ("event_badge list",        "/v3/crud/event_badge/"),
+    ("event_file list",         "/v3/crud/event_file/"),
+    ("person list",             "/v3/crud/person/"),
+    ("journal list",            "/v3/crud/journal/"),
+    ("hosted_file list",        "/v3/crud/hosted_file/"),
+    ("data_store list",         "/v3/crud/data_store/"),
+]
+
+
+def percentile(sorted_times: list[float], pct: float) -> float:
+    """Return the pct-th percentile of a pre-sorted list (0–100)."""
+    if not sorted_times:
+        return 0.0
+    k = (len(sorted_times) - 1) * pct / 100
+    lo, hi = int(math.floor(k)), int(math.ceil(k))
+    return sorted_times[lo] + (sorted_times[hi] - sorted_times[lo]) * (k - lo)
+
+
+def do_request(label: str, url: str, session: requests.Session) -> dict:
+    t0 = time.perf_counter()
+    try:
+        r = session.get(url, headers=HEADERS, timeout=15)
+        elapsed = (time.perf_counter() - t0) * 1000
+        return {"label": label, "status": r.status_code, "ms": elapsed, "error": None}
+    except Exception as e:
+        elapsed = (time.perf_counter() - t0) * 1000
+        return {"label": label, "status": 0, "ms": elapsed, "error": str(e)}
+
+
+def worker(worker_id: int, requests_per_worker: int, base_url: str, limit: int) -> list[dict]:
+    results = []
+    with requests.Session() as session:
+        for _ in range(requests_per_worker):
+            for label, path in ENDPOINTS:
+                url = f"{base_url}{path}?limit={limit}"
+                results.append(do_request(label, url, session))
+    return results
+
+
+def print_result(label, success, message=""):
+    icon = "✅" if success else "❌"
+    suffix = f" — {message}" if message else ""
+    print(f"  [{icon}] {label}{suffix}")
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Concurrent read-only stress test")
+    parser.add_argument("--workers",  type=int, default=10,             help="Concurrent worker threads (default: 10)")
+    parser.add_argument("--requests", type=int, default=5,              help="Requests per worker per endpoint (default: 5)")
+    parser.add_argument("--limit",    type=int, default=20,             help="?limit= param on each list request (default: 20)")
+    parser.add_argument("--base-url", type=str, default=DEFAULT_BASE_URL, help=f"API base URL (default: {DEFAULT_BASE_URL})")
+    args = parser.parse_args()
+
+    total_requests = args.workers * args.requests * len(ENDPOINTS)
+    print(f"\n🔥 Stress Test: {args.workers} workers × {args.requests} rounds × {len(ENDPOINTS)} endpoints = {total_requests} total requests")
+    print(f"   Target: {args.base_url}  limit={args.limit}\n")
+
+    all_results: list[dict] = []
+    suite_start = time.perf_counter()
+
+    with ThreadPoolExecutor(max_workers=args.workers) as pool:
+        futures = [pool.submit(worker, wid, args.requests, args.base_url, args.limit) for wid in range(args.workers)]
+        for f in as_completed(futures):
+            all_results.extend(f.result())
+
+    suite_elapsed = time.perf_counter() - suite_start
+
+    # --- Per-endpoint stats ---
+    print("─" * 60)
+    print(f"{'Endpoint':<35} {'OK':>5} {'ERR':>5} {'p50ms':>7} {'p95ms':>7} {'maxms':>7}")
+    print("─" * 60)
+
+    by_label: dict[str, list[dict]] = {}
+    for r in all_results:
+        by_label.setdefault(r["label"], []).append(r)
+
+    any_fail = False
+    for label, _ in ENDPOINTS:
+        rows = by_label.get(label, [])
+        ok   = [r for r in rows if r["status"] in (200, 201, 404) and not r["error"]]
+        err  = [r for r in rows if r not in ok]
+        times = sorted(r["ms"] for r in ok)
+        p50 = statistics.median(times) if times else 0
+        p95 = percentile(times, 95)
+        mx  = max(times) if times else 0
+        flag = "" if not err else " ⚠"
+        if err:
+            any_fail = True
+        print(f"  {label:<33} {len(ok):>5} {len(err):>5} {p50:>7.0f} {p95:>7.0f} {mx:>7.0f}{flag}")
+
+    print("─" * 60)
+
+    # --- Error detail ---
+    errors = [r for r in all_results if r["error"] or r["status"] not in (200, 201, 404)]
+    if errors:
+        print(f"\n⚠  {len(errors)} errors encountered:")
+        seen = set()
+        for r in errors:
+            key = (r["label"], r["status"], r["error"])
+            if key not in seen:
+                seen.add(key)
+                print(f"   [{r['status']}] {r['label']}: {r['error'] or 'non-2xx/404'}")
+    else:
+        print("\n✅ Zero errors.")
+
+    # --- Overall summary ---
+    all_times = sorted(r["ms"] for r in all_results if not r["error"])
+    rps = total_requests / suite_elapsed
+    print(f"\n🏁 {total_requests} requests in {suite_elapsed:.2f}s  ({rps:.1f} req/s)")
+    if all_times:
+        print(f"   p50={statistics.median(all_times):.0f}ms  "
+              f"p95={percentile(all_times, 95):.0f}ms  "
+              f"max={max(all_times):.0f}ms\n")
+
+    sys.exit(1 if any_fail else 0)
+
+
+if __name__ == "__main__":
+    main()
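
The interpolated percentile used by the script can be traced by hand on a small sample. The sketch below repeats the same formula on made-up latencies and cross-checks it against the standard library's `statistics.quantiles` with `method="inclusive"`, which uses the same rank definition. The sample values are illustrative, not measured results.

```python
import math
import statistics


def percentile(sorted_times: list[float], pct: float) -> float:
    """Linear interpolation between the two nearest ranks (same logic as the script)."""
    if not sorted_times:
        return 0.0
    k = (len(sorted_times) - 1) * pct / 100
    lo, hi = math.floor(k), math.ceil(k)
    return sorted_times[lo] + (sorted_times[hi] - sorted_times[lo]) * (k - lo)


samples = [10.0, 20.0, 30.0, 40.0]  # made-up latencies in ms, already sorted

p50 = percentile(samples, 50)  # k = 1.5: halfway between 20 and 30
p95 = percentile(samples, 95)  # k = 2.85: 85% of the way from 30 to 40

# Cross-check: 99 cut points, so index 94 is the 95th percentile.
cuts = statistics.quantiles(samples, n=100, method="inclusive")
print(f"p50={p50:.1f} p95={p95:.1f} stdlib_p95={cuts[94]:.1f}")
```

With only four samples, a nearest-rank method would report the maximum (40) as p95, while interpolation lands between the two highest values. That is the small-sample behavior the commit message refers to.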
Author	SHA1	Message	Date
Scott Idem	e71906b59a	Saving stress testing	2026-04-19 13:57:31 -04:00
Scott Idem	3d89e95c24	fix(P2): add OperationalError retry to sql_insert, sql_select, sql_insert_or_update All three were missing the transient-connection retry that sql_update and run_sql_select already had. On OperationalError (stale/dropped connection), each now retries once with a fresh engine.connect() without disposing the pool. IntegrityError (duplicate key, FK violation, NOT NULL) continues to return None without retrying — the same data would fail again and None signals a data conflict to callers, distinct from False (error) or an int (success). sql_insert_or_update retry is safe because ON DUPLICATE KEY UPDATE is idempotent. sql_insert retry is safe because OperationalError means MariaDB rolled back. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 19:41:26 -04:00
Scott Idem	3db5f7c749	fix(P3): guard startup db connection with try/except in lib_sql_core Wraps the deprecated global `db = engine.connect()` in a try/except so a Docker startup race (MariaDB not yet ready) no longer crashes the Gunicorn worker before it can serve any requests. Sets db=None on failure; reconnect_db() on the lifespan bootstrap path re-establishes it once credentials are confirmed. TODO (P3 full): migrate lib_schema_v3.py:39 and lib_api_crud_v3.py:166 off the global db to engine.connect() context managers, then remove it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 19:28:28 -04:00
Scott Idem	55debc8009	feat: add stress_list_queries tool and document in tests/README Concurrent read-only stress test against V3 list endpoints. Improvements over initial version: --base-url, --limit CLI flags, interpolated percentile calculation (accurate on small sample sizes), and pre-sorted times passed to overall summary. README: added tools table with quick-reference usage examples. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 18:12:01 -04:00
Scott Idem	ace00929f2	feat: expose DB pool_size and max_overflow as env vars (P4) Added AE_DB_POOL_SIZE and AE_DB_POOL_MAX_OVERFLOW to config.py with defaults matching prior hardcoded values (10/20). Wired into settings.DB property so create_ae_engine() reads them without fallback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 18:08:01 -04:00
Scott Idem	c7444a8a89	fix: remove pool-nuking reconnect_db() from OperationalError retry paths On OperationalError, sql_update and run_sql_select were calling sql_connect() → reconnect_db() which disposes the entire connection pool mid-flight, killing other in-flight connections under concurrency. Removed the sql_connect() calls; the existing retry blocks already open a fresh engine.connect() context manager, and pool_pre_ping=True handles stale connection detection. Also drops the now-unused sql_connect import. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 17:24:47 -04:00
Scott Idem	8f1fe5d4df	Fixing this: #3 (zombie import) is genuinely a 2-line fix — remove the import from api.py:10 and move db_connection.py to trash. Zero functional change since db is only in a commented-out line.	2026-04-16 20:09:12 -04:00