Product
Phase F — Production-grade platform (non-functional requirements)
Added 2026-06-13 at the owner's request. These are the cross-cutting qualities that turn the product from one that "works" into a "premium, trustworthy, scalable platform". None are user-facing features; all are platform foundations.
What the platform is becoming
It started as a CRM. It is now a multi-tenant, AI-native, white-label agency platform — a HighLevel/GHL-class all-in-one operating system that an agency runs, white-labels and resells to its clients (sub-accounts), with an installable module marketplace, an AI onboarding concierge + AI inbox assistant, per-account custom roles, omnichannel inbox, native automation, and (this phase) production-grade observability, security and scalability. Differentiator vs GHL: AI-first, cleaner UX, self-hosted control of the data and identity graph.
F1 — Monitoring & observability (owner asked: "what to include?")
Recommendation, in priority order:
- In-app "System Health" view (master-only) — the highest-value, most product-aligned piece. A live page showing per-subsystem AND per-account status: Channels (connected/error + last sync), Email connector (last poll), AI (configured? last call ok? usage/cost), Automations (running? failures), Deploys (last status per app), DB (connections, slow queries), background jobs/cron. Green / yellow / red per function, which is exactly the "what works, what is having problems, what is stopped" view the owner described.
- Error tracking — self-hosted Sentry / GlitchTip. Frontend + backend exceptions tagged with account-id / user-id / route; alert on spikes.
- Audit log (security, non-negotiable) — immutable record of sensitive actions: impersonation, role changes, account suspend/activate, integration key changes, logins. Per account. Required for trust + compliance.
- Structured logging — every request logs request-id + tenant-id + user-id, so any issue traces to an account.
- Uptime monitoring — Uptime Kuma (self-hosted) pinging public endpoints.
- Metrics dashboards — request latency p50/p95/p99, error rate, throughput, DB query time, queue depth, AI tokens/cost, per-account usage (Grafana+Prometheus or Coolify metrics).
- Alerts — downtime, error-rate spike, failed deploy, cron failure, cert expiry, disk/memory thresholds (later: payment failures).
- Status page (optional, public) — for customers.
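The green / yellow / red rollup behind the System Health view can be sketched as a "worst status wins" aggregation. This is a minimal sketch, not the platform's implementation: the `Check` shape, the field names and the sample subsystem names are assumptions made for illustration.

```typescript
// Hypothetical types: field and subsystem names are illustrative only.
type Status = "green" | "yellow" | "red";

interface Check {
  accountId: string; // tenant or sub-account the check belongs to
  subsystem: string; // e.g. "channels", "email", "ai", "automations"
  status: Status;
  detail?: string;   // short human-readable reason, if any
}

const SEVERITY: Record<Status, number> = { green: 0, yellow: 1, red: 2 };

// The worst status in a set wins; an empty set is reported as green.
function worst(statuses: Status[]): Status {
  return statuses.reduce<Status>(
    (acc, s) => (SEVERITY[s] > SEVERITY[acc] ? s : acc),
    "green",
  );
}

// Roll raw checks up into one status per subsystem or per account.
function rollup(checks: Check[], key: "subsystem" | "accountId"): Map<string, Status> {
  const grouped = new Map<string, Status[]>();
  for (const c of checks) {
    const list = grouped.get(c[key]) ?? [];
    list.push(c.status);
    grouped.set(c[key], list);
  }
  const out = new Map<string, Status>();
  grouped.forEach((statuses, k) => out.set(k, worst(statuses)));
  return out;
}
```

Calling `rollup(checks, "subsystem")` would feed the per-function row of the page and `rollup(checks, "accountId")` the per-account row, both from the same raw checks.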
F2 — Database RLS (defense-in-depth) [HIGH PRIORITY]
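Today isolation is application-level only (multi-tenant plugin + a where[tenant] filter). A dedicated non-superuser DB role + per-tenant Row-Level-Security policies + FORCE ROW LEVEL SECURITY make the database itself refuse cross-tenant rows even if an app query forgets to filter (the services leak proved why this matters). FORCE matters because a table's owner otherwise bypasses its own policies, and superusers and BYPASSRLS roles always bypass them. Plan: app role NOSUPERUSER and NOBYPASSRLS, tenant_id policies on every scoped table, tenant set per connection/transaction (prefer transaction-local when connections are pooled, as with PgBouncer in F7, so one tenant's setting never carries over into another tenant's request).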
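As a rough illustration of the plan above, the helper below generates the PostgreSQL statements for one scoped table plus the per-transaction tenant query. It is a sketch under stated assumptions: the setting name `app.tenant_id`, the policy name `tenant_isolation` and a `tenant_id` column comparable as text are all hypothetical, not taken from the actual schema.

```typescript
// Assumed names: the "app.tenant_id" setting, the tenant_id column and the
// policy name are illustrative, not taken from the platform's schema.
const TENANT_SETTING = "app.tenant_id";

// Only plain lowercase identifiers are accepted, so a table name can be
// interpolated into DDL without quoting surprises.
function assertIdentifier(name: string): string {
  if (!/^[a-z_][a-z0-9_]*$/.test(name)) {
    throw new Error(`unsafe identifier: ${name}`);
  }
  return name;
}

// DDL that makes PostgreSQL itself enforce tenant isolation on one table.
function rlsStatements(table: string): string[] {
  const t = assertIdentifier(table);
  // current_setting(..., true) yields NULL when the setting is missing,
  // so the predicate matches no rows instead of raising an error.
  const predicate = `tenant_id::text = current_setting('${TENANT_SETTING}', true)`;
  return [
    `ALTER TABLE ${t} ENABLE ROW LEVEL SECURITY`,
    // FORCE applies the policies to the table owner as well.
    `ALTER TABLE ${t} FORCE ROW LEVEL SECURITY`,
    `CREATE POLICY tenant_isolation ON ${t} USING (${predicate}) WITH CHECK (${predicate})`,
  ];
}

// Parameterized query to run at the start of each transaction; the third
// argument of set_config (true) makes the value transaction-local.
function setTenantQuery(tenantId: string): { text: string; values: string[] } {
  return {
    text: "SELECT set_config($1, $2, true)",
    values: [TENANT_SETTING, tenantId],
  };
}
```

The `{ text, values }` shape is the query-config form accepted by node-postgres; another driver would need the equivalent parameterized call. Because the setting is transaction-local, the query only has an effect when it runs inside the same transaction as the tenant's reads and writes.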
F3 — API quality
Consistent REST shape (incl. errors), versioning (/api/v1), idempotency keys on mutating endpoints, a documented public API + signed webhooks for integrations (a platform needs these), pagination everywhere, OpenAPI docs.
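To make the idempotency-key item concrete, here is a minimal in-memory sketch: a mutating handler runs at most once per account and key, and a retry with the same key receives the first result. The class name, the `StoredResponse` shape and the in-memory `Map` are assumptions for illustration; a real deployment would need shared storage and an expiry policy.

```typescript
// Hypothetical response shape kept per idempotency key.
interface StoredResponse<T> {
  status: number;
  body: T;
}

class IdempotencyStore<T> {
  // In-memory only: a multi-instance deployment would need shared storage.
  private readonly results = new Map<string, Promise<StoredResponse<T>>>();

  // Runs `handler` at most once per (accountId, key). A concurrent or later
  // retry with the same key gets the promise of the first attempt.
  run(
    accountId: string,
    key: string,
    handler: () => Promise<StoredResponse<T>>,
  ): Promise<StoredResponse<T>> {
    // Keys are scoped per account so two tenants can reuse the same key.
    const scoped = `${accountId}:${key}`;
    const existing = this.results.get(scoped);
    if (existing !== undefined) return existing;

    const pending = handler();
    this.results.set(scoped, pending);
    // A failed attempt is forgotten so the client can retry with the same key.
    pending.catch(() => {
      this.results.delete(scoped);
    });
    return pending;
  }
}
```

Storing the promise rather than the resolved value means two requests that arrive at the same moment share one execution instead of racing each other.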
F4 — Security (well-founded, not by obscurity)
Layered: the RLS above + server-side authz on every route + input validation at the boundary + secrets in a secrets manager (not environment variables where avoidable) + 2FA/MFA for admins + the fingerprint hardening already started (no X-Powered-By, neutral names, lock down the public Payload API) + dependency scanning + a periodic review. Obscurity is a thin layer on top, never the defense.
F5 — Rate limiting
Per-IP + per-account + per-endpoint limits, strict on auth (login/reset) and the public API, with sane 429s. Protects against brute-force, scraping and abuse. Edge (Traefik/middleware) + app-level (Redis token bucket).
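The token-bucket idea named above can be sketched in a few lines. This version keeps its state in memory and takes the clock as a parameter so it is easy to test; the plan above keeps the state in Redis so that all app instances share it. The class name, the key format and the limits are illustrative assumptions, and the refill rate is assumed to be positive.

```typescript
interface Bucket {
  tokens: number;    // tokens currently available
  updatedAt: number; // time of the last refill, in milliseconds
}

class TokenBucketLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly capacity: number;        // burst size
  private readonly refillPerSecond: number; // sustained rate, assumed > 0

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  // Returns whether the request is allowed and, if not, how many seconds
  // the caller should wait before the next token is available.
  take(key: string, nowMs: number): { allowed: boolean; retryAfterSeconds: number } {
    const b = this.buckets.get(key) ?? { tokens: this.capacity, updatedAt: nowMs };
    // Refill in proportion to the time elapsed, capped at the capacity.
    const elapsedSeconds = Math.max(0, nowMs - b.updatedAt) / 1000;
    b.tokens = Math.min(this.capacity, b.tokens + elapsedSeconds * this.refillPerSecond);
    b.updatedAt = nowMs;
    this.buckets.set(key, b);

    if (b.tokens >= 1) {
      b.tokens -= 1;
      return { allowed: true, retryAfterSeconds: 0 };
    }
    const wait = Math.ceil((1 - b.tokens) / this.refillPerSecond);
    return { allowed: false, retryAfterSeconds: wait };
  }
}
```

Per-IP, per-account and per-endpoint limits then become a matter of key choice, for example a hypothetical `login:ip:203.0.113.7` with a strict limiter and `api:account:42` with a looser one. A denied request would map to a 429 response whose Retry-After header carries `retryAfterSeconds`.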
F6 — Caching
Redis for sessions + hot data; HTTP/CDN caching for static + public pages; React cache() + Next data cache for per-request dedupe; query-result caching for expensive reads; ISR where applicable. Cuts latency + DB load.
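For the query-result caching item, a tenant-scoped cache with a time-to-live might look like the sketch below. It is an in-memory stand-in for what the plan above places in Redis; the class name, the explicit clock parameter and the invalidate-per-tenant strategy are assumptions chosen to keep the example small.

```typescript
interface Entry<V> {
  value: V;
  expiresAt: number; // absolute expiry time in milliseconds
}

class TenantTtlCache<V> {
  // One inner map per tenant, so a cached result can never be served
  // to a different tenant and a whole tenant can be dropped at once.
  private readonly byTenant = new Map<string, Map<string, Entry<V>>>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns the cached value for (tenantId, key) if it is still fresh,
  // otherwise calls `compute`, stores the result and returns it.
  getOrCompute(tenantId: string, key: string, nowMs: number, compute: () => V): V {
    let entries = this.byTenant.get(tenantId);
    if (entries === undefined) {
      entries = new Map<string, Entry<V>>();
      this.byTenant.set(tenantId, entries);
    }
    const hit = entries.get(key);
    if (hit !== undefined && hit.expiresAt > nowMs) return hit.value;

    const value = compute();
    entries.set(key, { value, expiresAt: nowMs + this.ttlMs });
    return value;
  }

  // Drops every cached result of one tenant, for example after a write.
  invalidateTenant(tenantId: string): void {
    this.byTenant.delete(tenantId);
  }
}
```

Scoping the cache by tenant keeps it consistent with the isolation goals of F2: a cache that ignored the tenant could leak rows that the database would have refused to return.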
F7 — Scalability (premium target)
Stateless app tier → horizontal scaling; DB connection pooling (PgBouncer); background queues for heavy/slow work (email, AI, exports, webhooks) instead of inline; read replicas for reporting; per-tenant resource fairness; designed for multi-region readiness. Goal: scale to many agencies × many sub-accounts without per-account degradation.
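One simple way to approach per-tenant resource fairness in the background queues is round-robin dequeuing across tenants, so a single account with a large backlog cannot starve the others. The sketch below is illustrative only: the `Job` shape and class name are hypothetical, and a production queue would be persistent rather than in memory.

```typescript
// Hypothetical job shape; real jobs would carry a payload as well.
interface Job {
  tenantId: string;
  name: string;
}

class FairQueue {
  private readonly queues = new Map<string, Job[]>();
  // Tenants that currently have pending jobs, in turn order.
  private readonly order: string[] = [];

  enqueue(job: Job): void {
    const existing = this.queues.get(job.tenantId);
    if (existing !== undefined) {
      existing.push(job);
      return;
    }
    // First pending job for this tenant: it joins the back of the rotation.
    this.queues.set(job.tenantId, [job]);
    this.order.push(job.tenantId);
  }

  // Takes one job from the tenant whose turn it is, then moves that tenant
  // to the back of the rotation if it still has work pending.
  dequeue(): Job | undefined {
    const tenantId = this.order.shift();
    if (tenantId === undefined) return undefined;
    const pending = this.queues.get(tenantId) ?? [];
    const job = pending.shift();
    if (pending.length > 0) {
      this.order.push(tenantId);
    } else {
      this.queues.delete(tenantId);
    }
    return job;
  }
}
```

With three jobs queued for one tenant and then one job for a second tenant, the second tenant's job is served second rather than fourth.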
Suggested sequencing
RLS (F2) and the System-Health view + audit log (F1) first — they are the trust/security foundation. Then rate limiting (F5), caching/queues (F6/F7), error tracking, and API hardening (F3/F4) as the platform takes real load.