Architecture
ADR 0002 — Feedback OS: Omnisystem architecture (multi-tenant, scalable to 100+ clients)
- Date: 2026-06-09
- Status: Proposal (master design) — supersedes and extends ADR 0001
- Authors: Claude Code (development team) · Pamela (product)
- Audience: Feedback Studios team (2 people) + Claude Code as the engineering team.
Master architecture document. It does not describe "what exists today," but rather the target system and the path to reach it without rebuilding anything when there are 100+ clients. It is intentionally ambitious: the stated goal is to be the best marketing agency platform on the market, self-host-first, with no recurring fees except where honestly justifiable at scale.
Table of Contents
- Vision and architecture principles
- Product scope and capabilities
- Data architecture and multi-tenancy
- Service and application architecture
- API and integration contracts
- Infrastructure and scaling
- Security, identity, and compliance
- Observability and reliability
- AI in the platform
- Phased roadmap
- Honest risks and trade-offs
- Executive summary of decisions
1. Vision and architecture principles
1.1 What "the best agency system" means
It is not "many features." It is a single business operating system where:
- A lead arriving through the website, a message in Chatwoot, a Meta Ads campaign, an invoice, and a project task all live in the same data graph, with a single client identity and a single history.
- The agency proves value with data (ROAS, attribution, per-client reporting) instead of attaching screenshots.
- The end client logs in to a branded portal and sees their campaigns, approves creatives, and pays, without scattered email threads.
- Everything is automatable (n8n + events) and buildable by Claude Code (a single language, a single repo, typed contracts).
The competitive advantage is not the code: it is total integration + unified data + marginal cost near zero per new client.
1.2 Principles (non-negotiable)
| # | Principle | Concrete implication |
|---|---|---|
| P1 | Multi-tenant from day 1 | tenant_id on every piece of client data; isolation enforced in the DB, not in application code. |
| P2 | Single source of truth | A single relational database (PostgreSQL) as the source; Notion/Chatwoot/n8n are mirrors/actuators, never the truth. |
| P3 | API-first | Every capability exists first as a typed API before becoming a screen. The UI consumes its own public API. |
| P4 | Event-driven | State changes emit domain events (outbox). Integrations react to events; they are not coupled to the monolith. |
| P5 | Self-host-first, pragmatic | Open-source on the VPS by default. We pay only where self-hosting puts the business at risk (deliverable email, off-site backups, perhaps errors). Documented and reversible. |
| P6 | Scalable to 100+ (and 1000+) without rebuilding | Decisions that do NOT change shape as we grow: the multi-tenant model, the API contract, and the event bus are chosen so that scaling is "more machine/more nodes," not "rewrite." |
| P7 | Secure by default | Deny-by-default, RLS in Postgres, secrets out of git, immutable audit, GDPR built in. |
| P8 | Observable | Centralized logs/metrics/traces and SLOs from before the first paying client. |
| P9 | Automatable with Claude Code | TypeScript end-to-end, declarative schemas, autogenerated OpenAPI, versioned migrations, tests. The system documents itself so the AI can operate it. |
| P10 | Low marginal cost per tenant | A new client = new rows + config, not new infrastructure. |
1.3 The central tension (and how it is resolved)
"Self-hosted without subscriptions" vs "the best system on the market at scale."
It is resolved with one rule: self-host by default; pay only when a self-host failure puts the business or the client's data at risk. Three honest exceptions (detailed in §11): deliverable transactional email, off-site backups, and (optional) managed error tracking. Everything else runs on the VPS. The realistic recurring total at medium scale: tens of €/month, not thousands.
2. Product scope and capabilities
The platform is organized into domain modules (not microservices — see §4). Each module is a bounded context with its own entities, its own API, and its own events.
2.1 Module map and priority
| Module | Capability | Priority | Notes |
|---|---|---|---|
| Identity and tenancy | Users, organizations (agency/client), memberships, roles, SSO | MVP | Foundation of everything. A single identity. |
| CRM | Contacts, companies, leads, pipeline, scoring, activities | MVP | The heart of "sell more." |
| Projects and tasks | Per-client projects, tasks, milestones, time-tracking | MVP | Replaces/elevates the use of Notion. |
| Communication | Embedded Chatwoot, inbound/outbound email, unified timeline | MVP | One inbox per client. |
| Reporting and analytics | Per-client dashboards, KPIs, ROAS, basic attribution | MVP | Ingestion from Meta/Google Ads. |
| Client portal | Client login, view reports, approve creatives, view invoices | MVP+ | Same backend, separate app. |
| Assets and approvals | Lightweight DAM, versions, approval flow, comments | MVP+ | Storage in MinIO/S3. |
| Campaign management | Connect ad accounts, sync multichannel campaigns/insights | Phase 2 | Meta/Google first; TikTok/LinkedIn later. |
| Automations/workflows | Triggers → actions (via events + n8n) | Phase 2 | n8n as the visual engine; events from the bus. |
| Billing and contracts | Quotes, invoices, contracts, e-signature, payments | Phase 2 | Stripe or invoice+link; self-host e-sign (Documenso). |
| AI / Insights | Summaries, lead scoring, content generation, assistants | Phase 2-3 | Cross-cutting; see §9. |
| Advanced attribution | Multi-touch, server-side tracking, data warehouse | Future | When volume justifies it. |
| Marketplace/templates | Reusable report, project, and campaign templates | Future | Lever for "zero marginal cost." |
2.2 What makes the platform "leading" (differentiators)
- Unified per-client timeline: chat + email + leads + campaigns + invoices in a single timeline. Almost no one has this well integrated.
- Automatic white-label reporting: the client sees their dashboard with their brand; the data updates itself from the ad APIs. Zero manual reporting work.
- Creative approvals inside the portal: the client comments and approves; that triggers events (publish, notify, bill).
- AI focused on real value (not decorative chat): summarize conversations, prioritize leads, draft first versions of copy/creatives, explain the "why" behind a ROAS change.
- Everything automatable: every domain event can trigger a workflow in n8n.
3. Data architecture and multi-tenancy
This is the key piece of the system. Here we decide whether scaling to 100+ clients is trivial or a rewrite. The decision is deliberately conservative and proven.
3.1 Conceptual model: shared core + per-project data with the same structure
The owner's requirement —"each project/client has isolated internal data but they ALL share the same structure"— materializes in two planes:
┌──────────────────────────── CORE (shared, cross-tenant) ───────────────────────────────┐
│ organizations · users · memberships · roles · audit_log · api_keys │
│ (the "agency" is one organization; each "client" is another organization) │
└─────────────────────────────────────────────────────────────────────────────────────────┘
│ tenant_id (= client's organization)
▼
┌──────────────────────── PER-TENANT DATA (same structure for all) ───────────────────────┐
│ clients_profile · projects · contacts · companies · leads · deals · tasks · assets │
│ approvals · ad_accounts · campaigns · ad_insights · invoices · contracts · messages │
│ report_dashboards · automations · ai_jobs · events_outbox │
│ │
│ EVERY table in this plane carries: tenant_id (NOT NULL) + enforced RLS │
└─────────────────────────────────────────────────────────────────────────────────────────┘
Key to the model: "same schema for all" = a single table definition. Each client does not have different tables; they have the same tables, filtered by tenant_id. This is what makes the marginal cost of a new client ~0 and what lets a schema improvement benefit everyone at once.
3.2 Comparison of isolation strategies
| Criterion | Row-per-tenant + RLS (recommended) | Schema-per-tenant | Database-per-tenant |
|---|---|---|---|
| Isolation | Strong (enforced in the DB by RLS policy) | Very strong | Maximum |
| Cost per new tenant | ~0 (one more row) | Medium (create schema + migrate) | High (create DB + provision) |
| Migrations | Once, applies to all | N times (one per schema) | N times (one per DB) |
| Scaling to 100 tenants | Trivial | Manageable | Operationally heavy |
| Scaling to 1000+ tenants | Good (with indexes + partitioning) | Hundreds of schemas = catalog pain | Unfeasible without Citus-style orchestration |
| Data leak risk | Low if RLS is done right (deny-by-default + tests) | Low | Almost none |
| Cross-tenant reporting (the agency sees everything) | Trivial (1 query) | Hard (union of N schemas) | Very hard |
| Operation with a team of 2 + Claude Code | Optimal | Overhead | Untenable |
3.3 Recommendation: row-per-tenant with PostgreSQL Row-Level Security (RLS)
Decision: a model of shared rows + tenant_id + enforced RLS in Postgres, with a documented escape path to schema/DB-per-tenant only for an "enterprise" client who may someday require contractual physical isolation.
Reasons:
- It is the 2026 consensus for multi-tenant SaaS: start with shared schema + RLS for cost efficiency and operational simplicity, with a migration path to schema/DB isolation when an enterprise client demands it.
- RLS moves isolation from the app to the database: even if an application query forgets the WHERE tenant_id = ... (a Claude Code bug or ours), Postgres does not return rows from another tenant. Isolation does not depend on never making mistakes.
- Trivial agency reporting: the agency needs to see across tenants (global portfolio, aggregated KPIs). With rows+RLS it is a role that bypasses the policy; with schema/DB-per-tenant it would be a nightmare of joins.
- Migrations only once: a schema change is applied once and benefits all 100 clients. This is decisive for a team of 2.
How isolation is enforced (concrete pattern)
-- 1) Every tenant table carries the column and the policy
ALTER TABLE leads ENABLE ROW LEVEL SECURITY;
ALTER TABLE leads FORCE ROW LEVEL SECURITY; -- applies even to the table owner
CREATE POLICY tenant_isolation ON leads
USING (tenant_id = current_setting('app.tenant_id')::uuid)
WITH CHECK (tenant_id = current_setting('app.tenant_id')::uuid);
-- 2) The app connects with a role WITHOUT RLS bypass and, per request, sets the tenant:
-- SET LOCAL app.tenant_id = '<uuid-of-the-request-tenant>';
-- (LOCAL = lives only within the transaction; impossible to leak between requests)
-- 3) A separate "agency_admin" role may have BYPASSRLS for global reporting,
-- used ONLY by audited internal endpoints.
CI guarantee (non-negotiable): a CI check that fails if any table with a tenant_id column does not have RLS enabled and forced. One forgotten table = a data breach. Also, integration tests that insert as Tenant A and verify that Tenant B gets zero rows.
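The CI guarantee above could be sketched as a pure function over catalog metadata. This is a minimal sketch, not the project's actual check: it assumes a CI script has already queried Postgres and produced one row per table, using the real catalog flags pg_class.relrowsecurity and pg_class.relforcerowsecurity. The names TableRlsStatus and findUnprotectedTables are hypothetical.

```typescript
// Hypothetical row shape: one entry per table, as a CI script might build it
// from pg_class (relrowsecurity, relforcerowsecurity) joined with
// information_schema.columns (presence of a tenant_id column).
type TableRlsStatus = {
  table: string;
  hasTenantId: boolean;
  rlsEnabled: boolean; // pg_class.relrowsecurity
  rlsForced: boolean; // pg_class.relforcerowsecurity
};

// Returns the tenant tables that are not protected: tenant_id is present but
// RLS is not both enabled and forced. An empty result means the check passes.
function findUnprotectedTables(rows: TableRlsStatus[]): string[] {
  return rows
    .filter((r) => r.hasTenantId && !(r.rlsEnabled && r.rlsForced))
    .map((r) => r.table)
    .sort();
}

// Illustrative data only.
const catalogSnapshot: TableRlsStatus[] = [
  { table: "leads", hasTenantId: true, rlsEnabled: true, rlsForced: true },
  { table: "messages", hasTenantId: true, rlsEnabled: true, rlsForced: false },
  { table: "organizations", hasTenantId: false, rlsEnabled: false, rlsForced: false },
];

const unprotected = findUnprotectedTables(catalogSnapshot);
// A CI step would fail the build whenever this list is non-empty.
console.log(unprotected.length === 0 ? "RLS check passed" : `RLS missing on: ${unprotected.join(", ")}`);
```

Core tables without a tenant_id column (such as organizations) are deliberately ignored here; whether they need their own policies is a separate decision.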
3.4 Base model conventions (every tenant table)
id uuid PRIMARY KEY DEFAULT gen_random_uuid() -- (or UUIDv7 for temporal ordering)
tenant_id uuid NOT NULL REFERENCES organizations(id) -- logical partition
created_at timestamptz NOT NULL DEFAULT now()
updated_at timestamptz NOT NULL DEFAULT now() -- update trigger
deleted_at timestamptz NULL -- soft-delete (NULL = alive)
created_by uuid NULL REFERENCES users(id)
version integer NOT NULL DEFAULT 1 -- optimistic locking
- UUIDv7 as id (sortable by time → good indexes, without exposing counts).
- Universal soft-delete: never a physical DELETE on client data without a retention policy; the app's views filter deleted_at IS NULL.
- Audit: an append-only (immutable) audit_log table with who/what/when/before/after, fed by triggers + the application layer.
- Indexes: every tenant-table index starts with (tenant_id, ...) so queries always scope by tenant first.
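The version column in the conventions above supports optimistic locking. A minimal in-memory sketch of the idea, assuming a hypothetical TaskRow type and applyTaskUpdate function; against Postgres the same check is usually expressed as an UPDATE whose WHERE clause includes the expected version, followed by a check of the affected row count.

```typescript
// Hypothetical row type following the base conventions (subset of columns).
type TaskRow = {
  id: string;
  tenantId: string;
  title: string;
  version: number;
  updatedAt: Date;
};

// Applies a patch only if the caller saw the latest version; otherwise throws.
// The returned row carries version + 1 and a fresh updatedAt.
function applyTaskUpdate(
  current: TaskRow,
  expectedVersion: number,
  patch: { title?: string },
  now: Date
): TaskRow {
  if (current.version !== expectedVersion) {
    throw new Error(
      `version conflict on task ${current.id}: expected ${expectedVersion}, found ${current.version}`
    );
  }
  return { ...current, ...patch, version: current.version + 1, updatedAt: now };
}

// Illustrative usage: the first writer wins, the second must reload and retry.
const storedTask: TaskRow = {
  id: "task-1",
  tenantId: "tenant-a",
  title: "Draft brief",
  version: 1,
  updatedAt: new Date(0),
};
const afterFirstEdit = applyTaskUpdate(storedTask, 1, { title: "Final brief" }, new Date(1000));
console.log(afterFirstEdit.version); // 2
```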
3.5 Multi-tenant migration strategy
- Declarative, versioned migrations (Drizzle Kit or Payload's migrations, depending on §4) in packages/db. One migration = once for all tenants.
- Forward-only in production; each migration with its documented rollback.
- Expand → migrate → contract for non-trivial changes (add nullable column → batch backfill in a background job → make it NOT NULL/drop the old one), so deployment never breaks the app live.
- Migrations run in CI/CD (Forgejo Actions) before booting the new version.
3.6 Partitioning and data scale (when)
- Up to ~100 tenants and tables of a few million rows: one Postgres, (tenant_id, ...) indexes, no partitioning. More than enough.
- High-volume tables (ad_insights, messages, events_outbox, audit_log): partition by time range (monthly) once they exceed ~50–100M rows; this enables a cheap drop of old partitions per retention policy.
- At 1000+ tenants or huge tables: option to partition by tenant_id (hash) or introduce Citus (transparent Postgres sharding); this is "more Postgres," not a model change. The data contract (rows + tenant_id) does not change.
4. Service and application architecture
4.1 Decision: modular monolith + workers, not microservices
For a team of 2 people + Claude Code, microservices would be operational suicide (N deployments, N databases, network latency, distributed sagas, multiplied observability). The right choice is a well-structured modular monolith:
- A single primary deployable (the Next.js app + the API/domain layer).
- Domain modules with clear boundaries (folders + import rules + interfaces) that communicate via internal events and through their own APIs, not by reaching into the neighbor's tables.
- Separate workers (same code, different process) for background jobs.
- If one day a module needs to scale independently, it already has its event boundary: extracting it to a service is a bounded refactor, not a rewrite. ("Monolith ready to decompose" pattern.)
┌───────────────────────────────────────────────┐
browser ──HTTPS──► │ Traefik (Coolify proxy, TLS Let's Encrypt) │
└───────────────┬───────────────────────────────┘
┌──────────────────┬─────────┴──────────┬──────────────────────┐
▼ ▼ ▼ ▼
apps/web apps/dashboard apps/portal apps/docs (OpenAPI)
(public web) (internal agency) (end client) (docs + API ref)
└───────┬──────────┴──────────┬──────────┘
▼ ▼
┌───────────────────────────────────────────────────────┐
│ CORE (packages/core + Payload/domain layer) │
│ modules: identity · crm · projects · comms · │
│ reporting · campaigns · billing · ai │
│ ── emits domain events → events_outbox ── │
└───────┬───────────────────────┬───────────────┬────────┘
▼ ▼ ▼
PostgreSQL (RLS) Redis (cache+queue) MinIO/S3 (assets)
│ │
▼ ▼
┌──────────────┐ ┌──────────────────┐
│ Worker(s) │◄────────│ BullMQ (queues) │
│ jobs/sync/ai │ └──────────────────┘
└──────┬───────┘
│ outbox relay → webhooks/events
┌──────┴────────┬───────────────┬─────────────┐
▼ ▼ ▼ ▼
Chatwoot n8n Meta/Google Notion
(chat) (automation) Ads APIs (mirror)
4.2 Monorepo structure (target)
feedback-os/
├── apps/
│ ├── web/ # Public web (Azurio/Next.js). Already exists.
│ ├── dashboard/ # Internal agency app (Next.js).
│ ├── portal/ # White-label client portal (Next.js).
│ ├── api/ # (opt.) dedicated API/Payload server, or colocated in dashboard.
│ ├── worker/ # Background job process(es) (BullMQ workers).
│ └── docs/ # Documentation site + OpenAPI reference (Scalar).
├── packages/
│ ├── db/ # Schema, migrations, Postgres client, RLS helpers.
│ ├── core/ # Domain modules (identity, crm, projects, comms, ...).
│ ├── events/ # Domain event definitions + outbox/relay.
│ ├── api-contracts/ # Shared Zod/OpenAPI schemas (request/response types).
│ ├── integrations/ # Meta/Google/TikTok/Chatwoot/n8n/Notion clients.
│ ├── ui/ # Shared design system (see CLAUDE.md: builder rule does NOT apply here).
│ ├── auth/ # Sessions, RBAC/ABAC, SSO.
│ └── config/ # Shared ESLint/TS/Prettier/env.
└── infra/ # Runbooks, IaC, provisioning scripts.
4.3 The role of Payload CMS: data/admin backbone, not a prison
Decision: use Payload CMS (MIT, full self-host) as the content + admin + auth + ORM/collections + autogenerated API backbone, with its official multi-tenant plugin, but with three safeguards to avoid getting trapped:
- Payload on PostgreSQL (Postgres adapter, not Mongo) → so RLS and SQL remain available for the critical parts. Payload's multi-tenant plugin supports thousands of tenants with the right infrastructure.
- "Hard" business data (campaigns, insights, billing ledger, events) lives in tables managed by packages/db (Drizzle) with RLS, not necessarily as Payload collections. Payload shines for editorial content, assets, admin UI, and access-controlled CRUD; for heavy analytical/financial work we run the SQL ourselves.
- The public API (the one consumed by the portal and third parties) is defined in packages/api-contracts with OpenAPI/Zod, so it does not depend on Payload's internal shape. If Payload is ever replaced, the contractual API stays.
In one sentence: Payload is the system's admin panel and CMS/auth; the financial and analytical truth lives in Postgres with RLS; the public API is our own contract. This gives the best of both: Payload's speed + control and independence.
Alternative considered and rejected as the single backbone: a 100% custom backend (Hono/Nest + Drizzle, without Payload). It gives total control but forces us to build admin, auth, RBAC, media, i18n, and CRUD by hand → weeks of work that Payload gives for free. We use it only alongside Payload for the parts where Payload does not fit, not in its place.
4.4 Domain layer and events
- Each module exposes use cases (application functions) that: validate (Zod), apply RBAC/ABAC, execute in a transaction that sets app.tenant_id, persist, and write an event to events_outbox within the same transaction (outbox pattern → zero lost events).
- A relay (in apps/worker) reads the outbox and publishes events to: BullMQ (jobs), n8n webhooks, and internal subscribers. Delivery is at-least-once, so the relay and its subscribers must be idempotent: re-publishing an event must not cause duplicate side effects.
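The outbox flow described above can be sketched in memory. This is an illustration under stated assumptions, not the project's implementation: InMemoryStore stands in for a Postgres transaction, everything is synchronous for brevity (a real relay would poll the events_outbox table and await network calls), and all type and function names are hypothetical.

```typescript
type OutboxEvent = {
  id: string;
  tenantId: string;
  type: string;
  payload: Record<string, unknown>;
  publishedAt: number | null; // null = not yet relayed
};
type Lead = { id: string; tenantId: string; email: string };
type Tx = { leads: Lead[]; outbox: OutboxEvent[] };

class InMemoryStore {
  leads: Lead[] = [];
  outbox: OutboxEvent[] = [];

  // Stand-in for a DB transaction: all writes are applied, or none are.
  transaction(work: (tx: Tx) => void): void {
    const tx: Tx = { leads: [...this.leads], outbox: [...this.outbox] };
    work(tx); // if this throws, the assignments below never run
    this.leads = tx.leads;
    this.outbox = tx.outbox;
  }
}

// Use case: persist the lead and its domain event in the same transaction.
function createLead(store: InMemoryStore, lead: Lead): void {
  store.transaction((tx) => {
    tx.leads.push(lead);
    tx.outbox.push({
      id: `evt-${lead.id}`,
      tenantId: lead.tenantId,
      type: "lead.created",
      payload: { leadId: lead.id },
      publishedAt: null,
    });
  });
}

// Relay: publish pending events, then mark them. A crash between the two
// steps causes a re-send, which is why subscribers dedupe by event id.
function relayOnce(store: InMemoryStore, publish: (e: OutboxEvent) => void): number {
  let sent = 0;
  for (const e of store.outbox) {
    if (e.publishedAt !== null) continue;
    publish(e);
    e.publishedAt = Date.now();
    sent++;
  }
  return sent;
}

// Idempotent subscriber wrapper: a repeated event id is ignored.
function makeIdempotentHandler(handle: (e: OutboxEvent) => void): (e: OutboxEvent) => void {
  const seen = new Set<string>();
  return (e) => {
    if (seen.has(e.id)) return;
    seen.add(e.id);
    handle(e);
  };
}
```

The key property is that the business row and the event are committed together, so an event exists if and only if the state change happened.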
5. API and integration contracts
5.1 API style
| Aspect | Decision |
|---|---|
| Primary style | REST + JSON, per-tenant resources, standard HTTP verbs. |
| GraphQL | Yes, but scoped: Payload exposes GraphQL for flexible reads of the internal dashboard. The public/integration API is REST (simpler to version, cache, and hand to third parties). |
| Definition | OpenAPI 3.1 autogenerated from Zod schemas (packages/api-contracts). A single source → TS types + runtime validation + spec. |
| Documentation | Scalar (interactive reference, MIT) served at apps/docs → docs.feedback-studios.com. Always up to date because it is generated on each build from the contract. |
| Versioning | /v1/ prefix. Breaking changes → /v2/. Additive changes do not break. Deprecation policy with a Sunset header. |
| Authentication | Sessions (httpOnly cookie) for our own apps; API keys/OAuth2 per tenant for integrations and third parties. |
| Idempotency | Idempotency-Key header on POSTs that create resources or trigger external actions (payments, publish ads). |
| Pagination | Cursor-based (?cursor=&limit=) by default; stable and efficient at scale. |
| Rate-limiting | Per tenant and per API key (token bucket in Redis). Generous internal limits, strict external ones. |
| Errors | Uniform RFC 9457 (Problem Details) format. |
| Webhooks | Outbound, signed (HMAC) per tenant; retries with backoff; deliverable to n8n and to the client's systems. |
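The per-tenant rate limiting row above names a token bucket. A minimal in-memory sketch of that algorithm follows; the document keeps this state in Redis, whereas this version uses a Map so it runs standalone. The class name, the key format, and the limits are illustrative assumptions.

```typescript
type Bucket = { tokens: number; lastRefillMs: number };

class TokenBucketLimiter {
  private readonly capacity: number; // maximum burst size
  private readonly refillPerSecond: number; // sustained request rate
  private readonly buckets = new Map<string, Bucket>();

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  // key could be `${tenantId}:${apiKeyId}`; nowMs is injected so the logic
  // is deterministic and testable.
  tryConsume(key: string, nowMs: number): boolean {
    const b = this.buckets.get(key) ?? { tokens: this.capacity, lastRefillMs: nowMs };
    const elapsedSec = Math.max(0, nowMs - b.lastRefillMs) / 1000;
    // Refill proportionally to elapsed time, never above capacity.
    b.tokens = Math.min(this.capacity, b.tokens + elapsedSec * this.refillPerSecond);
    b.lastRefillMs = nowMs;
    const allowed = b.tokens >= 1;
    if (allowed) b.tokens -= 1;
    this.buckets.set(key, b);
    return allowed;
  }
}

// Burst of 2, refilling 1 token per second.
const demoLimiter = new TokenBucketLimiter(2, 1);
const demoKey = "tenant-a:key-1";
console.log(
  demoLimiter.tryConsume(demoKey, 0),
  demoLimiter.tryConsume(demoKey, 0),
  demoLimiter.tryConsume(demoKey, 0)
); // true true false
```

With several app replicas, the read-refill-consume step would need to be atomic in the shared store; how that is done in Redis is left open here.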
5.2 Integration layer (anti-corruption)
All integrations live in packages/integrations behind our own interfaces (anti-corruption layer pattern): the domain talks about "Campaign" and "Insight" in our terms; the adapter translates to/from the external API. Thus, switching providers or Meta changing its API does not contaminate the core.
| Integration | Direction | Mechanism | Notes |
|---|---|---|---|
| Meta Ads | pull insights, push campaigns | Graph API + sync jobs | Per-tenant tokens encrypted; scheduled sync. |
| Google Ads | pull insights | API + jobs | Same; normalize to ad_insights. |
| TikTok / LinkedIn Ads | pull insights | API + jobs | Phase 2-3; same normalized shape. |
| Chatwoot | bidirectional | Webhooks + API | Messages → client timeline; create/update contact. |
| n8n | output (trigger) and input (webhook) | Signed webhooks + API | n8n is the visual automation engine; it reacts to domain events. |
| Notion | outbound mirror | API | Notion stops being the truth; it syncs from Postgres for whoever still uses it. |
| Email | inbound and outbound | SMTP/IMAP + transactional provider | See §6.5 and §11. |
Golden rule: integrations react to events and write via domain use cases (which apply RLS and audit). They never touch tables directly.
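The anti-corruption layer could look roughly like the sketch below: the domain depends on its own AdInsight type and an InsightSource port, and each provider gets an adapter. All names are hypothetical, and ExternalInsightRow is an assumed shape for illustration (numeric values delivered as strings), not the actual contract of any ad API.

```typescript
// Domain type in our own terms (hypothetical subset of ad_insights).
type AdInsight = {
  tenantId: string;
  provider: "meta" | "google";
  campaignId: string;
  date: string; // ISO date, YYYY-MM-DD
  spend: number;
  impressions: number;
  clicks: number;
};

// The port the domain depends on; one adapter per provider implements it.
interface InsightSource<Raw> {
  readonly provider: AdInsight["provider"];
  toDomain(tenantId: string, raw: Raw): AdInsight;
}

// ASSUMED external row shape, for illustration only.
type ExternalInsightRow = {
  campaign_id: string;
  date_start: string;
  spend: string;
  impressions: string;
  clicks: string;
};

// Rejects empty or non-numeric values instead of silently storing NaN or 0.
function parseMetric(field: string, value: string): number {
  const n = Number(value);
  if (value.trim() === "" || !Number.isFinite(n)) {
    throw new Error(`invalid ${field}: "${value}"`);
  }
  return n;
}

const metaAdapter: InsightSource<ExternalInsightRow> = {
  provider: "meta",
  toDomain(tenantId, raw) {
    return {
      tenantId,
      provider: "meta",
      campaignId: raw.campaign_id,
      date: raw.date_start,
      spend: parseMetric("spend", raw.spend),
      impressions: parseMetric("impressions", raw.impressions),
      clicks: parseMetric("clicks", raw.clicks),
    };
  },
};
```

If a provider changes its payload, only the adapter and its raw type change; AdInsight and everything downstream stay as they are.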
6. Infrastructure and scaling
6.1 Starting point and philosophy
Today: VPS Vidot (IONOS, Ubuntu 24.04, 4 vCPU / 8 GB / 232 GB), Coolify orchestrating containers, Forgejo (git), Next.js web, docs, + n8n and Chatwoot on separate hosts. The goal is to scale vertically and then horizontally without rebuilding, measuring before spending.
6.2 Target components
┌──────────────────────────── VPS / NODE(S) (Coolify) ─────────────────────────────┐
│ Traefik (TLS) │
│ apps: web · dashboard · portal · docs · api · worker(s) │
│ PostgreSQL 17 ── PgBouncer (pooling) ── read replica(s) (when needed) │
│ Redis 7 (cache + BullMQ queues + rate-limit + sessions) │
│ MinIO (self-host S3) for assets ──► (CDN in front to serve media) │
│ Observability: OpenTelemetry Collector → SigNoz (logs+metrics+traces) │
└──────────────────────────────────────────────────────────────────────────────────┘
6.3 Scaling plan per component (with approximate numbers)
| Component | Today | Signal to scale | Scaling action | Realistic ceiling |
|---|---|---|---|---|
| App (Next.js/API) | 1 container | Sustained CPU >70% or high p95 latency | Add replicas (stateless) behind Traefik | Tens of replicas; horizontal is trivial |
| Postgres | 1 shared 8GB instance | High RAM/IO; slow queries | (1) more vCPU/RAM on the VPS → (2) VPS dedicated to Postgres → (3) read replicas → (4) partitioning/Citus | 1 well-tuned node handles hundreds of small/medium tenants |
| PgBouncer | — | >~100 concurrent connections | Introduce it right away (pool in transaction mode); Postgres must not see thousands of connections | Thousands of logical clients over few real connections |
| Redis | — | Need for cache/queues (already in MVP) | Dedicated instance; later replica/persistence | Very high for this use |
| Queues/jobs | — | Ad sync, AI, email jobs | BullMQ (Redis) + apps/worker; scale the number of workers | Tens of thousands of jobs/min with several workers |
| Assets | VPS volume | Media growth / bandwidth | MinIO + CDN in front (Bunny/Cloudflare) | TB without touching the app |
| Orchestration | Coolify, 1 node | Need for >1 node / HA | Coolify multi-node (supports several servers) → if real HA is needed, Nomad or k3s | See §6.4 |
Estimated capacity of a single reasonable node
A VPS of 8 vCPU / 16–32 GB with tuned Postgres + PgBouncer + Redis + a replicated app comfortably serves 100+ tenants for an agency (this is not mass-consumer SaaS: the traffic is the team + clients, not millions of users). The bottleneck will arrive at Postgres (IO/RAM) long before the app. That is why the first scaling investment is to move Postgres to its own node and give it RAM.
6.4 When to migrate from Coolify to Kubernetes/Nomad?
Honest recommendation: do NOT migrate to Kubernetes unless there is a real need. For 2 people, k8s is a huge operational cost. Path:
- Today → ~50 tenants: Coolify, one node. Vertical scaling. Enough.
- ~50–150 tenants / need to isolate Postgres: Coolify multi-node (app on one node, Postgres on another). Coolify manages several servers natively.
- Need for HA, fine-grained autoscaling, or many services: evaluate Nomad (simpler than k8s, fits a small team) or k3s. Only if the numbers call for it.
Since the contract (containers + Postgres + Redis + S3) does not change, this migration is one of orchestration, not architecture. That is exactly "scaling without rebuilding."
6.5 Environments and CI/CD
- Environments: dev (local, Docker Compose) · staging (Coolify, staging.* domain) · prod (Coolify). Same images promoted.
- CI/CD: Forgejo Actions (we already have Forgejo): lint + typecheck + tests (incl. RLS isolation test) + build → migrate DB → deploy via the Coolify API. Auto-deploy on merge to main (pending wiring, see runbook 03).
- Migrations run in the pipeline before booting the new version.
7. Security, identity, and compliance
7.1 Single identity
- A single identity (users) for the whole ecosystem. A user belongs to one or several organizations via memberships with a role.
- Organization types: agency (Feedback Studios) and client (each client). The agency has "bridge" memberships that let it operate over client tenants with audited permissions.
- Auth: httpOnly cookie sessions + (future) SSO/OAuth2 (Google login for the team). For portal clients: email+password + magic link / passkeys.
- 2FA/MFA mandatory for agency roles.
7.2 Multi-tenant RBAC + ABAC
- RBAC (roles): agency_owner, agency_member, client_admin, client_viewer, etc.
- ABAC (attributes): permissions per project and per resource (e.g., an agency_member only sees assigned projects; a client_viewer only sees published reports of THEIR tenant). Rules evaluated in the packages/auth layer, in addition to RLS in the DB.
- Defense in depth: RLS (DB) + permission check (app) + tenant validation on every request (SET LOCAL app.tenant_id). Three layers; none trusts the other.
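The application-layer permission check could combine role and attributes roughly as follows. This is a sketch under assumptions: the Actor and Report shapes and the canViewReport function are hypothetical, and the rules for agency_owner and client_admin are guesses for illustration, since only the agency_member and client_viewer rules are stated above.

```typescript
type Role = "agency_owner" | "agency_member" | "client_admin" | "client_viewer";

type Actor = {
  userId: string;
  role: Role;
  tenantIds: string[]; // tenants reachable via memberships (incl. "bridge" ones)
  assignedProjectIds: string[];
};

type Report = { tenantId: string; projectId: string; published: boolean };

function canViewReport(actor: Actor, report: Report): boolean {
  // Tenant validation first: deny by default when there is no membership.
  if (!actor.tenantIds.includes(report.tenantId)) return false;

  switch (actor.role) {
    case "agency_owner":
      return true; // assumption: owners see every report in bridged tenants
    case "agency_member":
      return actor.assignedProjectIds.includes(report.projectId);
    case "client_admin":
      return true; // assumption: admins also see unpublished drafts
    case "client_viewer":
      return report.published;
    default:
      return false; // unknown roles get nothing
  }
}
```

This check runs in addition to RLS, not instead of it: even if this function had a bug, the database policy would still scope rows to the request's tenant.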
7.3 Secrets, encryption, audit
- Secrets out of git (already policy). Environment variables managed by Coolify; at scale, consider a lightweight vault (Infisical self-host or OpenBao) for rotation.
- Encryption in transit (TLS everywhere) and at rest for sensitive data: each tenant's ad tokens encrypted at the column level (envelope encryption), encrypted backups.
- Immutable audit (audit_log): every sensitive action (login, permission change, agency access to client data, exports, financial changes) is recorded with actor, tenant, IP, before/after.
7.4 GDPR / privacy (EU + US clients)
- Legal basis and residency: data on an EU VPS (IONOS); define the controller and processor roles for each processing activity (the agency is the processor with respect to its clients' data).
- ARCO/GDPR rights built in: export and deletion by tenant/contact implemented as a capability (not as a manual favor). Soft-delete + scheduled purge per retention policy.
- Minimization and retention: policies per data type (messages, insights, logs) with TTL and partitions that get dropped.
- DPA and consent: record lead consent (source, timestamp). The web's cookies/tracking compliant with regulation (self-host analytics like Umami/Plausible already in place).
7.5 Backups and disaster recovery
- Postgres backups: automatic daily (Coolify) + WAL archiving for point-in-time recovery as the data's value grows.
- Off-site (justified cost exception): copy encrypted backups to cheap external object storage (Backblaze B2 / S3, cents/GB). A backup that lives only on the same VPS is not a backup. This small recurring cost is honestly justifiable.
- DR target: RPO ≤ 24h (improvable to minutes with PITR), RTO ≤ 4h. A tested restoration runbook (real periodic restore, not just "the backup exists").
- Assets (MinIO): replication/copy to external storage.
8. Observability and reliability
8.1 Recommended stack: SigNoz (self-host, OpenTelemetry-native)
Decision: instrument everything with OpenTelemetry (logs + metrics + traces) and centralize in SigNoz self-hosted.
Reason: for a small team, SigNoz offers a unified stack (replacing Loki+Tempo+Mimir+Grafana in a single product) native to OpenTelemetry and with no licensing fee when self-hosted. By instrumenting with OTel, there is no lock-in: if we migrate one day to Grafana LGTM or to a managed offering, the instrumentation is preserved. (Valid alternative: OpenObserve, a single binary with S3 storage, even lighter; or the classic Grafana LGTM if you want the most battle-tested option. Any of them works as long as the base is OTel.)
8.2 What we measure
- Structured logs (JSON) with correlated tenant_id / request_id / trace_id.
- Metrics: p50/p95/p99 latency per endpoint, errors, throughput, BullMQ queue depth, ad sync lag, job health.
- Traces: request → domain → DB → external integration, to debug end-to-end.
- Error tracking: Sentry self-host (or GlitchTip, lighter) to group exceptions with context. (Possible cost exception: managed Sentry free tier if self-host weighs too heavily on the team — see §11.)
8.3 SLOs, alerts, health checks
- Initial SLOs: API availability 99.5%; p95 < 500 ms on read endpoints; ad sync completed < 1h after the day's close.
- Alerts (to Chatwoot/Telegram/email): health check failure, error rate > threshold, stuck queue, failed backup, disk/RAM at the limit, certificate about to expire.
- Health checks: /healthz (liveness) and /readyz (readiness: DB + Redis + S3) that Coolify/Traefik query.
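As a worked example of what the 99.5% availability SLO above allows, the function below converts an SLO into an error budget. The 30-day window is an assumption; the document does not state the measurement period, and errorBudgetMinutes is a hypothetical helper name.

```typescript
// Minutes of tolerated unavailability for a given SLO over a window of days.
// The result is rounded to two decimals to hide floating-point noise.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return Math.round(totalMinutes * (100 - sloPercent)) / 100;
}

const budget = errorBudgetMinutes(99.5, 30);
console.log(`${budget} minutes (${budget / 60} hours) per 30 days`); // 216 minutes (3.6 hours) per 30 days
```

In other words, the initial SLO tolerates about 3.6 hours of downtime in a 30-day month, which helps decide when a "health check failure" alert should page someone.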
9. AI in the platform
Principle: AI where it adds measurable value, not because it is fashionable. Every AI feature has a clear "job" and can be turned off.
9.1 Prioritized use cases (highest to lowest ROI)
| Case | Value | How |
|---|---|---|
| Conversation summary (Chatwoot/email) | Saves hours; instant client context | A job that summarizes threads and attaches them to the timeline. |
| Lead scoring/prioritization | Sell more by focusing effort | Model + rules over CRM data; writes lead.score. |
| Narrated reporting | Differentiator: the dashboard "explains" the ROAS | AI drafts the "why" behind the period's figures. |
| Copy/creative generation (first draft) | Speeds up production | Brief → copy/image variants; human review always. |
| Internal assistant (ask the data) | "Which clients dropped ROAS this month?" | Natural-language query over the API with the user's permissions. |
| Client portal assistant | Self-service | Scoped to THEIR tenant, read-only. |
9.2 How it integrates (AI architecture)
- "AI jobs" pattern: AI tasks are background jobs (BullMQ), not synchronous calls in the request. An ai_jobs table with status, cost, and auditable result.
- Models: the Claude API (Anthropic) for quality reasoning/summaries/copy; the option of local models (Ollama on the VPS) for cheap/sensitive tasks when the hardware allows. Abstracted behind packages/integrations/ai to switch providers without touching the domain.
- Data privacy: the AI respects the tenant (only sees that tenant's data); prompts and outputs are recorded in ai_jobs (audit); sensitive data is redacted/anonymized before leaving; client consent to process their data with AI.
- Cost: budget per tenant and per job; result caching; use small/local models for the trivial and large models only where quality matters.
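The "budget per tenant and per job" rule above could be enforced before a job is queued. A minimal sketch, assuming costs are tracked in integer cents and that the month's spend so far is read from ai_jobs; the type names, the admitAiJob function, and the limits are all hypothetical.

```typescript
type AiJobRequest = { tenantId: string; kind: string; estimatedCostCents: number };
type AiBudgetPolicy = { perJobMaxCents: number; perTenantMonthlyCents: number };
type Admission = { admitted: true } | { admitted: false; reason: string };

// Decides whether a job may be enqueued. spentThisMonthCents would come from
// summing the recorded cost of the tenant's jobs for the current month.
function admitAiJob(
  job: AiJobRequest,
  spentThisMonthCents: number,
  policy: AiBudgetPolicy
): Admission {
  if (job.estimatedCostCents > policy.perJobMaxCents) {
    return { admitted: false, reason: "per-job budget exceeded" };
  }
  if (spentThisMonthCents + job.estimatedCostCents > policy.perTenantMonthlyCents) {
    return { admitted: false, reason: "tenant monthly budget exceeded" };
  }
  return { admitted: true };
}

// Illustrative limits: 0.50 per job, 10.00 per tenant per month.
const demoPolicy: AiBudgetPolicy = { perJobMaxCents: 50, perTenantMonthlyCents: 1000 };
const demoDecision = admitAiJob(
  { tenantId: "tenant-a", kind: "conversation_summary", estimatedCostCents: 40 },
  980,
  demoPolicy
);
console.log(demoDecision.admitted); // false
```

A rejected job could be retried with a smaller or local model, which fits the rule of reserving large models for where quality matters.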
9.3 Claude Code as part of the operational AI
The system itself is designed so that Claude Code can operate it: declarative schemas, OpenAPI always up to date, versioned migrations, isolation tests. This turns AI into a development lever, not just a product feature.
10. Phased roadmap
Realistic for 2 people + Claude Code. Each phase leaves something sellable and does not break what came before. It builds on the existing plan (PLATFORM-PLAN.md).
Phase 0 — Foundations (already in progress / immediate)
- Public web on Coolify (done/in progress). Forgejo + basic CI/CD.
- Add: PgBouncer + Redis + MinIO to the Coolify stack; minimal OTel+SigNoz; encrypted off-site backups. Unblocks: everything else (data, queues, assets, observability).
Phase 1 — Multi-tenant core + Identity + CRM (internal MVP)
- packages/db with base schema, enforced RLS + isolation test in CI.
- Payload (Postgres) + multi-tenant plugin for admin/auth/CRUD.
- Modules identity, crm, projects, comms.
- apps/dashboard progressively replaces the PHP dashboard and the use of Notion as the truth.
- Chatwoot connected to the unified timeline.
- Unblocks: a single source of truth for clients/leads/projects.
Phase 2 — Reporting + Client portal (sellable MVP)
- Meta + Google Ads integration (sync jobs → ad_insights).
- Per-client dashboards (ROAS, KPIs) + basic narrated reporting.
- apps/portal white-label: the client sees reports and approves creatives.
- Public API v1 + OpenAPI/Scalar published in docs.
- Unblocks: the "growth partners with data" pitch and client self-service. It is the first release that can be sold as a product.
Phase 3 — Automations + Billing + applied AI
- Event bus (outbox+relay) wired to n8n; per-event workflows.
- Billing/contracts (quotes, invoices, e-sign with Documenso, payments).
- AI: lead scoring, summaries, internal assistant.
- TikTok/LinkedIn Ads; assets/DAM with versions.
- Unblocks: operation with almost no manual work; upsell.
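The "outbox+relay" event bus in Phase 3 can be sketched as follows. This is a minimal in-memory illustration, not the planned implementation: the OutboxRow shape, the relayOutbox function, and the example event names are hypothetical. In a real system the row would be inserted in the same database transaction as the domain change, and publish would call n8n or a queue asynchronously.

```typescript
// Hypothetical outbox row; the real table layout is not specified in the ADR.
interface OutboxRow {
  id: number;               // monotonically increasing, defines delivery order
  tenantId: string;
  type: string;             // e.g. "invoice.paid" (illustrative event name)
  payload: unknown;
  publishedAt: Date | null; // null = not yet relayed
}

// Returns true when the downstream consumer accepted the event.
type Publish = (event: OutboxRow) => boolean;

// One relay tick: publish pending rows in id order and mark them as published.
// Stops at the first failure so that ordering is preserved; the failed row and
// everything after it are retried on the next tick (at-least-once delivery).
function relayOutbox(rows: OutboxRow[], publish: Publish, now: Date = new Date()): number {
  const pending = rows
    .filter((r) => r.publishedAt === null)
    .sort((a, b) => a.id - b.id);

  let delivered = 0;
  for (const row of pending) {
    if (!publish(row)) break;
    row.publishedAt = now;
    delivered++;
  }
  return delivered;
}
```

Because a row can be published and then fail to be marked (for example, a crash between the two steps), consumers may see an event more than once. Per-event workflows should therefore be idempotent, for instance by deduplicating on the row id.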
Phase 4 — Scale and polish
- Move Postgres to its own node + read replica; partition large tables.
- Coolify multi-node if needed; HA where the SLO requires it.
- Advanced attribution / warehouse if volume justifies it.
- Template marketplace (zero marginal cost).
Dependencies (what unblocks what)
Phase 0 (infra) ──► Phase 1 (data+identity+CRM) ──► Phase 2 (reporting+portal = SELLABLE)
                                                                  │
                                                                  ▼
                                                    Phase 3 (automation+billing+AI)
                                                                  │
                                                                  ▼
                                                    Phase 4 (scale/HA/advanced)
11. Honest risks and trade-offs
| Risk / tension | Reality | Mitigation / decision |
|---|---|---|
| "Self-host without fees" vs deliverable email | 100% own SMTP ends up in spam; invoice/portal email cannot fail. | Justified exception: use a transactional provider for critical outbound email. Options: cheap managed (Resend/Postmark, pay-per-use, no per-seat) or serious self-host (Postal/Maddy + IP with reputation). Recommendation: low-cost transactional provider for deliverability; self-host only if you take on maintaining IP reputation. |
| Backups only on the VPS | If the VPS dies, the backups die. | Justified exception: encrypted off-site in external object storage (B2/S3), cents/GB. Non-negotiable. |
| Self-hosted Sentry/observability is heavy | Self-hosted SigNoz/Sentry consume RAM and maintenance time. | Start light (GlitchTip/OpenObserve). If it becomes a burden, the managed Sentry free tier is acceptable (not abusively per-seat). OTel avoids lock-in. |
| Poorly implemented RLS = data leak | The biggest risk of the chosen model. | CI that requires RLS on every table with tenant_id + A/B isolation tests + FORCE RLS + an app role without BYPASSRLS. Three layers (RLS+RBAC+SET LOCAL). |
| Lock-in to Payload | If Payload becomes limiting, migrating hurts. | Postgres underneath (not Mongo), critical data in our own tables with Drizzle, public API as our own contract independent of Payload. |
| Team of 2 + big ambition | Risk of over-engineering and not finishing. | Modular monolith (not microservices), sellable phases, Claude Code as a multiplier, "build what is necessary, do not gold-plate" (aligned with CLAUDE.md). |
| Postgres as the single database node | Bottleneck and single point of failure (SPOF). | PgBouncer right away; read replica + PITR as it grows; tested DR. The model does not change when scaling. |
| Hidden cost of self-host | Operation time = real cost even without a fee. | Consciously assumed; observability + runbooks + automation with Claude Code reduce that time. |
Honest recurring total at medium scale: VPS (already exists, perhaps a larger one or a second node for Postgres) + domain + transactional email (pay-per-use) + off-site backups (cents/GB). On the order of tens of €/month, no per-seat, no feature-gating. Consistent with the owner's philosophy and, at the same time, robust at scale.
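The RLS mitigation in the table above ("CI that requires RLS on every table with tenant_id + FORCE RLS + an app role without BYPASSRLS") could be checked with a pure function like this sketch. The TableMeta and RoleMeta shapes and the function name are hypothetical. A real check would fill them from the Postgres catalogs (pg_class.relrowsecurity, pg_class.relforcerowsecurity, pg_roles.rolbypassrls) and fail the pipeline when the returned list is non-empty.

```typescript
// Hypothetical metadata shapes, assumed to be loaded from the Postgres
// catalogs by a separate query step that is not shown here.
interface TableMeta {
  name: string;
  columns: string[];
  rlsEnabled: boolean; // pg_class.relrowsecurity
  rlsForced: boolean;  // pg_class.relforcerowsecurity
}

interface RoleMeta {
  name: string;
  bypassRls: boolean;  // pg_roles.rolbypassrls
}

// Returns one human-readable message per violation; an empty array means pass.
function findRlsViolations(tables: TableMeta[], appRole: RoleMeta): string[] {
  const violations: string[] = [];

  for (const table of tables) {
    // Tables without tenant_id are treated as global and skipped.
    if (table.columns.indexOf("tenant_id") === -1) continue;

    if (!table.rlsEnabled) {
      violations.push(`${table.name}: row-level security is not enabled`);
    } else if (!table.rlsForced) {
      // Without FORCE, the table owner bypasses the policies.
      violations.push(`${table.name}: FORCE ROW LEVEL SECURITY is missing`);
    }
  }

  if (appRole.bypassRls) {
    violations.push(`role ${appRole.name}: must not have BYPASSRLS`);
  }
  return violations;
}
```

This covers only the static half of the mitigation. The A/B isolation test (tenant A querying and seeing none of tenant B's rows) still needs a live database.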
12. Executive summary of decisions
- Multi-tenancy = row-per-tenant + enforced RLS in PostgreSQL. Same schema for all, isolation in the DB (not in the code), trivial agency reporting, migrations only once, marginal cost per client ~0. Escape route to schema/DB-per-tenant only for a future enterprise.
- Service architecture = modular monolith + workers, with clearly bounded domain modules and events (outbox). No microservices for a team of 2. Ready to decompose if one day needed.
- Payload CMS (on Postgres) as the admin/auth/CMS backbone, but with the critical financial/analytical data in our own tables with RLS and a contractual public API (OpenAPI/Zod) independent of Payload → speed without lock-in.
- API-first + event-driven: public REST/JSON (autogenerated OpenAPI 3.1, living docs with Scalar), GraphQL scoped to the internal dashboard, idempotency, cursor pagination, signed webhooks. Integrations behind an anti-corruption layer.
- Infra: Coolify one node → multi-node, vertical scaling first. PgBouncer + Redis + MinIO + CDN from the MVP. Move Postgres to its own node and a read replica when the numbers call for it. k8s/Nomad only if real HA requires it. Scaling = more machine/nodes, not rebuilding.
- Layered security: RLS + RBAC/ABAC + SET LOCAL app.tenant_id per request; secrets out of git; per-column token encryption; immutable audit; GDPR built in (export/deletion by tenant); encrypted off-site backups with tested DR.
- OpenTelemetry observability → SigNoz self-host (no lock-in), GlitchTip/Sentry error tracking, SLOs and alerts in place before the first paying client.
- AI as auditable, tenant-aware "jobs": summaries, lead scoring, narrated reporting, copy generation (with human review). Claude API + local option (Ollama), behind a provider abstraction.
- Honest, bounded cost exceptions: deliverable transactional email and off-site backups (and optionally error tracking). Everything else, self-host. Tens of €/month, no per-seat.
- Roadmap in sellable phases: Phase 2 (reporting + client portal) is the first commercializable release; no phase breaks the previous one.
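The "SET LOCAL app.tenant_id per request" item in the security decision above has one practical wrinkle: the SET LOCAL statement does not accept bind parameters, so the parameterized equivalent set_config(name, value, true) is commonly used instead. Its third argument makes the setting transaction-local. The sketch below only builds the statement sequence for one request. The function name, the Query shape, and the assumption that tenant ids are UUIDs are all hypothetical.

```typescript
// A statement plus its bind values, in the shape most Postgres clients accept.
interface Query {
  text: string;
  values: unknown[];
}

// Assumption: tenant ids are UUIDs. Adjust the check if they are not.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Wraps the request's queries in a transaction whose first statement pins the
// tenant. Because the setting is transaction-local, it disappears at COMMIT or
// ROLLBACK and cannot leak to the next request on a pooled connection.
function tenantScopedStatements(tenantId: string, work: Query[]): Query[] {
  if (!UUID_RE.test(tenantId)) {
    throw new Error("invalid tenant id");
  }
  return [
    { text: "BEGIN", values: [] },
    { text: "SELECT set_config('app.tenant_id', $1, true)", values: [tenantId] },
    ...work,
    { text: "COMMIT", values: [] },
  ];
}
```

On the database side, a policy would then compare each row against the setting, for example USING (tenant_id = current_setting('app.tenant_id')::uuid), assuming a uuid column. The transaction-local scope is also what keeps this pattern compatible with PgBouncer in transaction pooling mode.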
Each one in one line
- Recommended multi-tenant model: shared row-per-tenant with tenant_id and enforced Row-Level Security in PostgreSQL (one schema for all), with an escape path to schema/DB-per-tenant for a future enterprise.
- Recommended service architecture: modular monolith in TypeScript (Next.js + Payload on Postgres) with domain modules, events via outbox, and background workers, deployed by Coolify and ready to decompose if needed.
Sources (2026 best practices consulted)
- Multi-tenant RLS vs schema-per-tenant in Postgres (2026 consensus: start shared+RLS, escape to schema/DB for enterprise): propelius.tech, thenile.dev, oneuptime.com, 1xapi.com
- Payload CMS multi-tenant at scale (500+/thousands of tenants with the right infra): payloadcms.com/docs/plugins/multi-tenant, execudea.com
- Job queues (BullMQ/Redis vs pg-boss): pkgpulse.com, bullmq.io
- OTel-native self-host observability (SigNoz/OpenObserve vs LGTM): signoz.io, parseable.com