SoConnective — Documentation

Date: 2026-06-09
Status: Proposal (master design) — supersedes and extends ADR 0001
Authors: Claude Code (development team) · Pamela (product)
Audience: Feedback Studios team (2 people) + Claude Code as the engineering team.

Master architecture document. It does not describe "what exists today," but rather the target system and the path to reach it without rebuilding anything when there are 100+ clients. It is intentionally ambitious: the stated goal is to be the best marketing agency platform on the market, self-host-first, with no recurring fees except where honestly justifiable at scale.

Vision and architecture principles
Product scope and capabilities
Data architecture and multi-tenancy
Service and application architecture
API and integration contracts
Infrastructure and scaling
Security, identity, and compliance
Observability and reliability
AI in the platform
Phased roadmap
Honest risks and trade-offs
Executive summary of decisions

1. Vision and architecture principles

1.1 What "the best agency system" means

It is not "many features." It is a single business operating system where:

A lead arriving through the website, a message in Chatwoot, a Meta Ads campaign, an invoice, and a project task all live in the same data graph, with a single client identity and a single history.
The agency proves value with data (ROAS, attribution, per-client reporting) instead of attaching screenshots.
The end client logs in to a branded portal and sees their campaigns, approves creatives, and pays, without loose emails.
Everything is automatable (n8n + events) and buildable by Claude Code (a single language, a single repo, typed contracts).

The competitive advantage is not the code: it is total integration + unified data + marginal cost near zero per new client.

1.2 Principles (non-negotiable)

#	Principle	Concrete implication
P1	Multi-tenant from day 1	`tenant_id` on every piece of client data; isolation enforced in the DB, not in application code.
P2	Single source of truth	A single relational database (PostgreSQL) as the source; Notion/Chatwoot/n8n are mirrors/actuators, never the truth.
P3	API-first	Every capability exists first as a typed API before becoming a screen. The UI consumes its own public API.
P4	Event-driven	State changes emit domain events (outbox). Integrations react to events; they are not coupled to the monolith.
P5	Self-host-first, pragmatic	Open-source on the VPS by default. We pay only where self-hosting puts the business at risk (deliverable email, off-site backups, perhaps errors). Documented and reversible.
P6	Scalable to 100+ (and 1000+) without rebuilding	Decisions that do NOT change shape as we grow: the multi-tenant model, the API contract, and the event bus are chosen so that scaling is "more machine/more nodes," not "rewrite."
P7	Secure by default	Deny-by-default, RLS in Postgres, secrets out of git, immutable audit, GDPR built in.
P8	Observable	Centralized logs/metrics/traces and SLOs from before the first paying client.
P9	Automatable with Claude Code	TypeScript end-to-end, declarative schemas, autogenerated OpenAPI, versioned migrations, tests. The system documents itself so the AI can operate it.
P10	Low marginal cost per tenant	A new client = new rows + config, not new infrastructure.

1.3 The central tension (and how it is resolved)

"Self-hosted without subscriptions" vs "the best system on the market at scale."

It is resolved with one rule: self-host by default; pay only when a self-host failure puts the business or the client's data at risk. Three honest exceptions (detailed in §11): deliverable transactional email, off-site backups, and (optional) managed error tracking. Everything else runs on the VPS. The realistic recurring total at medium scale: tens of €/month, not thousands.

2. Product scope and capabilities

The platform is organized into domain modules (not microservices — see §4). Each module is a bounded context with its own entities, its own API, and its own events.

2.1 Module map and priority

Module	Capability	Priority	Notes
Identity and tenancy	Users, organizations (agency/client), memberships, roles, SSO	MVP	Foundation of everything. A single identity.
CRM	Contacts, companies, leads, pipeline, scoring, activities	MVP	The heart of "sell more."
Projects and tasks	Per-client projects, tasks, milestones, time-tracking	MVP	Replaces/elevates the use of Notion.
Communication	Embedded Chatwoot, inbound/outbound email, unified timeline	MVP	One inbox per client.
Reporting and analytics	Per-client dashboards, KPIs, ROAS, basic attribution	MVP	Ingestion from Meta/Google Ads.
Client portal	Client login, view reports, approve creatives, view invoices	MVP+	Same backend, separate app.
Assets and approvals	Lightweight DAM, versions, approval flow, comments	MVP+	Storage in MinIO/S3.
Campaign management	Connect ad accounts, sync multichannel campaigns/insights	Phase 2	Meta/Google first; TikTok/LinkedIn later.
Automations/workflows	Triggers → actions (via events + n8n)	Phase 2	n8n as the visual engine; events from the bus.
Billing and contracts	Quotes, invoices, contracts, e-signature, payments	Phase 2	Stripe or invoice+link; self-host e-sign (Documenso).
AI / Insights	Summaries, lead scoring, content generation, assistants	Phase 2-3	Cross-cutting; see §9.
Advanced attribution	Multi-touch, server-side tracking, data warehouse	Future	When volume justifies it.
Marketplace/templates	Reusable report, project, and campaign templates	Future	Lever for "zero marginal cost."

2.2 What makes the platform "leading" (differentiators)

Unified per-client timeline: chat + email + leads + campaigns + invoices in a single timeline. Almost no one has this well integrated.
Automatic white-label reporting: the client sees their dashboard with their brand; the data updates itself from the ad APIs. Zero manual reporting work.
Creative approvals inside the portal: the client comments and approves; that triggers events (publish, notify, bill).
AI focused on real value (not decorative chat): summarize conversations, prioritize leads, draft first versions of copy/creatives, explain the "why" behind a ROAS change.
Everything automatable: every domain event can trigger a workflow in n8n.

3. Data architecture and multi-tenancy

This is the key piece of the system. Here we decide whether scaling to 100+ clients is trivial or a rewrite. The decision is deliberately conservative and proven.

3.1 Conceptual model: shared core + per-project data with the same structure

The owner's requirement —"each project/client has isolated internal data but they ALL share the same structure"— materializes in two planes:

┌──────────────────────────── CORE (shared, cross-tenant) ───────────────────────────────┐
│  organizations  ·  users  ·  memberships  ·  roles  ·  audit_log  ·  api_keys           │
│  (the "agency" is one organization; each "client" is another organization)              │
└─────────────────────────────────────────────────────────────────────────────────────────┘
                                        │  tenant_id (= client's organization)
                                        ▼
┌──────────────────────── PER-TENANT DATA (same structure for all) ───────────────────────┐
│  clients_profile · projects · contacts · companies · leads · deals · tasks · assets      │
│  approvals · ad_accounts · campaigns · ad_insights · invoices · contracts · messages     │
│  report_dashboards · automations · ai_jobs · events_outbox                               │
│                                                                                          │
│  EVERY table in this plane carries: tenant_id (NOT NULL) + enforced RLS                   │
└─────────────────────────────────────────────────────────────────────────────────────────┘

Key to the model: "same schema for all" = a single table definition. Each client does not have different tables; they have the same tables, filtered by tenant_id. This is what makes the marginal cost of a new client ~0 and what lets a schema improvement benefit everyone at once.

3.2 Comparison of isolation strategies

Criterion	Row-per-tenant + RLS (recommended)	Schema-per-tenant	Database-per-tenant
Isolation	Strong (enforced in the DB by RLS policy)	Very strong	Maximum
Cost per new tenant	~0 (one more row)	Medium (create schema + migrate)	High (create DB + provision)
Migrations	Once, applies to all	N times (one per schema)	N times (one per DB)
Scaling to 100 tenants	Trivial	Manageable	Operationally heavy
Scaling to 1000+ tenants	Good (with indexes + partitioning)	Hundreds of schemas = catalog pain	Unfeasible without Citus-style orchestration
Data leak risk	Low if RLS is done right (deny-by-default + tests)	Low	Almost none
Cross-tenant reporting (the agency sees everything)	Trivial (1 query)	Hard (union of N schemas)	Very hard
Operation with a team of 2 + Claude Code	Optimal	Overhead	Untenable

3.3 Recommendation: row-per-tenant with PostgreSQL Row-Level Security (RLS)

Decision: a model of shared rows + tenant_id + enforced RLS in Postgres, with a documented escape path to schema/DB-per-tenant only for an "enterprise" client who may someday require contractual physical isolation.

Reasons:

It is the 2026 consensus for multi-tenant SaaS: start with shared schema + RLS for cost efficiency and operational simplicity, with a migration path to schema/DB isolation when an enterprise client demands it.
RLS moves isolation from the app to the database: even if an application query forgets the WHERE tenant_id = ... (a Claude Code bug or ours), Postgres does not return rows from another tenant. Isolation does not depend on never making mistakes.
Trivial agency reporting: the agency needs to see across tenants (global portfolio, aggregated KPIs). With rows+RLS it is a role that bypasses the policy; with schema/DB-per-tenant it would be a nightmare of joins.
Migrations only once: a schema change is applied once and benefits all 100 clients. This is decisive for a team of 2.

How isolation is enforced (concrete pattern)

-- 1) Every tenant table carries the column and the policy
ALTER TABLE leads ENABLE ROW LEVEL SECURITY;
ALTER TABLE leads FORCE ROW LEVEL SECURITY;     -- applies even to the table owner

CREATE POLICY tenant_isolation ON leads
  USING (tenant_id = current_setting('app.tenant_id')::uuid)
  WITH CHECK (tenant_id = current_setting('app.tenant_id')::uuid);

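-- 2) The app connects with a role WITHOUT RLS bypass and, per request, sets the tenant:
--    SET LOCAL app.tenant_id = '<uuid-of-the-request-tenant>';
--    (LOCAL = lives only within the transaction; impossible to leak between requests)
--    SET takes no bind parameters; from a driver, the parameterized equivalent is:
--    SELECT set_config('app.tenant_id', $1, true);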

-- 3) A separate "agency_admin" role may have BYPASSRLS for global reporting,
--    used ONLY by audited internal endpoints.

CI guarantee (non-negotiable): a CI check that fails if any table with a tenant_id column does not have RLS enabled and forced. One forgotten table = a data breach. Also, integration tests that insert as Tenant A and verify that Tenant B gets zero rows.
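The CI guarantee above could be sketched as follows. This is a minimal illustration, not the project's actual check: it assumes tenant tables live in the `public` schema, and the names `CATALOG_QUERY`, `findUnprotectedTables`, and `assertAllTenantTablesProtected` are hypothetical. The query reads the real catalog flags `pg_class.relrowsecurity` and `pg_class.relforcerowsecurity`; running it against the database is left to whatever Postgres client the pipeline uses.

```typescript
// Shape of one row returned by CATALOG_QUERY (column aliases are this sketch's own).
interface TenantTableRow {
  table_name: string;
  rls_enabled: boolean; // pg_class.relrowsecurity
  rls_forced: boolean; // pg_class.relforcerowsecurity
}

// Lists every ordinary ('r') or partitioned ('p') table in "public"
// that has a tenant_id column, together with its RLS flags.
export const CATALOG_QUERY = `
  SELECT c.relname AS table_name,
         c.relrowsecurity AS rls_enabled,
         c.relforcerowsecurity AS rls_forced
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  JOIN pg_attribute a ON a.attrelid = c.oid
  WHERE n.nspname = 'public'
    AND c.relkind IN ('r', 'p')
    AND a.attname = 'tenant_id'
    AND NOT a.attisdropped
`;

// Returns the names of tenant tables where RLS is not both enabled and forced.
export function findUnprotectedTables(rows: TenantTableRow[]): string[] {
  return rows
    .filter((r) => !(r.rls_enabled && r.rls_forced))
    .map((r) => r.table_name)
    .sort();
}

// Throws (and therefore fails the CI job) if any tenant table is unprotected.
export function assertAllTenantTablesProtected(rows: TenantTableRow[]): void {
  const offenders = findUnprotectedTables(rows);
  if (offenders.length > 0) {
    throw new Error(`RLS not enabled and forced on: ${offenders.join(", ")}`);
  }
}
```

A table with RLS enabled but no policy denies all rows by default, so this check catches the dangerous case (a table with no RLS at all); verifying that each table also has the expected policy would be a separate step.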

3.4 Base model conventions (every tenant table)

id            uuid     PRIMARY KEY DEFAULT gen_random_uuid()   -- (or UUIDv7 for temporal ordering)
tenant_id     uuid     NOT NULL  REFERENCES organizations(id)  -- logical partition
created_at    timestamptz NOT NULL DEFAULT now()
updated_at    timestamptz NOT NULL DEFAULT now()               -- update trigger
deleted_at    timestamptz NULL                                 -- soft-delete (NULL = alive)
created_by    uuid     NULL  REFERENCES users(id)
version       integer  NOT NULL DEFAULT 1                       -- optimistic locking

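UUIDv7 as id (sortable by time → good indexes, without exposing counts). Note that gen_random_uuid() returns a random UUIDv4, so on PostgreSQL 17 a v7 value has to come from the application or from an extension.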
Universal soft-delete: never a physical DELETE on client data without a retention policy; the app's views filter deleted_at IS NULL.
Audit: an append-only (immutable) audit_log table with who/what/when/before/after, fed by triggers + the application layer.
Indexes: every tenant-table index starts with (tenant_id, ...) so queries always scope by tenant first.
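As an illustration of the UUIDv7 convention above, here is a minimal application-side generator following the RFC 9562 layout (48-bit Unix millisecond timestamp, version nibble 7, variant bits 10, the rest random). The function name `uuidv7` and the optional `nowMs` parameter are assumptions of this sketch; a maintained library would be a reasonable substitute.

```typescript
import { randomBytes } from "node:crypto";

// Builds a UUIDv7 string: bytes 0..5 carry the big-endian millisecond
// timestamp, the remaining bits are random apart from version and variant.
export function uuidv7(nowMs: number = Date.now()): string {
  const bytes: number[] = Array.from(randomBytes(16));
  for (let i = 0; i < 6; i++) {
    // Extract byte i of the 48-bit timestamp without BigInt.
    bytes[i] = Math.floor(nowMs / Math.pow(256, 5 - i)) % 256;
  }
  bytes[6] = ((bytes[6] ?? 0) & 0x0f) | 0x70; // version 7
  bytes[8] = ((bytes[8] ?? 0) & 0x3f) | 0x80; // variant 10xx
  const hex = bytes.map((b) => (b < 16 ? "0" : "") + b.toString(16)).join("");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}
```

Ids created in different milliseconds sort in creation order as plain strings; ids created within the same millisecond are ordered randomly relative to each other in this sketch.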

3.5 Multi-tenant migration strategy

Declarative, versioned migrations (Drizzle Kit or Payload's migrations, depending on §4) in packages/db. One migration = once for all tenants.
Forward-only in production; each migration with its documented rollback.
Expand → migrate → contract for non-trivial changes (add nullable column → batch backfill in a background job → make it NOT NULL/drop the old one), so a deployment never breaks the live app.
Migrations run in CI/CD (Forgejo Actions) before booting the new version.

3.6 Partitioning and data scale (when)

Up to ~100 tenants and tables of a few million rows: one Postgres, (tenant_id, ...) indexes, no partitioning. More than enough.
High-volume tables (ad_insights, messages, events_outbox, audit_log): partition by time range (monthly) once they exceed ~50–100M rows; this enables a cheap drop of old partitions per retention policy.
At 1000+ tenants or huge tables: option to partition by tenant_id (hash) or introduce Citus (transparent Postgres sharding) — but this is "more Postgres," not a model change. The data contract (rows + tenant_id) does not change.
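The monthly range partitioning described above could be driven by a small helper like the one below. It is a sketch under assumptions: the parent table is presumed to have been created with `PARTITION BY RANGE` on a timestamp column, the `<parent>_<year>_<month>` naming scheme is hypothetical, and the function names are this example's own. The generated statements use standard PostgreSQL `CREATE TABLE ... PARTITION OF ... FOR VALUES FROM ... TO` syntax, where the upper bound is exclusive.

```typescript
const pad2 = (n: number): string => (n < 10 ? `0${n}` : String(n));

function assertValid(parent: string, month: number): void {
  // Only plain lowercase identifiers are accepted, since the name is interpolated into SQL.
  if (!/^[a-z_][a-z0-9_]*$/.test(parent)) throw new Error(`unsafe table name: ${parent}`);
  if (!Number.isInteger(month) || month < 1 || month > 12) throw new Error(`invalid month: ${month}`);
}

// Hypothetical naming scheme: ad_insights_2026_07
export function partitionName(parent: string, year: number, month: number): string {
  assertValid(parent, month);
  return `${parent}_${year}_${pad2(month)}`;
}

// DDL that creates the partition covering [first day of month, first day of next month).
export function createMonthlyPartitionSql(parent: string, year: number, month: number): string {
  const name = partitionName(parent, year, month);
  const nextYear = month === 12 ? year + 1 : year;
  const nextMonth = month === 12 ? 1 : month + 1;
  const from = `${year}-${pad2(month)}-01`;
  const to = `${nextYear}-${pad2(nextMonth)}-01`;
  return `CREATE TABLE IF NOT EXISTS ${name} PARTITION OF ${parent} FOR VALUES FROM ('${from}') TO ('${to}');`;
}

// DDL that removes a whole month once it falls outside the retention policy.
export function dropPartitionSql(parent: string, year: number, month: number): string {
  return `DROP TABLE IF EXISTS ${partitionName(parent, year, month)};`;
}
```

One design consequence worth keeping in mind: PostgreSQL requires a primary key or unique constraint on a partitioned table to include the partition key, so a time-partitioned table would need a key such as (id, created_at) rather than id alone.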

4. Service and application architecture

4.1 Decision: modular monolith + workers, not microservices

For a team of 2 people + Claude Code, microservices would be operational suicide (N deployments, N databases, network latency, distributed sagas, multiplied observability). The right choice is a well-structured modular monolith:

A single primary deployable (the Next.js app + the API/domain layer).
Domain modules with clear boundaries (folders + import rules + interfaces) that communicate via internal events and through their own APIs, not by reaching into the neighbor's tables.
Separate workers (same code, different process) for background jobs.
If one day a module needs to scale independently, it already has its event boundary: extracting it to a service is a bounded refactor, not a rewrite. ("Monolith ready to decompose" pattern.)

                         ┌───────────────────────────────────────────────┐
   browser    ──HTTPS──► │  Traefik (Coolify proxy, TLS Let's Encrypt)   │
                         └───────────────┬───────────────────────────────┘
            ┌──────────────────┬─────────┴──────────┬──────────────────────┐
            ▼                  ▼                     ▼                      ▼
      apps/web           apps/dashboard         apps/portal         apps/docs (OpenAPI)
   (public web)        (internal agency)       (end client)        (docs + API ref)
            └───────┬──────────┴──────────┬──────────┘
                    ▼                     ▼
        ┌───────────────────────────────────────────────────────┐
        │   CORE  (packages/core + Payload/domain layer)         │
        │   modules: identity · crm · projects · comms ·         │
        │            reporting · campaigns · billing · ai        │
        │   ── emits domain events → events_outbox ──            │
        └───────┬───────────────────────┬───────────────┬────────┘
                ▼                        ▼               ▼
          PostgreSQL (RLS)          Redis (cache+queue)  MinIO/S3 (assets)
                │                        │
                ▼                        ▼
        ┌──────────────┐         ┌──────────────────┐
        │ Worker(s)    │◄────────│ BullMQ (queues)  │
        │ jobs/sync/ai │         └──────────────────┘
        └──────┬───────┘
               │  outbox relay → webhooks/events
        ┌──────┴────────┬───────────────┬─────────────┐
        ▼               ▼               ▼             ▼
     Chatwoot          n8n          Meta/Google    Notion
     (chat)        (automation)      Ads APIs     (mirror)

4.2 Monorepo structure (target)

feedback-os/
├── apps/
│   ├── web/            # Public web (Azurio/Next.js). Already exists.
│   ├── dashboard/      # Internal agency app (Next.js).
│   ├── portal/         # White-label client portal (Next.js).
│   ├── api/            # (opt.) dedicated API/Payload server, or colocated in dashboard.
│   ├── worker/         # Background job process(es) (BullMQ workers).
│   └── docs/           # Documentation site + OpenAPI reference (Scalar).
├── packages/
│   ├── db/             # Schema, migrations, Postgres client, RLS helpers.
│   ├── core/           # Domain modules (identity, crm, projects, comms, ...).
│   ├── events/         # Domain event definitions + outbox/relay.
│   ├── api-contracts/  # Shared Zod/OpenAPI schemas (request/response types).
│   ├── integrations/   # Meta/Google/TikTok/Chatwoot/n8n/Notion clients.
│   ├── ui/             # Shared design system (see CLAUDE.md: builder rule does NOT apply here).
│   ├── auth/           # Sessions, RBAC/ABAC, SSO.
│   └── config/         # Shared ESLint/TS/Prettier/env.
└── infra/              # Runbooks, IaC, provisioning scripts.

4.3 The role of Payload CMS: data/admin backbone, not a prison

Decision: use Payload CMS (MIT, full self-host) as the content + admin + auth + ORM/collections + autogenerated API backbone, with its official multi-tenant plugin, but with three safeguards to avoid getting trapped:

Payload on PostgreSQL (Postgres adapter, not Mongo) → so RLS and SQL remain available for the critical parts. Payload's multi-tenant plugin supports thousands of tenants with the right infrastructure.
"Hard" business data (campaigns, insights, billing ledger, events) lives in tables managed by packages/db (Drizzle) with RLS, not necessarily as Payload collections. Payload shines for editorial content, assets, admin UI, and access-controlled CRUD; for heavy analytical/financial work we run the SQL ourselves.
The public API (the one consumed by the portal and third parties) is defined in packages/api-contracts with OpenAPI/Zod, so it does not depend on Payload's internal shape. If Payload is ever replaced, the contractual API stays.

In one sentence: Payload is the system's admin panel and CMS/auth; the financial and analytical truth lives in Postgres with RLS; the public API is our own contract. This gives the best of both: Payload's speed + control and independence.

Alternative considered and rejected as the single backbone: a 100% custom backend (Hono/Nest + Drizzle, without Payload). It gives total control but forces us to build admin, auth, RBAC, media, i18n, and CRUD by hand → weeks of work that Payload gives for free. We use it only alongside Payload for the parts where Payload does not fit, not in its place.

4.4 Domain layer and events

Each module exposes use cases (application functions) that: validate (Zod), apply RBAC/ABAC, execute in a transaction that sets app.tenant_id, persist, and write an event to events_outbox within the same transaction (outbox pattern → zero lost events).
A relay (in apps/worker) reads the outbox and publishes events to: BullMQ (jobs), n8n webhooks, and internal subscribers. Idempotent.
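The use-case shape described above (set the tenant, persist, and write the event in one transaction) might look roughly like this. It is a sketch only: `Tx` is a hypothetical minimal interface that a real Postgres client would be adapted to, the `email` column on `leads` and the `type`/`payload` columns on `events_outbox` are assumed for illustration, and validation and RBAC/ABAC are omitted. `set_config(name, value, true)` is the standard PostgreSQL function whose third argument makes the setting transaction-local.

```typescript
// Minimal database surface this sketch needs (hypothetical interface).
interface Tx {
  query(sql: string, params?: unknown[]): Promise<void>;
}

interface DomainEvent {
  type: string; // e.g. "lead.created" (illustrative event name)
  payload: Record<string, unknown>;
}

// Runs `work` inside one transaction scoped to `tenantId`.
// Any error rolls back both the business write and the outbox write.
export async function withTenantTx<T>(
  db: Tx,
  tenantId: string,
  work: (tx: Tx) => Promise<T>
): Promise<T> {
  await db.query("BEGIN");
  try {
    // Transaction-local tenant setting that the RLS policies read.
    await db.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(db);
    await db.query("COMMIT");
    return result;
  } catch (err) {
    await db.query("ROLLBACK");
    throw err;
  }
}

// Example use case: the lead row and its event commit together or not at all.
export async function createLead(
  db: Tx,
  tenantId: string,
  lead: { id: string; email: string }
): Promise<void> {
  await withTenantTx(db, tenantId, async (tx) => {
    await tx.query("INSERT INTO leads (id, tenant_id, email) VALUES ($1, $2, $3)", [
      lead.id,
      tenantId,
      lead.email,
    ]);
    const event: DomainEvent = { type: "lead.created", payload: { leadId: lead.id } };
    await tx.query("INSERT INTO events_outbox (tenant_id, type, payload) VALUES ($1, $2, $3)", [
      tenantId,
      event.type,
      JSON.stringify(event.payload),
    ]);
  });
}
```

Because the outbox row is written in the same transaction as the lead, the relay can never see an event for a lead that was rolled back, and a committed lead always has its event waiting to be published.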

5. API and integration contracts

5.1 API style

Aspect	Decision
Primary style	REST + JSON, per-tenant resources, standard HTTP verbs.
GraphQL	Yes, but scoped: Payload exposes GraphQL for flexible reads of the internal dashboard. The public/integration API is REST (simpler to version, cache, and hand to third parties).
Definition	OpenAPI 3.1 autogenerated from Zod schemas (`packages/api-contracts`). A single source → TS types + runtime validation + spec.
Documentation	Scalar (interactive reference, MIT) served at `apps/docs` → `docs.feedback-studios.com`. Always up to date because it is generated on each build from the contract.
Versioning	`/v1/` prefix. Breaking changes → `/v2/`. Additive changes do not break. Deprecation policy with a `Sunset` header.
Authentication	Sessions (httpOnly cookie) for our own apps; API keys/OAuth2 per tenant for integrations and third parties.
Idempotency	`Idempotency-Key` header on POSTs that create resources or trigger external actions (payments, publish ads).
Pagination	Cursor-based (`?cursor=&limit=`) by default; stable and efficient at scale.
Rate-limiting	Per tenant and per API key (token bucket in Redis). Generous internal limits, strict external ones.
Errors	Uniform RFC 9457 (Problem Details) format.
Webhooks	Outbound, signed (HMAC) per tenant; retries with backoff; deliverable to n8n and to the client's systems.
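To make the "signed (HMAC) per tenant" webhook row concrete, here is one possible signing scheme, sketched with Node's built-in `crypto` module. The choice of HMAC-SHA256, the `"<timestamp>.<body>"` signing string, the 300-second tolerance, and the function names are all assumptions of this example, not decisions recorded in this document.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Signs "<timestamp>.<body>" with the tenant's secret; returns a hex HMAC-SHA256.
// Including the timestamp lets the receiver reject replayed deliveries.
export function signWebhook(secret: string, timestamp: number, body: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${body}`).digest("hex");
}

// Verifies a received signature. `nowSeconds` is passed in so the check is testable.
export function verifyWebhook(
  secret: string,
  timestamp: number,
  body: string,
  signature: string,
  nowSeconds: number,
  toleranceSeconds = 300
): boolean {
  // Reject deliveries whose timestamp is too far from the receiver's clock.
  if (Math.abs(nowSeconds - timestamp) > toleranceSeconds) return false;
  const expected = Buffer.from(signWebhook(secret, timestamp, body), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The receiver must verify against the raw request body exactly as delivered; re-serializing parsed JSON can change byte order or whitespace and invalidate the signature.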

5.2 Integration layer (anti-corruption)

All integrations live in packages/integrations behind our own interfaces (anti-corruption layer pattern): the domain talks about "Campaign" and "Insight" in our terms; the adapter translates to/from the external API. Thus, switching providers or Meta changing its API does not contaminate the core.

Integration	Direction	Mechanism	Notes
Meta Ads	pull insights, push campaigns	Graph API + sync jobs	Per-tenant tokens encrypted; scheduled sync.
Google Ads	pull insights	API + jobs	Same; normalize to `ad_insights`.
TikTok / LinkedIn Ads	pull insights	API + jobs	Phase 2-3; same normalized shape.
Chatwoot	bidirectional	Webhooks + API	Messages → client timeline; create/update contact.
n8n	output (trigger) and input (webhook)	Signed webhooks + API	n8n is the visual automation engine; it reacts to domain events.
Notion	outbound mirror	API	Notion stops being the truth; it syncs from Postgres for whoever still uses it.
Email	inbound and outbound	SMTP/IMAP + transactional provider	See §6.5 and §11.

Golden rule: integrations react to events and write via domain use cases (which apply RLS and audit). They never touch tables directly.
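The anti-corruption layer can be illustrated with a small adapter. Everything here is a simplified assumption: `AdInsight` is not the final `ad_insights` schema, `ExternalMetaRow` is a reduced stand-in for a provider row in which metrics arrive as strings, and `purchase_value` is a hypothetical flattened field. The point is only that the domain sees one normalized shape regardless of the provider.

```typescript
// Our own normalized shape (illustrative field names).
export interface AdInsight {
  tenantId: string;
  source: "meta" | "google";
  campaignId: string;
  date: string; // YYYY-MM-DD
  spend: number;
  impressions: number;
  clicks: number;
  revenue: number; // attributed conversion value
}

// Simplified assumption of an external row: numeric metrics are strings.
export interface ExternalMetaRow {
  campaign_id: string;
  date_start: string;
  spend: string;
  impressions: string;
  clicks: string;
  purchase_value?: string; // hypothetical flattened field
}

// Parses a provider metric, failing loudly instead of storing NaN.
function parseMetric(value: string | undefined, field: string): number {
  if (value === undefined || value.trim() === "") return 0;
  const n = Number(value);
  if (!Number.isFinite(n)) throw new Error(`invalid ${field}: ${value}`);
  return n;
}

// Adapter: translates the provider's terms into ours.
export function fromMetaRow(tenantId: string, row: ExternalMetaRow): AdInsight {
  return {
    tenantId,
    source: "meta",
    campaignId: row.campaign_id,
    date: row.date_start,
    spend: parseMetric(row.spend, "spend"),
    impressions: parseMetric(row.impressions, "impressions"),
    clicks: parseMetric(row.clicks, "clicks"),
    revenue: parseMetric(row.purchase_value, "purchase_value"),
  };
}

// ROAS = attributed revenue / ad spend; null when nothing was spent.
export function roas(insights: AdInsight[]): number | null {
  const spend = insights.reduce((sum, i) => sum + i.spend, 0);
  const revenue = insights.reduce((sum, i) => sum + i.revenue, 0);
  return spend > 0 ? revenue / spend : null;
}
```

Reporting code such as `roas` depends only on `AdInsight`, so adding a second provider means writing another adapter rather than touching the reporting module.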

6. Infrastructure and scaling

6.1 Starting point and philosophy

Today: VPS Vidot (IONOS, Ubuntu 24.04, 4 vCPU / 8 GB / 232 GB), Coolify orchestrating containers, Forgejo (git), Next.js web, docs, + n8n and Chatwoot on separate hosts. The goal is to scale vertically and then horizontally without rebuilding, measuring before spending.

6.2 Target components

┌──────────────────────────── VPS / NODE(S) (Coolify) ─────────────────────────────┐
│  Traefik (TLS)                                                                    │
│  apps: web · dashboard · portal · docs · api · worker(s)                          │
│  PostgreSQL 17  ──  PgBouncer (pooling)  ──  read replica(s) (when needed)        │
│  Redis 7 (cache + BullMQ queues + rate-limit + sessions)                          │
│  MinIO (self-host S3) for assets  ──►  (CDN in front to serve media)              │
│  Observability: OpenTelemetry Collector → SigNoz (logs+metrics+traces)            │
└──────────────────────────────────────────────────────────────────────────────────┘

6.3 Scaling plan per component (with approximate numbers)

Component	Today	Signal to scale	Scaling action	Realistic ceiling
App (Next.js/API)	1 container	Sustained CPU >70% or high p95 latency	Add replicas (stateless) behind Traefik	Tens of replicas; horizontal is trivial
Postgres	1 shared 8GB instance	High RAM/IO; slow queries	(1) more vCPU/RAM on the VPS → (2) VPS dedicated to Postgres → (3) read replicas → (4) partitioning/Citus	1 well-tuned node handles hundreds of small/medium tenants
PgBouncer	—	>~100 concurrent connections	Introduce it right away (pool in transaction mode); Postgres must not see thousands of connections	Thousands of logical clients over few real connections
Redis	—	Need for cache/queues (already in MVP)	Dedicated instance; later replica/persistence	Very high for this use
Queues/jobs	—	Ad sync, AI, email jobs	BullMQ (Redis) + `apps/worker`; scale the number of workers	Tens of thousands of jobs/min with several workers
Assets	VPS volume	Media growth / bandwidth	MinIO + CDN in front (Bunny/Cloudflare)	TB without touching the app
Orchestration	Coolify, 1 node	Need for >1 node / HA	Coolify multi-node (supports several servers) → if real HA is needed, Nomad or k3s	See §6.4

Estimated capacity of a single reasonable node

A VPS of 8 vCPU / 16–32 GB with tuned Postgres + PgBouncer + Redis + a replicated app comfortably serves 100+ tenants for an agency (this is not mass-consumer SaaS: the traffic is the team + clients, not millions of users). The bottleneck will arrive at Postgres (IO/RAM) long before the app. That is why the first scaling investment is to move Postgres to its own node and give it RAM.

6.4 When to migrate from Coolify to Kubernetes/Nomad?

Honest recommendation: do NOT migrate to Kubernetes unless there is a real need. For 2 people, k8s is a huge operational cost. Path:

Today → ~50 tenants: Coolify, one node. Vertical scaling. Enough.
~50–150 tenants / need to isolate Postgres: Coolify multi-node (app on one node, Postgres on another). Coolify manages several servers natively.
Need for HA, fine-grained autoscaling, or many services: evaluate Nomad (simpler than k8s, fits a small team) or k3s. Only if the numbers call for it.

Since the contract (containers + Postgres + Redis + S3) does not change, this migration is one of orchestration, not architecture. That is exactly "scaling without rebuilding."

6.5 Environments and CI/CD

Environments: dev (local, Docker Compose) · staging (Coolify, staging.* domain) · prod (Coolify). Same images promoted.
CI/CD: Forgejo Actions (we already have Forgejo): lint + typecheck + tests (incl. RLS isolation test) + build → migrate DB → deploy via the Coolify API. Auto-deploy on merge to main (pending wiring, see runbook 03).
Migrations run in the pipeline before booting the new version.

7. Security, identity, and compliance

7.1 Single identity

A single identity (users) for the whole ecosystem. A user belongs to one or several organizations via memberships with a role.
Organization types: agency (Feedback Studios) and client (each client). The agency has "bridge" memberships that let it operate over client tenants with audited permissions.
Auth: httpOnly cookie sessions + (future) SSO/OAuth2 (Google login for the team). For portal clients: email+password + magic link / passkeys.
2FA/MFA mandatory for agency roles.

7.2 Multi-tenant RBAC + ABAC

RBAC (roles): agency_owner, agency_member, client_admin, client_viewer, etc.
ABAC (attributes): permissions per project and per resource (e.g., an agency_member only sees assigned projects; a client_viewer only sees published reports of THEIR tenant). Rules evaluated in the packages/auth layer, in addition to RLS in the DB.
Defense in depth: RLS (DB) + permission check (app) + tenant validation on every request (SET LOCAL app.tenant_id). Three layers; none trusts the others.
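The application-layer permission check could be sketched as a pure function like the one below. The role names and the two example rules (assigned projects for agency_member, published reports for client_viewer) come from the text above; the `Actor` and `ReportResource` shapes, the client_admin rule, and the function name are assumptions made for illustration. This layer sits on top of RLS and does not replace it.

```typescript
type Role = "agency_owner" | "agency_member" | "client_admin" | "client_viewer";
type Action = "read" | "write";

interface Actor {
  userId: string;
  role: Role;
  tenantIds: string[]; // tenants reachable through memberships
  assignedProjectIds: string[];
}

interface ReportResource {
  tenantId: string;
  projectId: string;
  published: boolean;
}

// RBAC (role) combined with ABAC (tenant, project assignment, published flag).
export function canAccessReport(actor: Actor, action: Action, report: ReportResource): boolean {
  // Deny by default: no membership in the report's tenant means no access.
  if (actor.tenantIds.indexOf(report.tenantId) < 0) return false;
  switch (actor.role) {
    case "agency_owner":
      return true;
    case "agency_member":
      // Only projects the member is assigned to.
      return actor.assignedProjectIds.indexOf(report.projectId) >= 0;
    case "client_admin":
      // Assumed rule: client admins may read every report of their tenant.
      return action === "read";
    case "client_viewer":
      return action === "read" && report.published;
    default:
      return false;
  }
}
```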

7.3 Secrets, encryption, audit

Secrets out of git (already policy). Environment variables managed by Coolify; at scale, consider a lightweight vault (Infisical self-host or OpenBao) for rotation.
Encryption in transit (TLS everywhere) and at rest for sensitive data: each tenant's ad tokens encrypted at the column level (envelope encryption), encrypted backups.
Immutable audit (audit_log): every sensitive action (login, permission change, agency access to client data, exports, financial changes) is recorded with actor, tenant, IP, before/after.
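Envelope encryption for the per-tenant ad tokens could look like the sketch below, using Node's built-in `crypto` module with AES-256-GCM. The cipher choice, the `Sealed` and `EncryptedToken` shapes, and the function names are assumptions of this example; where the master key is stored (environment variable, vault) is left open, as in the text above.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// One AES-256-GCM ciphertext with its nonce and auth tag, all base64.
interface Sealed {
  iv: string;
  tag: string;
  data: string;
}

function seal(key: Buffer, plaintext: Buffer): Sealed {
  const iv = randomBytes(12); // 96-bit nonce, the usual size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const data = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return {
    iv: iv.toString("base64"),
    tag: cipher.getAuthTag().toString("base64"),
    data: data.toString("base64"),
  };
}

function open(key: Buffer, sealed: Sealed): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(sealed.iv, "base64"));
  decipher.setAuthTag(Buffer.from(sealed.tag, "base64"));
  // final() throws if the key is wrong or the ciphertext was modified.
  return Buffer.concat([decipher.update(Buffer.from(sealed.data, "base64")), decipher.final()]);
}

// What would be stored in the encrypted column (illustrative shape).
export interface EncryptedToken {
  wrappedKey: Sealed; // the data key, encrypted with the master key
  ciphertext: Sealed; // the token, encrypted with the data key
}

export function encryptToken(masterKey: Buffer, token: string): EncryptedToken {
  const dataKey = randomBytes(32); // fresh key per token
  return {
    wrappedKey: seal(masterKey, dataKey),
    ciphertext: seal(dataKey, Buffer.from(token, "utf8")),
  };
}

export function decryptToken(masterKey: Buffer, enc: EncryptedToken): string {
  const dataKey = open(masterKey, enc.wrappedKey);
  return open(dataKey, enc.ciphertext).toString("utf8");
}
```

The benefit of the envelope is that rotating the master key only requires re-wrapping the small data keys, not re-encrypting every stored token.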

7.4 GDPR and data protection

Legal basis and residency: data on an EU VPS (IONOS); define the controller and processor roles for the processing (the agency is the processor with respect to its clients' data).
ARCO/GDPR rights built in: export and deletion by tenant/contact implemented as a capability (not as a manual favor). Soft-delete + scheduled purge per retention policy.
Minimization and retention: policies per data type (messages, insights, logs) with TTL and partitions that get dropped.
DPA and consent: record lead consent (source, timestamp). The web's cookies/tracking compliant with regulation (self-host analytics like Umami/Plausible already in place).

7.5 Backups and disaster recovery

Postgres backups: automatic daily (Coolify) + WAL archiving for point-in-time recovery as the data's value grows.
Off-site (justified cost exception): copy encrypted backups to cheap external object storage (Backblaze B2 / S3, cents/GB). A backup that lives only on the same VPS is not a backup. This small recurring cost is honestly justifiable.
DR target: RPO ≤ 24h (improvable to minutes with PITR), RTO ≤ 4h. A tested restoration runbook (real periodic restore, not just "the backup exists").
Assets (MinIO): replication/copy to external storage.

8. Observability and reliability

8.1 Recommended stack: SigNoz (self-host, OpenTelemetry-native)

Decision: instrument everything with OpenTelemetry (logs + metrics + traces) and centralize in SigNoz self-hosted.

Reason: for a small team, SigNoz offers a unified stack (replacing Loki+Tempo+Mimir+Grafana in a single product) native to OpenTelemetry and with no self-host cost. By instrumenting with OTel, there is no lock-in: if we migrate one day to Grafana LGTM or to a managed offering, the instrumentation is preserved. (Valid alternative: OpenObserve, a single binary with S3 storage, even lighter; or the classic Grafana LGTM if you want the most battle-tested option. Any of them works as long as the base is OTel.)

8.2 What we measure

Structured logs (JSON) with correlated tenant_id/request_id/trace_id.
Metrics: p50/p95/p99 latency per endpoint, errors, throughput, BullMQ queue depth, ad sync lag, job health.
Traces: request → domain → DB → external integration, to debug end-to-end.
Error tracking: Sentry self-host (or GlitchTip, lighter) to group exceptions with context. (Possible cost exception: managed Sentry free tier if self-host weighs too heavily on the team — see §11.)

8.3 SLOs, alerts, health checks

Initial SLOs: API availability 99.5%; p95 < 500 ms on read endpoints; ad sync completed < 1h after the day's close.
Alerts (to Chatwoot/Telegram/email): health check failure, error rate > threshold, stuck queue, failed backup, disk/RAM at the limit, certificate about to expire.
Health checks: /healthz (liveness) and /readyz (readiness: DB + Redis + S3) that Coolify/Traefik query.
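To make the availability SLO tangible, the arithmetic below converts a target into a downtime budget. The 30-day window and the function names are assumptions of this sketch; the 99.5% figure comes from the SLO list above.

```typescript
// Downtime allowed by an availability SLO over a window, in minutes.
// Example: 99.5% over 30 days is 0.005 * 30 * 24 * 60, about 216 minutes (3.6 hours).
export function errorBudgetMinutes(slo: number, windowDays: number): number {
  if (!(slo > 0 && slo < 1)) throw new Error("slo must be strictly between 0 and 1");
  if (!(windowDays > 0)) throw new Error("windowDays must be positive");
  return (1 - slo) * windowDays * 24 * 60;
}

// Fraction of the budget still unspent (negative once the SLO is breached).
export function budgetRemainingRatio(
  slo: number,
  windowDays: number,
  downtimeMinutes: number
): number {
  const budget = errorBudgetMinutes(slo, windowDays);
  return (budget - downtimeMinutes) / budget;
}
```

Read against §7.5, one consequence follows from the numbers alone: a single incident that uses the full RTO of 4 hours (240 minutes) would exceed a 30-day budget at 99.5%, so the two targets may deserve to be reviewed together.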

9. AI in the platform

Principle: AI where it adds measurable value, not for fashion. Every AI feature has a clear "job" and can be turned off.

9.1 Prioritized use cases (highest to lowest ROI)

Case	Value	How
Conversation summary (Chatwoot/email)	Saves hours; instant client context	A job that summarizes threads and attaches them to the timeline.
Lead scoring/prioritization	Sell more by focusing effort	Model + rules over CRM data; writes `lead.score`.
Narrated reporting	Differentiator: the dashboard "explains" the ROAS	AI drafts the "why" behind the period's figures.
Copy/creative generation (first draft)	Speeds up production	Brief → copy/image variants; human review always.
Internal assistant (ask the data)	"Which clients dropped ROAS this month?"	Natural-language query over the API with the user's permissions.
Client portal assistant	Self-service	Scoped to THEIR tenant, read-only.

9.2 How it integrates (AI architecture)

"AI jobs" pattern: AI tasks are background jobs (BullMQ), not synchronous calls in the request. An ai_jobs table with status, cost, and auditable result.
Models: the Claude API (Anthropic) for quality reasoning/summaries/copy; the option of local models (Ollama on the VPS) for cheap/sensitive tasks when the hardware allows. Abstracted behind packages/integrations/ai to switch providers without touching the domain.
Data privacy: the AI respects the tenant (only sees that tenant's data); prompts and outputs are recorded in ai_jobs (audit); sensitive data is redacted/anonymized before leaving; client consent to process their data with AI.
Cost: budget per tenant and per job; result caching; use small/local models for the trivial and large models only where quality matters.
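The "budget per tenant and per job" rule could be enforced by a guard that runs before an AI job is enqueued. This is a sketch with hypothetical names and shapes (`AiJobRequest`, `BudgetPolicy`, `checkAiBudget`); cost estimates in euros and a monthly tenant budget are assumptions, and the month-to-date spend is presumed to be summed from the cost column of `ai_jobs`.

```typescript
interface AiJobRequest {
  tenantId: string;
  estimatedCostEur: number; // estimated before the job is enqueued
}

interface BudgetPolicy {
  perJobMaxEur: number;
  perTenantMonthlyMaxEur: number;
}

type BudgetDecision =
  | { allowed: true }
  | { allowed: false; reason: "job_over_limit" | "tenant_budget_exhausted" };

// Decides whether a job may be enqueued, given the tenant's month-to-date spend.
export function checkAiBudget(
  req: AiJobRequest,
  spentThisMonthEur: number,
  policy: BudgetPolicy
): BudgetDecision {
  if (req.estimatedCostEur > policy.perJobMaxEur) {
    return { allowed: false, reason: "job_over_limit" };
  }
  if (spentThisMonthEur + req.estimatedCostEur > policy.perTenantMonthlyMaxEur) {
    return { allowed: false, reason: "tenant_budget_exhausted" };
  }
  return { allowed: true };
}
```

Returning a reason rather than a bare boolean lets the caller record the refusal in `ai_jobs` and show the user why the feature did not run.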

9.3 Claude Code as part of the operational AI

The system itself is designed so that Claude Code can operate it: declarative schemas, OpenAPI always up to date, versioned migrations, isolation tests. This turns AI into a development lever, not just a product feature.

10. Phased roadmap

Realistic for 2 people + Claude Code. Each phase leaves something sellable and does not break what came before. It builds on the existing plan (PLATFORM-PLAN.md).

Phase 0 — Foundations (already in progress / immediate)

Public web on Coolify (done/in progress). Forgejo + basic CI/CD.
Add: PgBouncer + Redis + MinIO to the Coolify stack; minimal OTel+SigNoz; encrypted off-site backups. Unblocks: everything else (data, queues, assets, observability).

Phase 1 — Multi-tenant core + Identity + CRM (internal MVP)

packages/db with base schema, enforced RLS + isolation test in CI.
Payload (Postgres) + multi-tenant plugin for admin/auth/CRUD.
Modules identity, crm, projects, comms. apps/dashboard progressively replaces the PHP dashboard and the use of Notion as the source of truth.
Chatwoot connected to the unified timeline.
Unblocks: a single source of truth for clients/leads/projects.

Phase 2 — Reporting + Client portal (sellable MVP)

Meta + Google Ads integration (sync jobs → ad_insights).
Per-client dashboards (ROAS, KPIs) + basic narrated reporting.
apps/portal white-label: the client sees reports and approves creatives.
Public API v1 + OpenAPI/Scalar published in docs.
Unblocks: the "growth partners with data" pitch and client self-service. It is the first release that can be sold as a product.

Phase 3 — Automations + Billing + applied AI

Event bus (outbox+relay) wired to n8n; per-event workflows.
Billing/contracts (quotes, invoices, e-sign with Documenso, payments).
AI: lead scoring, summaries, internal assistant.
TikTok/LinkedIn Ads; assets/DAM with versions.
Unblocks: operation with almost no manual work; upsell.
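As a rough illustration of the outbox+relay idea in this phase, the sketch below keeps events in memory. It assumes the usual outbox shape (the event row is written alongside the domain change, and a separate relay step delivers pending rows in order); `createOutbox`, `OutboxEvent` and the `invoice.paid` event name are hypothetical, and the `publish` callback stands in for whatever actually delivers to n8n.

```typescript
// Hypothetical shapes; in the real system these rows would live in Postgres.
interface OutboxEvent {
  id: number;
  tenantId: string;
  type: string; // e.g. "invoice.paid" (illustrative event name)
  payload: unknown;
  publishedAt: number | null; // null until the relay has delivered it
}

function createOutbox() {
  const rows: OutboxEvent[] = [];
  let nextId = 1;

  return {
    // In a database-backed version this insert shares a transaction with the
    // domain write, so an event exists if and only if the change committed.
    append(tenantId: string, type: string, payload: unknown): OutboxEvent {
      const row: OutboxEvent = { id: nextId++, tenantId, type, payload, publishedAt: null };
      rows.push(row);
      return row;
    },

    // One relay tick: deliver pending rows in insertion order and stop at the
    // first failure, so that row is retried on the next tick (at-least-once).
    relay(publish: (event: OutboxEvent) => boolean, now: number = Date.now()): number {
      let sent = 0;
      for (const row of rows) {
        if (row.publishedAt !== null) continue;
        if (!publish(row)) break;
        row.publishedAt = now;
        sent++;
      }
      return sent;
    },

    pending(): number {
      return rows.filter((row) => row.publishedAt === null).length;
    },
  };
}
```

Delivery here is at-least-once, so a consumer such as an n8n workflow would need to tolerate duplicates, for example by deduplicating on the event id.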

Phase 4 — Scale and polish

Move Postgres to its own node + read replica; partition large tables.
Coolify multi-node if needed; HA where the SLO requires it.
Advanced attribution / warehouse if volume justifies it.
Template marketplace (zero marginal cost).

Dependencies (what unblocks what)

Phase 0 (infra) ──► Phase 1 (data+identity+CRM) ──► Phase 2 (reporting+portal = SELLABLE)
                                                          │
                                                          ▼
                                              Phase 3 (automation+billing+AI)
                                                          │
                                                          ▼
                                                   Phase 4 (scale/HA/advanced)

11. Honest risks and trade-offs

Risk / tension	Reality	Mitigation / decision
"Self-host without fees" vs deliverable email	100% own SMTP ends up in spam; invoice/portal email cannot fail.	Justified exception: use a transactional provider for critical outbound email. Options: cheap managed (Resend/Postmark, pay-per-use, no per-seat) or serious self-host (Postal/Maddy + IP with reputation). Recommendation: low-cost transactional provider for deliverability; self-host only if you take on maintaining IP reputation.
Backups only on the VPS	If the VPS dies, the backups die.	Justified exception: encrypted off-site in external object storage (B2/S3), cents/GB. Non-negotiable.
Self-hosted Sentry/observability is heavy	Self-hosted SigNoz/Sentry consume RAM and maintenance time.	Start light (GlitchTip/OpenObserve). If that becomes a burden, the managed Sentry free tier is acceptable (not abusively per-seat). OTel avoids lock-in.
Poorly implemented RLS = data leak	The biggest risk of the chosen model.	CI that requires RLS on every table with `tenant_id` + cross-tenant isolation tests (tenant A must never see tenant B's rows) + `FORCE ROW LEVEL SECURITY` + an app role without BYPASSRLS. Three layers (RLS+RBAC+`SET LOCAL`).
Lock-in to Payload	If Payload becomes limiting, migrating hurts.	Postgres underneath (not Mongo), critical data in our own tables with Drizzle, public API as our own contract independent of Payload.
Team of 2 + big ambition	Risk of over-engineering and not finishing.	Modular monolith (not microservices), sellable phases, Claude Code as a multiplier, "build what is necessary, do not gold-plate" (aligned with CLAUDE.md).
Postgres as the single point	Bottleneck and SPOF.	PgBouncer right away; read replica + PITR as it grows; tested DR. The model does not change when scaling.
Hidden cost of self-host	Operation time = real cost even without a fee.	Consciously assumed; observability + runbooks + automation with Claude Code reduce that time.
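The RLS mitigation in the table (CI that requires RLS on every table with `tenant_id`) could look roughly like the sketch below. It assumes the tables live in the `public` schema. `findRlsViolations`, `TenantTableRow` and `TENANT_TABLES_SQL` are hypothetical names, while `relrowsecurity` and `relforcerowsecurity` are the PostgreSQL catalog columns that report whether RLS is enabled and forced on a table.

```typescript
// One row per table that has a tenant_id column (shape of the query result below).
interface TenantTableRow {
  table_name: string;
  rls_enabled: boolean; // pg_class.relrowsecurity
  rls_forced: boolean; // pg_class.relforcerowsecurity
}

// Catalog query a CI step could run; the 'public' schema is an assumption.
const TENANT_TABLES_SQL = `
  SELECT c.relname             AS table_name,
         c.relrowsecurity      AS rls_enabled,
         c.relforcerowsecurity AS rls_forced
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  JOIN pg_attribute a ON a.attrelid = c.oid
  WHERE n.nspname = 'public'
    AND c.relkind IN ('r', 'p')   -- ordinary and partitioned tables
    AND a.attname = 'tenant_id'
    AND NOT a.attisdropped
`;

// Pure check: returns one message per table that would let tenant data leak.
function findRlsViolations(rows: TenantTableRow[]): string[] {
  const problems: string[] = [];
  for (const row of rows) {
    if (!row.rls_enabled) {
      problems.push(`${row.table_name}: RLS is not enabled`);
    } else if (!row.rls_forced) {
      // Without FORCE, the table owner bypasses the policies.
      problems.push(`${row.table_name}: RLS is enabled but not forced`);
    }
  }
  return problems;
}
```

A CI job would run `TENANT_TABLES_SQL` against a freshly migrated database, pass the rows to `findRlsViolations`, and fail the build if the returned list is not empty. A similar query on `pg_roles.rolbypassrls` could cover the app-role condition.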

Honest recurring total at medium scale: VPS (already exists, perhaps a larger one or a second node for Postgres) + domain + transactional email (pay-per-use) + off-site backups (cents/GB). On the order of tens of €/month, no per-seat, no feature-gating. Consistent with the owner's philosophy and, at the same time, robust at scale.

12. Executive summary of decisions

Multi-tenancy = row-per-tenant + enforced RLS in PostgreSQL. Same schema for all tenants, isolation in the DB (not in the code), trivial agency reporting, migrations run only once, marginal cost per client ~0. Escape route to schema/DB-per-tenant only for a future enterprise client.
Service architecture = modular monolith + workers, with clearly bounded domain modules and events (outbox). No microservices for a team of 2. Ready to decompose if one day needed.
Payload CMS (on Postgres) as the admin/auth/CMS backbone, but with the critical financial/analytical data in our own tables with RLS and a contractual public API (OpenAPI/Zod) independent of Payload → speed without lock-in.
API-first + event-driven: public REST/JSON (autogenerated OpenAPI 3.1, living docs with Scalar), GraphQL scoped to the internal dashboard, idempotency, cursor pagination, signed webhooks. Integrations behind an anti-corruption layer.
Infra: Coolify one node → multi-node, vertical scaling first. PgBouncer + Redis + MinIO + CDN from the MVP. Move Postgres to its own node and a read replica when the numbers call for it. k8s/Nomad only if real HA requires it. Scaling = more machine/nodes, not rebuilding.
Layered security: RLS + RBAC/ABAC + SET LOCAL app.tenant_id per request; secrets out of git; per-column token encryption; immutable audit; GDPR built in (export/deletion by tenant); encrypted off-site backups with tested DR.
OpenTelemetry observability → SigNoz self-host (no lock-in), GlitchTip/Sentry error tracking, SLOs and alerts in place before the first paying client.
AI as auditable, tenant-aware "jobs": summaries, lead scoring, narrated reporting, copy generation (with human review). Claude API + local option (Ollama), behind a provider abstraction.
Honest, bounded cost exceptions: deliverable transactional email and off-site backups (and optionally error tracking). Everything else, self-host. Tens of €/month, no per-seat.
Roadmap in sellable phases: Phase 2 (reporting + client portal) is the first commercializable release; no phase breaks the previous one.
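To make the "SET LOCAL app.tenant_id per request" decision concrete, here is a minimal sketch of a per-request transaction wrapper. `withTenant` and `SqlClient` are hypothetical names, and the client shape is only an assumption about the driver (an object with a promise-returning `query` method). Since `SET LOCAL` does not accept bind parameters, the sketch uses PostgreSQL's `set_config(name, value, true)`, which has the same transaction-local effect.

```typescript
// Minimal client shape assumed for this sketch, not a specific driver's API.
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

// Runs `work` inside a transaction whose app.tenant_id setting is visible to
// RLS policies, e.g. a policy of the (illustrative) form:
//   USING (tenant_id = current_setting('app.tenant_id')::uuid)
async function withTenant<T>(
  client: SqlClient,
  tenantId: string,
  work: (client: SqlClient) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    // Third argument `true` makes the setting local to this transaction, so it
    // cannot leak to the next request that reuses a pooled connection.
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Scoping the setting to the transaction matters with PgBouncer in transaction pooling mode, where a session-level `SET` could otherwise be observed by a different request.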

The two core recommendations, one line each

Recommended multi-tenant model: shared row-per-tenant with tenant_id and enforced Row-Level Security in PostgreSQL (one schema for all), with an escape path to schema/DB-per-tenant for a future enterprise.
Recommended service architecture: modular monolith in TypeScript (Next.js + Payload on Postgres) with domain modules, events via outbox, and background workers, deployed by Coolify and ready to decompose if needed.

Sources (2026 best practices consulted)

Multi-tenant RLS vs schema-per-tenant in Postgres (2026 consensus: start shared+RLS, escape to schema/DB for enterprise): propelius.tech, thenile.dev, oneuptime.com, 1xapi.com
Payload CMS multi-tenant at scale (500+/thousands of tenants with the right infra): payloadcms.com/docs/plugins/multi-tenant, execudea.com
Job queues (BullMQ/Redis vs pg-boss): pkgpulse.com, bullmq.io
OTel-native self-host observability (SigNoz/OpenObserve vs LGTM): signoz.io, parseable.com
