
ADR 0002 — Feedback OS: Omnisystem architecture (multi-tenant, scalable to 100+ clients)

  • Date: 2026-06-09
  • Status: Proposal (master design) — supersedes and extends ADR 0001
  • Authors: Claude Code (development team) · Pamela (product)
  • Audience: Feedback Studios team (2 people) + Claude Code as the engineering team.

Master architecture document. It does not describe "what exists today," but rather the target system and the path to reach it without rebuilding anything when there are 100+ clients. It is intentionally ambitious: the stated goal is to be the best marketing agency platform on the market, self-host-first, with no recurring fees except where honestly justifiable at scale.


Table of Contents

  1. Vision and architecture principles
  2. Product scope and capabilities
  3. Data architecture and multi-tenancy
  4. Service and application architecture
  5. API and integration contracts
  6. Infrastructure and scaling
  7. Security, identity, and compliance
  8. Observability and reliability
  9. AI in the platform
  10. Phased roadmap
  11. Honest risks and trade-offs
  12. Executive summary of decisions

1. Vision and architecture principles

1.1 What "the best agency system" means

It is not "many features." It is a single business operating system where:

  • A lead arriving through the website, a message in Chatwoot, a Meta Ads campaign, an invoice, and a project task all live in the same data graph, with a single client identity and a single history.
  • The agency proves value with data (ROAS, attribution, per-client reporting) instead of attaching screenshots.
  • The end client logs in to a branded portal and sees their campaigns, approves creatives, and pays, without loose emails.
  • Everything is automatable (n8n + events) and buildable by Claude Code (a single language, a single repo, typed contracts).

The competitive advantage is not the code: it is total integration + unified data + marginal cost near zero per new client.

1.2 Principles (non-negotiable)

| # | Principle | Concrete implication |
| --- | --- | --- |
| P1 | Multi-tenant from day 1 | tenant_id on every piece of client data; isolation enforced in the DB, not in application code. |
| P2 | Single source of truth | A single relational database (PostgreSQL) as the source; Notion/Chatwoot/n8n are mirrors/actuators, never the truth. |
| P3 | API-first | Every capability exists first as a typed API before becoming a screen. The UI consumes its own public API. |
| P4 | Event-driven | State changes emit domain events (outbox). Integrations react to events; they are not coupled to the monolith. |
| P5 | Self-host-first, pragmatic | Open-source on the VPS by default. We pay only where self-hosting puts the business at risk (deliverable email, off-site backups, perhaps errors). Documented and reversible. |
| P6 | Scalable to 100+ (and 1000+) without rebuilding | Decisions that do NOT change shape as we grow: the multi-tenant model, the API contract, and the event bus are chosen so that scaling is "more machine/more nodes," not "rewrite." |
| P7 | Secure by default | Deny-by-default, RLS in Postgres, secrets out of git, immutable audit, GDPR built in. |
| P8 | Observable | Centralized logs/metrics/traces and SLOs from before the first paying client. |
| P9 | Automatable with Claude Code | TypeScript end-to-end, declarative schemas, autogenerated OpenAPI, versioned migrations, tests. The system documents itself so the AI can operate it. |
| P10 | Low marginal cost per tenant | A new client = new rows + config, not new infrastructure. |

1.3 The central tension (and how it is resolved)

"Self-hosted without subscriptions" vs "the best system on the market at scale."

It is resolved with one rule: self-host by default; pay only when a self-host failure puts the business or the client's data at risk. Three honest exceptions (detailed in §11): deliverable transactional email, off-site backups, and (optional) managed error tracking. Everything else runs on the VPS. The realistic recurring total at medium scale: tens of €/month, not thousands.


2. Product scope and capabilities

The platform is organized into domain modules (not microservices — see §4). Each module is a bounded context with its own entities, its own API, and its own events.

2.1 Module map and priority

| Module | Capability | Priority | Notes |
| --- | --- | --- | --- |
| Identity and tenancy | Users, organizations (agency/client), memberships, roles, SSO | MVP | Foundation of everything. A single identity. |
| CRM | Contacts, companies, leads, pipeline, scoring, activities | MVP | The heart of "sell more." |
| Projects and tasks | Per-client projects, tasks, milestones, time-tracking | MVP | Replaces/elevates the use of Notion. |
| Communication | Embedded Chatwoot, inbound/outbound email, unified timeline | MVP | One inbox per client. |
| Reporting and analytics | Per-client dashboards, KPIs, ROAS, basic attribution | MVP | Ingestion from Meta/Google Ads. |
| Client portal | Client login, view reports, approve creatives, view invoices | MVP+ | Same backend, separate app. |
| Assets and approvals | Lightweight DAM, versions, approval flow, comments | MVP+ | Storage in MinIO/S3. |
| Campaign management | Connect ad accounts, sync multichannel campaigns/insights | Phase 2 | Meta/Google first; TikTok/LinkedIn later. |
| Automations/workflows | Triggers → actions (via events + n8n) | Phase 2 | n8n as the visual engine; events from the bus. |
| Billing and contracts | Quotes, invoices, contracts, e-signature, payments | Phase 2 | Stripe or invoice+link; self-host e-sign (Documenso). |
| AI / Insights | Summaries, lead scoring, content generation, assistants | Phase 2-3 | Cross-cutting; see §9. |
| Advanced attribution | Multi-touch, server-side tracking, data warehouse | Future | When volume justifies it. |
| Marketplace/templates | Reusable report, project, and campaign templates | Future | Lever for "zero marginal cost." |

2.2 What makes the platform "leading" (differentiators)

  1. Unified per-client timeline: chat + email + leads + campaigns + invoices in a single timeline. Almost no one has this well integrated.
  2. Automatic white-label reporting: the client sees their dashboard with their brand; the data updates itself from the ad APIs. Zero manual reporting work.
  3. Creative approvals inside the portal: the client comments and approves; that triggers events (publish, notify, bill).
  4. AI focused on real value (not decorative chat): summarize conversations, prioritize leads, draft first versions of copy/creatives, explain the "why" behind a ROAS change.
  5. Everything automatable: every domain event can trigger a workflow in n8n.

3. Data architecture and multi-tenancy

This is the key piece of the system. Here we decide whether scaling to 100+ clients is trivial or a rewrite. The decision is deliberately conservative and proven.

3.1 Conceptual model: shared core + per-project data with the same structure

The owner's requirement ("each project/client has isolated internal data but they ALL share the same structure") materializes in two planes:

┌──────────────────────────── CORE (shared, cross-tenant) ───────────────────────────────┐
│  organizations  ·  users  ·  memberships  ·  roles  ·  audit_log  ·  api_keys           │
│  (the "agency" is one organization; each "client" is another organization)              │
└─────────────────────────────────────────────────────────────────────────────────────────┘
                                        │  tenant_id (= client's organization)

┌──────────────────────── PER-TENANT DATA (same structure for all) ───────────────────────┐
│  clients_profile · projects · contacts · companies · leads · deals · tasks · assets      │
│  approvals · ad_accounts · campaigns · ad_insights · invoices · contracts · messages     │
│  report_dashboards · automations · ai_jobs · events_outbox                               │
│                                                                                          │
│  EVERY table in this plane carries: tenant_id (NOT NULL) + enforced RLS                   │
└─────────────────────────────────────────────────────────────────────────────────────────┘

Key to the model: "same schema for all" = a single table definition. Each client does not have different tables; they have the same tables, filtered by tenant_id. This is what makes the marginal cost of a new client ~0 and what lets a schema improvement benefit everyone at once.

3.2 Comparison of isolation strategies

| Criterion | Row-per-tenant + RLS (recommended) | Schema-per-tenant | Database-per-tenant |
| --- | --- | --- | --- |
| Isolation | Strong (enforced in the DB by RLS policy) | Very strong | Maximum |
| Cost per new tenant | ~0 (one more row) | Medium (create schema + migrate) | High (create DB + provision) |
| Migrations | Once, applies to all | N times (one per schema) | N times (one per DB) |
| Scaling to 100 tenants | Trivial | Manageable | Operationally heavy |
| Scaling to 1000+ tenants | Good (with indexes + partitioning) | Hundreds of schemas = catalog pain | Unfeasible without Citus-style orchestration |
| Data leak risk | Low if RLS is done right (deny-by-default + tests) | Low | Almost none |
| Cross-tenant reporting (the agency sees everything) | Trivial (1 query) | Hard (union of N schemas) | Very hard |
| Operation with a team of 2 + Claude Code | Optimal | Overhead | Untenable |

3.3 Recommendation: row-per-tenant with PostgreSQL Row-Level Security (RLS)

Decision: a model of shared rows + tenant_id + enforced RLS in Postgres, with a documented escape path to schema/DB-per-tenant only for an "enterprise" client who may someday require contractual physical isolation.

Reasons:

  1. It is the 2026 consensus for multi-tenant SaaS: start with shared schema + RLS for cost efficiency and operational simplicity, with a migration path to schema/DB isolation when an enterprise client demands it.
  2. RLS moves isolation from the app to the database: even if an application query forgets the WHERE tenant_id = ... (a Claude Code bug or ours), Postgres does not return rows from another tenant. Isolation does not depend on never making mistakes.
  3. Trivial agency reporting: the agency needs to see across tenants (global portfolio, aggregated KPIs). With rows+RLS it is a role that bypasses the policy; with schema/DB-per-tenant it would be a nightmare of joins.
  4. Migrations only once: a schema change is applied once and benefits all 100 clients. This is decisive for a team of 2.

How isolation is enforced (concrete pattern)

-- 1) Every tenant table carries the column and the policy
ALTER TABLE leads ENABLE ROW LEVEL SECURITY;
ALTER TABLE leads FORCE ROW LEVEL SECURITY;     -- applies even to the table owner

CREATE POLICY tenant_isolation ON leads
  USING (tenant_id = current_setting('app.tenant_id')::uuid)
  WITH CHECK (tenant_id = current_setting('app.tenant_id')::uuid);

-- 2) The app connects with a role WITHOUT RLS bypass and, per request, sets the tenant:
--    SET LOCAL app.tenant_id = '<uuid-of-the-request-tenant>';
--    (LOCAL = lives only within the transaction; impossible to leak between requests)

-- 3) A separate "agency_admin" role may have BYPASSRLS for global reporting,
--    used ONLY by audited internal endpoints.

CI guarantee (non-negotiable): a CI check that fails if any table with a tenant_id column does not have RLS enabled and forced. One forgotten table = a data breach. Also, integration tests that insert as Tenant A and verify that Tenant B gets zero rows.
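The CI guarantee above could be sketched roughly as follows. This is a minimal illustration, not the project's actual check: the `TableRlsStatus` shape and the `findUnprotectedTables` helper are hypothetical names, and the catalog query that would fill the rows (for example from `pg_class`, whose `relrowsecurity` and `relforcerowsecurity` flags reflect ENABLE and FORCE ROW LEVEL SECURITY) is assumed rather than shown.

```typescript
// One row per table, as a CI script might assemble it from the Postgres catalog.
// relrowsecurity and relforcerowsecurity are the pg_class flags behind
// ENABLE and FORCE ROW LEVEL SECURITY; the catalog query itself is not shown.
interface TableRlsStatus {
  table: string;
  hasTenantId: boolean;
  relrowsecurity: boolean;
  relforcerowsecurity: boolean;
}

// Returns every tenant table that is missing ENABLE or FORCE, sorted by name.
function findUnprotectedTables(tables: TableRlsStatus[]): string[] {
  return tables
    .filter((t) => t.hasTenantId && !(t.relrowsecurity && t.relforcerowsecurity))
    .map((t) => t.table)
    .sort();
}

// Hypothetical snapshot: "messages" is enabled but not forced, "assets" has neither.
const catalogSnapshot: TableRlsStatus[] = [
  { table: "leads", hasTenantId: true, relrowsecurity: true, relforcerowsecurity: true },
  { table: "messages", hasTenantId: true, relrowsecurity: true, relforcerowsecurity: false },
  { table: "assets", hasTenantId: true, relrowsecurity: false, relforcerowsecurity: false },
  { table: "users", hasTenantId: false, relrowsecurity: false, relforcerowsecurity: false },
];

const unprotected = findUnprotectedTables(catalogSnapshot);
// A CI job would fail the build whenever this list is non-empty.
```

Keeping the decision in a pure function means the rule itself can be unit-tested without a database; only the catalog query needs a live Postgres.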

3.4 Base model conventions (every tenant table)

id            uuid     PRIMARY KEY DEFAULT gen_random_uuid()   -- (or UUIDv7 for temporal ordering)
tenant_id     uuid     NOT NULL  REFERENCES organizations(id)  -- logical partition
created_at    timestamptz NOT NULL DEFAULT now()
updated_at    timestamptz NOT NULL DEFAULT now()               -- update trigger
deleted_at    timestamptz NULL                                 -- soft-delete (NULL = alive)
created_by    uuid     NULL  REFERENCES users(id)
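version       integer  NOT NULL DEFAULT 1                       -- optimistic locking

  • UUIDv7 as the preferred id (sortable by time → good indexes, without exposing counts). gen_random_uuid() returns a random (version 4) UUID, so adopting UUIDv7 means swapping that default for a UUIDv7 generator.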
  • Universal soft-delete: never a physical DELETE on client data without a retention policy; the app's views filter deleted_at IS NULL.
  • Audit: an append-only (immutable) audit_log table with who/what/when/before/after, fed by triggers + the application layer.
  • Indexes: every tenant-table index starts with (tenant_id, ...) so queries always scope by tenant first.
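To make the "sortable by time" property of UUIDv7 concrete, here is a small sketch of the bit layout: a 48-bit millisecond timestamp, the version nibble 7, the variant bits 10, and random bits everywhere else. The `uuidv7` function below is illustrative only; it takes the timestamp and the random bytes as parameters so the result is deterministic, and a real generator would draw the bytes from a cryptographic source.

```typescript
// Builds a UUIDv7 string from a millisecond timestamp and 10 caller-supplied random bytes.
function uuidv7(timestampMs: number, random: Uint8Array): string {
  if (random.length < 10) throw new Error("uuidv7 needs 10 random bytes");
  const bytes = new Uint8Array(16);
  // Bytes 0..5: 48-bit big-endian Unix timestamp in milliseconds.
  for (let i = 0; i < 6; i++) {
    bytes[i] = Math.floor(timestampMs / 2 ** (8 * (5 - i))) % 256;
  }
  bytes[6] = 0x70 | (random[0] & 0x0f); // version 7 in the high nibble
  bytes[7] = random[1];
  bytes[8] = 0x80 | (random[2] & 0x3f); // variant bits "10"
  for (let i = 3; i < 10; i++) bytes[6 + i] = random[i];
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
  return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20)}`;
}

// Same random bytes, later timestamp: the later id sorts after the earlier one.
const sampleRandom = new Uint8Array(10).fill(0xff);
const earlierId = uuidv7(1000, sampleRandom);
const laterId = uuidv7(2000, sampleRandom);
```

Because the timestamp occupies the leading bytes, ids created in different milliseconds sort in creation order as plain strings, which is what keeps index inserts mostly append-like. Within the same millisecond the order depends on the random bits.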

3.5 Multi-tenant migration strategy

  • Declarative, versioned migrations (Drizzle Kit or Payload's migrations, depending on §4) in packages/db. One migration = once for all tenants.
  • Forward-only in production: a faulty migration is corrected by a new migration, and every migration ships with a documented rollback procedure.
  • Expand → migrate → contract for non-trivial changes (add nullable column → batch backfill in a background job → make it NOT NULL/drop the old one), so deployment never breaks the app live.
  • Migrations run in CI/CD (Forgejo Actions) before booting the new version.
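The "migrate" step of expand → migrate → contract could look roughly like the loop below. It is a sketch under assumptions: `backfillInBatches` and the callback `updateBatch` are hypothetical names, the column names in the comment are invented for illustration, and the loop is synchronous only to keep the example small (a real job would await the database driver).

```typescript
// Runs a backfill as small batches so no single statement holds locks for long.
// updateBatch stands in for a statement such as (hypothetical columns):
//   UPDATE leads SET source_v2 = source
//   WHERE id IN (SELECT id FROM leads WHERE source_v2 IS NULL LIMIT $1)
// and returns how many rows it changed.
function backfillInBatches(
  updateBatch: (limit: number) => number,
  batchSize: number,
  maxBatches = 10000,
): number {
  let total = 0;
  for (let i = 0; i < maxBatches; i++) {
    const changed = updateBatch(batchSize);
    total += changed;
    if (changed < batchSize) return total; // a short batch means nothing is left
  }
  throw new Error("backfill did not finish within maxBatches");
}

// In-memory stand-in for the table: 2500 rows still need the new column filled.
let rowsRemaining = 2500;
let batchCalls = 0;
const backfilled = backfillInBatches((limit) => {
  batchCalls++;
  const n = Math.min(limit, rowsRemaining);
  rowsRemaining -= n;
  return n;
}, 1000);
```

Only after the backfill reports completion would the "contract" migration add the NOT NULL constraint or drop the old column.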

3.6 Partitioning and data scale (when)

  • Up to ~100 tenants and tables of a few million rows: one Postgres, (tenant_id, ...) indexes, no partitioning. More than enough.
  • High-volume tables (ad_insights, messages, events_outbox, audit_log): partition by time range (monthly) once they exceed ~50–100M rows; this enables a cheap drop of old partitions per retention policy.
  • At 1000+ tenants or huge tables: option to partition by tenant_id (hash) or introduce Citus (transparent Postgres sharding) — but this is "more Postgres," not a model change. The data contract (rows + tenant_id) does not change.
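As an illustration of monthly range partitioning, the helper below generates the DDL for one month's partition. It assumes the parent table was declared with PARTITION BY RANGE on a timestamp column; `monthlyPartitionDdl` and the partition naming scheme are hypothetical, not something the existing schema defines.

```typescript
// Generates the DDL for one monthly partition of a range-partitioned parent table.
// The upper bound is exclusive, so consecutive months never overlap.
function monthlyPartitionDdl(parent: string, year: number, month: number): string {
  if (!/^[a-z_][a-z0-9_]*$/.test(parent)) throw new Error("unexpected table name");
  if (month < 1 || month > 12) throw new Error("month must be between 1 and 12");
  const pad = (n: number): string => (n < 10 ? `0${n}` : `${n}`);
  const nextYear = month === 12 ? year + 1 : year;
  const nextMonth = month === 12 ? 1 : month + 1;
  const from = `${year}-${pad(month)}-01`;
  const to = `${nextYear}-${pad(nextMonth)}-01`;
  return (
    `CREATE TABLE IF NOT EXISTS ${parent}_${year}_${pad(month)} PARTITION OF ${parent} ` +
    `FOR VALUES FROM ('${from}') TO ('${to}');`
  );
}

const decemberPartition = monthlyPartitionDdl("ad_insights", 2026, 12);
```

A scheduled job could create next month's partition ahead of time with this statement, and the retention policy then becomes a DROP TABLE on the oldest partition instead of a large DELETE.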

4. Service and application architecture

4.1 Decision: modular monolith + workers, not microservices

For a team of 2 people + Claude Code, microservices would be operational suicide (N deployments, N databases, network latency, distributed sagas, multiplied observability). The right choice is a well-structured modular monolith:

  • A single primary deployable (the Next.js app + the API/domain layer).
  • Domain modules with clear boundaries (folders + import rules + interfaces) that communicate via internal events and through their own APIs, not by reaching into the neighbor's tables.
  • Separate workers (same code, different process) for background jobs.
  • If one day a module needs to scale independently, it already has its event boundary: extracting it to a service is a bounded refactor, not a rewrite. ("Monolith ready to decompose" pattern.)

                         ┌───────────────────────────────────────────────┐
   browser    ──HTTPS──► │  Traefik (Coolify proxy, TLS Let's Encrypt)   │
                         └───────────────┬───────────────────────────────┘
            ┌──────────────────┬─────────┴──────────┬──────────────────────┐
            ▼                  ▼                     ▼                      ▼
      apps/web           apps/dashboard         apps/portal         apps/docs (OpenAPI)
   (public web)        (internal agency)       (end client)        (docs + API ref)
            └───────┬──────────┴──────────┬──────────┘
                    ▼                     ▼
        ┌───────────────────────────────────────────────────────┐
        │   CORE  (packages/core + Payload/domain layer)         │
        │   modules: identity · crm · projects · comms ·         │
        │            reporting · campaigns · billing · ai        │
        │   ── emits domain events → events_outbox ──            │
        └───────┬───────────────────────┬───────────────┬────────┘
                ▼                        ▼               ▼
          PostgreSQL (RLS)          Redis (cache+queue)  MinIO/S3 (assets)
                │                        │
                ▼                        ▼
        ┌──────────────┐         ┌──────────────────┐
        │ Worker(s)    │◄────────│ BullMQ (queues)  │
        │ jobs/sync/ai │         └──────────────────┘
        └──────┬───────┘
               │  outbox relay → webhooks/events
        ┌──────┴────────┬───────────────┬─────────────┐
        ▼               ▼               ▼             ▼
     Chatwoot          n8n          Meta/Google    Notion
     (chat)        (automation)      Ads APIs     (mirror)

4.2 Monorepo structure (target)

feedback-os/
├── apps/
│   ├── web/            # Public web (Azurio/Next.js). Already exists.
│   ├── dashboard/      # Internal agency app (Next.js).
│   ├── portal/         # White-label client portal (Next.js).
│   ├── api/            # (opt.) dedicated API/Payload server, or colocated in dashboard.
│   ├── worker/         # Background job process(es) (BullMQ workers).
│   └── docs/           # Documentation site + OpenAPI reference (Scalar).
├── packages/
│   ├── db/             # Schema, migrations, Postgres client, RLS helpers.
│   ├── core/           # Domain modules (identity, crm, projects, comms, ...).
│   ├── events/         # Domain event definitions + outbox/relay.
│   ├── api-contracts/  # Shared Zod/OpenAPI schemas (request/response types).
│   ├── integrations/   # Meta/Google/TikTok/Chatwoot/n8n/Notion clients.
│   ├── ui/             # Shared design system (see CLAUDE.md: builder rule does NOT apply here).
│   ├── auth/           # Sessions, RBAC/ABAC, SSO.
│   └── config/         # Shared ESLint/TS/Prettier/env.
└── infra/              # Runbooks, IaC, provisioning scripts.

4.3 The role of Payload CMS: data/admin backbone, not a prison

Decision: use Payload CMS (MIT, full self-host) as the content + admin + auth + ORM/collections + autogenerated API backbone, with its official multi-tenant plugin, but with three safeguards to avoid getting trapped:

  1. Payload on PostgreSQL (Postgres adapter, not Mongo) → so RLS and SQL remain available for the critical parts. Payload's multi-tenant plugin supports thousands of tenants with the right infrastructure.
  2. "Hard" business data (campaigns, insights, billing ledger, events) lives in tables managed by packages/db (Drizzle) with RLS, not necessarily as Payload collections. Payload shines for editorial content, assets, admin UI, and access-controlled CRUD; for heavy analytical/financial work we run the SQL ourselves.
  3. The public API (the one consumed by the portal and third parties) is defined in packages/api-contracts with OpenAPI/Zod, so it does not depend on Payload's internal shape. If Payload is ever replaced, the contractual API stays.

In one sentence: Payload is the system's admin panel and CMS/auth; the financial and analytical truth lives in Postgres with RLS; the public API is our own contract. This gives the best of both: Payload's speed + control and independence.

Alternative considered and rejected as the single backbone: a 100% custom backend (Hono/Nest + Drizzle, without Payload). It gives total control but forces us to build admin, auth, RBAC, media, i18n, and CRUD by hand → weeks of work that Payload gives for free. Custom code of this kind is used only alongside Payload, for the parts where Payload does not fit, not in its place.

4.4 Domain layer and events

  • Each module exposes use cases (application functions) that: validate (Zod), apply RBAC/ABAC, execute in a transaction that sets app.tenant_id, persist, and write an event to events_outbox within the same transaction (outbox pattern → zero lost events).
  • A relay (in apps/worker) reads the outbox and publishes events to: BullMQ (jobs), n8n webhooks, and internal subscribers. Idempotent.
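The use-case flow above can be sketched as the ordered list of statements one transaction would send. This is a simplified illustration: `planUseCase`, `Statement`, and `DomainEvent` are hypothetical names, the `events_outbox` columns shown are assumptions, and a real implementation would execute the statements through the Postgres client rather than return them. The tenant is set with `set_config('app.tenant_id', $1, true)`, which has the same transaction-local effect as SET LOCAL but accepts a bound parameter.

```typescript
interface Statement {
  sql: string;
  params: unknown[];
}

interface DomainEvent {
  type: string;
  payload: Record<string, unknown>;
}

// Returns the statements for one use case, in the order they must run.
// The domain write and the outbox insert share one transaction, so either
// both are committed or neither is.
function planUseCase(tenantId: string, write: Statement, event: DomainEvent): Statement[] {
  return [
    { sql: "BEGIN", params: [] },
    // Transaction-local tenant setting, read by the RLS policies.
    { sql: "SELECT set_config('app.tenant_id', $1, true)", params: [tenantId] },
    write,
    {
      // Column list is an assumption; id and created_at come from defaults.
      sql: "INSERT INTO events_outbox (tenant_id, type, payload) VALUES ($1, $2, $3)",
      params: [tenantId, event.type, JSON.stringify(event.payload)],
    },
    { sql: "COMMIT", params: [] },
  ];
}

const leadPlan = planUseCase(
  "11111111-1111-7111-8111-111111111111",
  { sql: "INSERT INTO leads (tenant_id, email) VALUES ($1, $2)", params: ["11111111-1111-7111-8111-111111111111", "ana@example.com"] },
  { type: "lead.created", payload: { email: "ana@example.com" } },
);
```

Because the event row is written before COMMIT, the relay can only ever see events whose domain change was also committed; idempotency on the consumer side then covers the case where the relay publishes the same row twice.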

5. API and integration contracts

5.1 API style

| Aspect | Decision |
| --- | --- |
| Primary style | REST + JSON, per-tenant resources, standard HTTP verbs. |
| GraphQL | Yes, but scoped: Payload exposes GraphQL for flexible reads of the internal dashboard. The public/integration API is REST (simpler to version, cache, and hand to third parties). |
| Definition | OpenAPI 3.1 autogenerated from Zod schemas (packages/api-contracts). A single source → TS types + runtime validation + spec. |
| Documentation | Scalar (interactive reference, MIT) served by apps/docs (docs.feedback-studios.com). Always up to date because it is generated on each build from the contract. |
| Versioning | /v1/ prefix. Breaking changes → /v2/. Additive changes do not break. Deprecation policy with a Sunset header. |
| Authentication | Sessions (httpOnly cookie) for our own apps; API keys/OAuth2 per tenant for integrations and third parties. |
| Idempotency | Idempotency-Key header on POSTs that create resources or trigger external actions (payments, publish ads). |
| Pagination | Cursor-based (?cursor=&limit=) by default; stable and efficient at scale. |
| Rate-limiting | Per tenant and per API key (token bucket in Redis). Generous internal limits, strict external ones. |
| Errors | Uniform RFC 9457 (Problem Details) format. |
| Webhooks | Outbound, signed (HMAC) per tenant; retries with backoff; deliverable to n8n and to the client's systems. |
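The token-bucket rate limit mentioned above can be illustrated in memory. The sketch below is not the Redis implementation the decision calls for: `TokenBucketLimiter` is a hypothetical class that keeps its state in a Map and takes the current time as a parameter so its behavior is deterministic. The same arithmetic would apply with the bucket state stored in Redis under a per-tenant or per-API-key key.

```typescript
interface Bucket {
  tokens: number;
  updatedAtMs: number;
}

class TokenBucketLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  // Returns true and spends one token if the key still has budget, false otherwise.
  tryConsume(key: string, nowMs: number): boolean {
    const previous = this.buckets.get(key) ?? { tokens: this.capacity, updatedAtMs: nowMs };
    const elapsedSeconds = Math.max(0, nowMs - previous.updatedAtMs) / 1000;
    // Refill for the elapsed time, never above the bucket capacity.
    const tokens = Math.min(this.capacity, previous.tokens + elapsedSeconds * this.refillPerSecond);
    const allowed = tokens >= 1;
    this.buckets.set(key, { tokens: allowed ? tokens - 1 : tokens, updatedAtMs: nowMs });
    return allowed;
  }
}

// Burst of 2 requests, refilling 1 token per second, keyed per tenant.
const limiter = new TokenBucketLimiter(2, 1);
const burst = [
  limiter.tryConsume("tenant-a", 0),
  limiter.tryConsume("tenant-a", 0),
  limiter.tryConsume("tenant-a", 0),
];
const afterOneSecond = limiter.tryConsume("tenant-a", 1000);
const otherTenant = limiter.tryConsume("tenant-b", 0);
```

The capacity sets the allowed burst and the refill rate sets the sustained throughput, which is how "generous internal limits, strict external ones" would translate into two different parameter pairs.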

5.2 Integration layer (anti-corruption)

All integrations live in packages/integrations behind our own interfaces (anti-corruption layer pattern): the domain talks about "Campaign" and "Insight" in our terms; the adapter translates to/from the external API. Thus, switching providers or Meta changing its API does not contaminate the core.

| Integration | Direction | Mechanism | Notes |
| --- | --- | --- | --- |
| Meta Ads | pull insights, push campaigns | Graph API + sync jobs | Per-tenant tokens encrypted; scheduled sync. |
| Google Ads | pull insights | API + jobs | Same; normalize to ad_insights. |
| TikTok / LinkedIn Ads | pull insights | API + jobs | Phase 2-3; same normalized shape. |
| Chatwoot | bidirectional | Webhooks + API | Messages → client timeline; create/update contact. |
| n8n | output (trigger) and input (webhook) | Signed webhooks + API | n8n is the visual automation engine; it reacts to domain events. |
| Notion | outbound mirror | API | Notion stops being the truth; it syncs from Postgres for whoever still uses it. |
| Email | inbound and outbound | SMTP/IMAP + transactional provider | See §6.5 and §11. |

Golden rule: integrations react to events and write via domain use cases (which apply RLS and audit). They never touch tables directly.
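A minimal sketch of what an anti-corruption adapter might look like follows. Everything here is illustrative: `AdInsight` is an assumed shape for the normalized domain record, `ExternalAdRow` is a hypothetical provider payload (not a documented Meta or Google response), and `toAdInsight` is an invented name. The point is only that provider field names and string-typed numbers stop at the adapter.

```typescript
// Normalized domain shape (an assumption; the real ad_insights columns may differ).
interface AdInsight {
  tenantId: string;
  provider: "meta" | "google";
  campaignId: string;
  date: string;
  spend: number;
  revenue: number;
  roas: number | null; // null when there was no spend
}

// Hypothetical external row: provider-specific names, numbers delivered as strings.
interface ExternalAdRow {
  campaign_id: string;
  date_start: string;
  spend: string;
  purchase_value: string;
}

// Translates one external row into the domain's terms and validates the numbers.
function toAdInsight(tenantId: string, provider: "meta" | "google", row: ExternalAdRow): AdInsight {
  const spend = Number(row.spend);
  const revenue = Number(row.purchase_value);
  if (!Number.isFinite(spend) || !Number.isFinite(revenue)) {
    throw new Error(`non-numeric insight for campaign ${row.campaign_id}`);
  }
  return {
    tenantId,
    provider,
    campaignId: row.campaign_id,
    date: row.date_start,
    spend,
    revenue,
    roas: spend > 0 ? revenue / spend : null,
  };
}

const normalized = toAdInsight("tenant-a", "meta", {
  campaign_id: "c-1",
  date_start: "2026-06-01",
  spend: "50.00",
  purchase_value: "200.00",
});
```

If a provider renames a field, only the external interface and the adapter change; reporting code that reads `roas` or `spend` is untouched.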


6. Infrastructure and scaling

6.1 Starting point and philosophy

Today: VPS Vidot (IONOS, Ubuntu 24.04, 4 vCPU / 8 GB / 232 GB), Coolify orchestrating containers, Forgejo (git), Next.js web, docs, + n8n and Chatwoot on separate hosts. The goal is to scale vertically and then horizontally without rebuilding, measuring before spending.

6.2 Target components

┌──────────────────────────── VPS / NODE(S) (Coolify) ─────────────────────────────┐
│  Traefik (TLS)                                                                    │
│  apps: web · dashboard · portal · docs · api · worker(s)                          │
│  PostgreSQL 17  ──  PgBouncer (pooling)  ──  read replica(s) (when needed)        │
│  Redis 7 (cache + BullMQ queues + rate-limit + sessions)                          │
│  MinIO (self-host S3) for assets  ──►  (CDN in front to serve media)              │
│  Observability: OpenTelemetry Collector → SigNoz (logs+metrics+traces)            │
└──────────────────────────────────────────────────────────────────────────────────┘

6.3 Scaling plan per component (with approximate numbers)

| Component | Today | Signal to scale | Scaling action | Realistic ceiling |
| --- | --- | --- | --- | --- |
| App (Next.js/API) | 1 container | Sustained CPU >70% or high p95 latency | Add replicas (stateless) behind Traefik | Tens of replicas; horizontal is trivial |
| Postgres | 1 shared 8GB instance | High RAM/IO; slow queries | (1) more vCPU/RAM on the VPS → (2) VPS dedicated to Postgres → (3) read replicas → (4) partitioning/Citus | 1 well-tuned node handles hundreds of small/medium tenants |
| PgBouncer | - | >~100 concurrent connections | Introduce it right away (pool in transaction mode); Postgres must not see thousands of connections | Thousands of logical clients over few real connections |
| Redis | - | Need for cache/queues (already in MVP) | Dedicated instance; later replica/persistence | Very high for this use |
| Queues/jobs | - | Ad sync, AI, email jobs | BullMQ (Redis) + apps/worker; scale the number of workers | Tens of thousands of jobs/min with several workers |
| Assets | VPS volume | Media growth / bandwidth | MinIO + CDN in front (Bunny/Cloudflare) | TB without touching the app |
| Orchestration | Coolify, 1 node | Need for >1 node / HA | Coolify multi-node (supports several servers) → if real HA is needed, Nomad or k3s | See §6.4 |

Estimated capacity of a single reasonable node

A VPS of 8 vCPU / 16–32 GB with tuned Postgres + PgBouncer + Redis + a replicated app comfortably serves 100+ tenants for an agency (this is not mass-consumer SaaS: the traffic is the team + clients, not millions of users). The bottleneck will arrive at Postgres (IO/RAM) long before the app. That is why the first scaling investment is to move Postgres to its own node and give it RAM.

6.4 When to migrate from Coolify to Kubernetes/Nomad?

Honest recommendation: do NOT migrate to Kubernetes unless there is a real need. For 2 people, k8s is a huge operational cost. Path:

  1. Today → ~50 tenants: Coolify, one node. Vertical scaling. Enough.
  2. ~50–150 tenants / need to isolate Postgres: Coolify multi-node (app on one node, Postgres on another). Coolify manages several servers natively.
  3. Need for HA, fine-grained autoscaling, or many services: evaluate Nomad (simpler than k8s, fits a small team) or k3s. Only if the numbers call for it.

Since the contract (containers + Postgres + Redis + S3) does not change, this migration is one of orchestration, not architecture. That is exactly "scaling without rebuilding."

6.5 Environments and CI/CD

  • Environments: dev (local, Docker Compose) · staging (Coolify, staging.* domain) · prod (Coolify). Same images promoted.
  • CI/CD: Forgejo Actions (we already have Forgejo): lint + typecheck + tests (incl. RLS isolation test) + build → migrate DB → deploy via the Coolify API. Auto-deploy on merge to main (pending wiring, see runbook 03).
  • Migrations run in the pipeline before booting the new version.

7. Security, identity, and compliance

7.1 Single identity

  • A single identity (users) for the whole ecosystem. A user belongs to one or several organizations via memberships with a role.
  • Organization types: agency (Feedback Studios) and client (each client). The agency has "bridge" memberships that let it operate over client tenants with audited permissions.
  • Auth: httpOnly cookie sessions + (future) SSO/OAuth2 (Google login for the team). For portal clients: email+password + magic link / passkeys.
  • 2FA/MFA mandatory for agency roles.

7.2 Multi-tenant RBAC + ABAC

  • RBAC (roles): agency_owner, agency_member, client_admin, client_viewer, etc.
  • ABAC (attributes): permissions per project and per resource (e.g., an agency_member only sees assigned projects; a client_viewer only sees published reports of THEIR tenant). Rules evaluated in the packages/auth layer, in addition to RLS in the DB.
  • Defense in depth: RLS (DB) + permission check (app) + tenant validation on every request (SET LOCAL app.tenant_id). Three layers; none trusts the other.
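The application-layer check (the middle of the three layers) might combine role and attributes along these lines. This is a sketch: `Actor`, `Report`, and `canViewReport` are hypothetical names, and only the rules for agency_member and client_viewer come from the text above; the rules for agency_owner and client_admin are assumptions made to complete the example.

```typescript
type Role = "agency_owner" | "agency_member" | "client_admin" | "client_viewer";

interface Actor {
  role: Role;
  tenantIds: string[]; // tenants the actor has a membership in
  assignedProjectIds: string[];
}

interface Report {
  tenantId: string;
  projectId: string;
  published: boolean;
}

// Deny by default: every branch that grants access is explicit.
function canViewReport(actor: Actor, report: Report): boolean {
  // Tenant membership is checked before any role rule.
  if (!actor.tenantIds.includes(report.tenantId)) return false;
  switch (actor.role) {
    case "agency_owner":
      return true; // assumption: owners see everything in tenants they can access
    case "agency_member":
      return actor.assignedProjectIds.includes(report.projectId);
    case "client_admin":
      return true; // assumption: admins see drafts of their own tenant
    case "client_viewer":
      return report.published;
    default:
      return false;
  }
}

const draftReport: Report = { tenantId: "tenant-a", projectId: "p-1", published: false };
const viewer: Actor = { role: "client_viewer", tenantIds: ["tenant-a"], assignedProjectIds: [] };
const assignedMember: Actor = { role: "agency_member", tenantIds: ["tenant-a"], assignedProjectIds: ["p-1"] };
const outsider: Actor = { role: "client_admin", tenantIds: ["tenant-b"], assignedProjectIds: [] };
```

Even if this function had a bug, the RLS policy would still stop a query scoped to tenant-b from returning tenant-a rows, which is the point of layering the checks.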

7.3 Secrets, encryption, audit

  • Secrets out of git (already policy). Environment variables managed by Coolify; at scale, consider a lightweight vault (Infisical self-host or OpenBao) for rotation.
  • Encryption in transit (TLS everywhere) and at rest for sensitive data: each tenant's ad tokens encrypted at the column level (envelope encryption), encrypted backups.
  • Immutable audit (audit_log): every sensitive action (login, permission change, agency access to client data, exports, financial changes) is recorded with actor, tenant, IP, before/after.

7.4 GDPR / privacy (EU + US clients)

  • Legal basis and residency: data on an EU VPS (IONOS); define the controller and processor roles for each processing activity (the agency is the processor with respect to its clients' data).
  • ARCO/GDPR rights built in: export and deletion by tenant/contact implemented as a capability (not as a manual favor). Soft-delete + scheduled purge per retention policy.
  • Minimization and retention: policies per data type (messages, insights, logs) with TTL and partitions that get dropped.
  • DPA and consent: record lead consent (source, timestamp). The web's cookies/tracking compliant with regulation (self-host analytics like Umami/Plausible already in place).

7.5 Backups and disaster recovery

  • Postgres backups: automatic daily (Coolify) + WAL archiving for point-in-time recovery as the data's value grows.
  • Off-site (justified cost exception): copy encrypted backups to cheap external object storage (Backblaze B2 / S3, cents/GB). A backup that lives only on the same VPS is not a backup. This small recurring cost is honestly justifiable.
  • DR target: RPO ≤ 24h (improvable to minutes with PITR), RTO ≤ 4h. A tested restoration runbook (real periodic restore, not just "the backup exists").
  • Assets (MinIO): replication/copy to external storage.

8. Observability and reliability

Decision: instrument everything with OpenTelemetry (logs + metrics + traces) and centralize in SigNoz self-hosted.

Reason: for a small team, SigNoz offers a unified stack (replacing Loki+Tempo+Mimir+Grafana in a single product) native to OpenTelemetry and with no self-host cost. By instrumenting with OTel, there is no lock-in: if we migrate one day to Grafana LGTM or to a managed offering, the instrumentation is preserved. (Valid alternative: OpenObserve, a single binary with S3 storage, even lighter; or the classic Grafana LGTM if you want the most battle-tested option. Any of them works as long as the base is OTel.)

8.2 What we measure

  • Structured logs (JSON) with correlated tenant_id/request_id/trace_id.
  • Metrics: p50/p95/p99 latency per endpoint, errors, throughput, BullMQ queue depth, ad sync lag, job health.
  • Traces: request → domain → DB → external integration, to debug end-to-end.
  • Error tracking: Sentry self-host (or GlitchTip, lighter) to group exceptions with context. (Possible cost exception: managed Sentry free tier if self-host weighs too heavily on the team — see §11.)

8.3 SLOs, alerts, health checks

  • Initial SLOs: API availability 99.5%; p95 < 500 ms on read endpoints; ad sync completed < 1h after the day's close.
  • Alerts (to Chatwoot/Telegram/email): health check failure, error rate > threshold, stuck queue, failed backup, disk/RAM at the limit, certificate about to expire.
  • Health checks: /healthz (liveness) and /readyz (readiness: DB + Redis + S3) that Coolify/Traefik query.
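To see what the 99.5% availability SLO allows in practice, the arithmetic can be written out as a small helper. The 30-day measurement window is an assumption (the SLO above does not state one), and `errorBudgetMinutes` is a hypothetical name.

```typescript
// Minutes of allowed unavailability for a given SLO over a rolling window.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (windowMinutes * (100 - sloPercent)) / 100;
}

// 30 days = 43,200 minutes; 0.5% of that is 216 minutes (3.6 hours).
const monthlyBudget = errorBudgetMinutes(99.5, 30);
const monthlyBudgetHours = monthlyBudget / 60;
```

Under that assumed window, a single incident that takes the full 4 h RTO from §7.5 (240 minutes) would by itself exceed the month's budget of 216 minutes, so the two targets may be worth reviewing together.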

9. AI in the platform

Principle: AI where it adds measurable value, not because it is fashionable. Every AI feature has a clear "job" and can be turned off.

9.1 Prioritized use cases (highest to lowest ROI)

| Case | Value | How |
| --- | --- | --- |
| Conversation summary (Chatwoot/email) | Saves hours; instant client context | A job that summarizes threads and attaches them to the timeline. |
| Lead scoring/prioritization | Sell more by focusing effort | Model + rules over CRM data; writes lead.score. |
| Narrated reporting | Differentiator: the dashboard "explains" the ROAS | AI drafts the "why" behind the period's figures. |
| Copy/creative generation (first draft) | Speeds up production | Brief → copy/image variants; human review always. |
| Internal assistant (ask the data) | "Which clients dropped ROAS this month?" | Natural-language query over the API with the user's permissions. |
| Client portal assistant | Self-service | Scoped to THEIR tenant, read-only. |

9.2 How it integrates (AI architecture)

  • "AI jobs" pattern: AI tasks are background jobs (BullMQ), not synchronous calls in the request. An ai_jobs table with status, cost, and auditable result.
  • Models: the Claude API (Anthropic) for quality reasoning/summaries/copy; the option of local models (Ollama on the VPS) for cheap/sensitive tasks when the hardware allows. Abstracted behind packages/integrations/ai to switch providers without touching the domain.
  • Data privacy: the AI respects the tenant (only sees that tenant's data); prompts and outputs are recorded in ai_jobs (audit); sensitive data is redacted/anonymized before leaving; client consent to process their data with AI.
  • Cost: budget per tenant and per job; result caching; use small/local models for the trivial and large models only where quality matters.
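The per-tenant and per-job budget could be enforced by a guard that runs before a job is enqueued. The sketch below is one possible shape, with hypothetical names throughout (`AiBudget`, `checkAiBudget`, and the cent-based fields); where the spent amount comes from, for example a sum over ai_jobs, is assumed rather than shown.

```typescript
interface AiBudget {
  monthlyLimitCents: number; // tenant-wide budget for the period
  spentCents: number; // already consumed in the period
  perJobLimitCents: number; // ceiling for any single job
}

type BudgetDecision =
  | { allowed: true }
  | { allowed: false; reason: "per_job_limit" | "tenant_budget" };

// Decides whether a job with the given estimated cost may be enqueued.
function checkAiBudget(budget: AiBudget, estimatedCostCents: number): BudgetDecision {
  if (estimatedCostCents > budget.perJobLimitCents) {
    return { allowed: false, reason: "per_job_limit" };
  }
  if (budget.spentCents + estimatedCostCents > budget.monthlyLimitCents) {
    return { allowed: false, reason: "tenant_budget" };
  }
  return { allowed: true };
}

const tenantBudget: AiBudget = { monthlyLimitCents: 2000, spentCents: 1950, perJobLimitCents: 100 };
const smallJob = checkAiBudget(tenantBudget, 40);
const overBudgetJob = checkAiBudget(tenantBudget, 80);
const oversizedJob = checkAiBudget(tenantBudget, 150);
```

Returning a reason rather than a bare boolean lets the dashboard tell the team whether to shrink the job or raise the tenant's budget.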

9.3 Claude Code as part of the operational AI

The system itself is designed so that Claude Code can operate it: declarative schemas, OpenAPI always up to date, versioned migrations, isolation tests. This turns AI into a development lever, not just a product feature.


10. Phased roadmap

Realistic for 2 people + Claude Code. Each phase leaves something sellable and does not break what came before. It builds on the existing plan (PLATFORM-PLAN.md).

Phase 0 — Foundations (already in progress / immediate)

  • Public web on Coolify (done/in progress). Forgejo + basic CI/CD.
  • Add: PgBouncer + Redis + MinIO to the Coolify stack; minimal OTel+SigNoz; encrypted off-site backups. Unblocks: everything else (data, queues, assets, observability).

Phase 1 — Multi-tenant core + Identity + CRM (internal MVP)

  • packages/db with base schema, enforced RLS + isolation test in CI.
  • Payload (Postgres) + multi-tenant plugin for admin/auth/CRUD.
  • Modules identity, crm, projects, comms. apps/dashboard progressively replaces the PHP dashboard and the use of Notion as the source of truth.
  • Chatwoot connected to the unified timeline.
  • Unblocks: a single source of truth for clients/leads/projects.

Phase 2 — Reporting + Client portal (sellable MVP)

  • Meta + Google Ads integration (sync jobs → ad_insights).
  • Per-client dashboards (ROAS, KPIs) + basic narrated reporting.
  • apps/portal white-label: the client sees reports and approves creatives.
  • Public API v1 + OpenAPI/Scalar published in docs.
  • Unblocks: the "growth partners with data" pitch and client self-service. It is the first release that can be sold as a product.

Phase 3 — Automations + Billing + applied AI

  • Event bus (outbox+relay) wired to n8n; per-event workflows.
  • Billing/contracts (quotes, invoices, e-sign with Documenso, payments).
  • AI: lead scoring, summaries, internal assistant.
  • TikTok/LinkedIn Ads; assets/DAM with versions.
  • Unblocks: operation with almost no manual work; upsell.
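The outbox + relay mentioned in this phase can be sketched as follows. The row shape, the Outbox class, and the relayOnce function are hypothetical names, and publish is a synchronous callback here only to keep the sketch self-contained; a real relay would presumably make an asynchronous HTTP call to an n8n webhook and read rows from Postgres.

```typescript
// Hypothetical outbox row; the ADR does not define the table's columns.
interface OutboxRow {
  id: number;
  tenantId: string;
  type: string;              // e.g. "invoice.paid"
  payload: unknown;
  publishedAt: number | null;
  attempts: number;
}

class Outbox {
  rows: OutboxRow[] = [];

  // In the real system this insert would share a transaction with the
  // domain write, so an event exists if and only if the change committed.
  append(tenantId: string, type: string, payload: unknown): OutboxRow {
    const row: OutboxRow = {
      id: this.rows.length + 1, tenantId, type, payload,
      publishedAt: null, attempts: 0,
    };
    this.rows.push(row);
    return row;
  }
}

// One relay tick: deliver pending rows in insertion order and stop at the
// first failure so ordering is preserved. Returns how many were delivered.
function relayOnce(
  outbox: Outbox,
  publish: (row: OutboxRow) => boolean,
  now: number,
  batchSize = 50,
): number {
  const pending = outbox.rows.filter((r) => r.publishedAt === null).slice(0, batchSize);
  let delivered = 0;
  for (const row of pending) {
    row.attempts += 1;
    if (!publish(row)) break; // left unpublished; retried on the next tick
    row.publishedAt = now;
    delivered += 1;
  }
  return delivered;
}
```

Delivery under this scheme is at-least-once: if the relay crashes after publishing but before marking the row, the event is sent again, so consumers (for example n8n workflows) should deduplicate on the row id.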

Phase 4 — Scale and polish

  • Move Postgres to its own node + read replica; partition large tables.
  • Coolify multi-node if needed; HA where the SLO requires it.
  • Advanced attribution / warehouse if volume justifies it.
  • Template marketplace (zero marginal cost).

Dependencies (what unblocks what)

Phase 0 (infra) ──► Phase 1 (data+identity+CRM) ──► Phase 2 (reporting+portal = SELLABLE)
                                                        │
                                                        ▼
                                              Phase 3 (automation+billing+AI)
                                                        │
                                                        ▼
                                                   Phase 4 (scale/HA/advanced)

11. Honest risks and trade-offs

| Risk / tension | Reality | Mitigation / decision |
| --- | --- | --- |
| "Self-host without fees" vs deliverable email | Mail sent from a 100% self-hosted SMTP server ends up in spam; invoice/portal email cannot fail. | Justified exception: use a transactional provider for critical outbound email. Options: cheap managed (Resend/Postmark, pay-per-use, no per-seat) or serious self-host (Postal/Maddy + IP with reputation). Recommendation: low-cost transactional provider for deliverability; self-host only if you take on maintaining IP reputation. |
| Backups only on the VPS | If the VPS dies, the backups die with it. | Justified exception: encrypted off-site backups in external object storage (B2/S3), cents/GB. Non-negotiable. |
| Self-hosted Sentry/observability is heavy | Self-hosted SigNoz/Sentry consume RAM and maintenance time. | Start light (GlitchTip/OpenObserve). If that still proves too heavy, the managed Sentry free tier is acceptable (not abusively per-seat). OTel avoids lock-in. |
| Poorly implemented RLS = data leak | The biggest risk of the chosen model. | CI that requires RLS on every table with tenant_id + A/B isolation tests + FORCE RLS + an app role without BYPASSRLS. Three layers (RLS+RBAC+SET LOCAL). |
| Lock-in to Payload | If Payload becomes limiting, migrating hurts. | Postgres underneath (not Mongo), critical data in our own tables with Drizzle, public API as our own contract independent of Payload. |
| Team of 2 + big ambition | Risk of over-engineering and not finishing. | Modular monolith (not microservices), sellable phases, Claude Code as a multiplier, "build what is necessary, do not gold-plate" (aligned with CLAUDE.md). |
| Postgres as the single point | Bottleneck and SPOF. | PgBouncer right away; read replica + PITR as it grows; tested DR. The model does not change when scaling. |
| Hidden cost of self-host | Operation time = real cost even without a fee. | Consciously assumed; observability + runbooks + automation with Claude Code reduce that time. |
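The CI gate described in the RLS row can be sketched as a pure check over catalog metadata. The TableInfo and RoleInfo shapes and the findRlsViolations name are assumptions for illustration; the flags they carry correspond to pg_class.relrowsecurity, pg_class.relforcerowsecurity, pg_roles.rolbypassrls, and pg_roles.rolsuper in PostgreSQL, which a CI step could query and feed into a function like this one.

```typescript
// Hypothetical input shapes, populated from the PostgreSQL catalogs in CI.
interface TableInfo {
  name: string;
  hasTenantId: boolean;  // the table has a tenant_id column
  rlsEnabled: boolean;   // pg_class.relrowsecurity
  rlsForced: boolean;    // pg_class.relforcerowsecurity
}

interface RoleInfo {
  name: string;
  bypassRls: boolean;    // pg_roles.rolbypassrls
  superuser: boolean;    // pg_roles.rolsuper (superusers always bypass RLS)
}

// Returns human-readable violations; CI would fail when the list is non-empty.
function findRlsViolations(tables: TableInfo[], appRole: RoleInfo): string[] {
  const violations: string[] = [];
  for (const t of tables) {
    if (!t.hasTenantId) continue; // global tables are out of scope here
    if (!t.rlsEnabled) {
      violations.push(`${t.name}: row-level security is not enabled`);
    } else if (!t.rlsForced) {
      // Without FORCE, the table owner is not subject to the policies.
      violations.push(`${t.name}: FORCE ROW LEVEL SECURITY is not set`);
    }
  }
  if (appRole.bypassRls || appRole.superuser) {
    violations.push(`role ${appRole.name}: must not be superuser or have BYPASSRLS`);
  }
  return violations;
}
```

This static check complements the A/B isolation tests rather than replacing them: it catches a table that was added without policies, while the isolation tests verify that the policies actually filter rows.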

Honest recurring total at medium scale: VPS (already exists, perhaps a larger one or a second node for Postgres) + domain + transactional email (pay-per-use) + off-site backups (cents/GB). On the order of tens of €/month, no per-seat, no feature-gating. Consistent with the owner's philosophy and, at the same time, robust at scale.


12. Executive summary of decisions

  1. Multi-tenancy = row-per-tenant + enforced RLS in PostgreSQL. Same schema for all, isolation in the DB (not in the code), trivial agency reporting, migrations only once, marginal cost per client ~0. Escape route to schema/DB-per-tenant only for a future enterprise.
  2. Service architecture = modular monolith + workers, with clearly bounded domain modules and events (outbox). No microservices for a team of 2. Ready to decompose if one day needed.
  3. Payload CMS (on Postgres) as the admin/auth/CMS backbone, but with the critical financial/analytical data in our own tables with RLS and a contractual public API (OpenAPI/Zod) independent of Payload → speed without lock-in.
  4. API-first + event-driven: public REST/JSON (autogenerated OpenAPI 3.1, living docs with Scalar), GraphQL scoped to the internal dashboard, idempotency, cursor pagination, signed webhooks. Integrations behind an anti-corruption layer.
  5. Infra: Coolify one node → multi-node, vertical scaling first. PgBouncer + Redis + MinIO + CDN from the MVP. Move Postgres to its own node and a read replica when the numbers call for it. k8s/Nomad only if real HA requires it. Scaling = more machine/nodes, not rebuilding.
  6. Layered security: RLS + RBAC/ABAC + SET LOCAL app.tenant_id per request; secrets out of git; per-column token encryption; immutable audit; GDPR built in (export/deletion by tenant); encrypted off-site backups with tested DR.
  7. OpenTelemetry observability → SigNoz self-host (no lock-in), GlitchTip/Sentry error tracking, SLOs and alerts in place before the first paying client.
  8. AI as auditable, tenant-aware "jobs": summaries, lead scoring, narrated reporting, copy generation (with human review). Claude API + local option (Ollama), behind a provider abstraction.
  9. Honest, bounded cost exceptions: deliverable transactional email and off-site backups (and optionally error tracking). Everything else, self-host. Tens of €/month, no per-seat.
  10. Roadmap in sellable phases: Phase 2 (reporting + client portal) is the first commercializable release; no phase breaks the previous one.
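Decision 6's "SET LOCAL app.tenant_id per request" can be sketched as a wrapper that brackets a request's statements in one transaction. The Statement type and tenantScoped function are hypothetical. The sketch uses set_config(name, value, true), the function form of SET LOCAL, because a plain SET statement does not accept bind parameters; the policy shown in the comment is likewise an assumed example, not the ADR's actual policy.

```typescript
// Hypothetical statement shape, similar to what a Postgres driver accepts.
interface Statement {
  text: string;
  values: unknown[];
}

// Wraps a request's statements so the tenant setting lives exactly as long
// as the transaction. An RLS policy could then read it, for example:
//   USING (tenant_id::text = current_setting('app.tenant_id'))
function tenantScoped(tenantId: string, work: Statement[]): Statement[] {
  if (tenantId.trim() === "") {
    throw new Error("tenantId is required for every tenant-scoped query");
  }
  return [
    { text: "BEGIN", values: [] },
    // Third argument true = local to this transaction, like SET LOCAL.
    { text: "SELECT set_config('app.tenant_id', $1, true)", values: [tenantId] },
    ...work,
    { text: "COMMIT", values: [] },
  ];
}

// Usage: the work list never mentions the tenant; RLS filters the rows.
const plan = tenantScoped("tenant-a", [
  { text: "SELECT id, name FROM clients ORDER BY name", values: [] },
]);
```

Because the setting is transaction-local, it is discarded at COMMIT or ROLLBACK, which is what makes the pattern compatible with a pooler such as PgBouncer in transaction mode. A real executor would also issue ROLLBACK when any statement fails.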

Each one in one line

  • Recommended multi-tenant model: shared row-per-tenant with tenant_id and enforced Row-Level Security in PostgreSQL (one schema for all), with an escape path to schema/DB-per-tenant for a future enterprise.
  • Recommended service architecture: modular monolith in TypeScript (Next.js + Payload on Postgres) with domain modules, events via outbox, and background workers, deployed by Coolify and ready to decompose if needed.

