CYBERSIREN — SYSTEM ARCHITECTURE SPECIFICATION

Document: ARCH-SPEC-v2.2
Date: 2026-03-15
Status: Implementation Baseline — Approved Architecture with Tracked Deltas
Classification: Internal / Graduation Project
Diagram legend: Service (thick border = entry point) · Kafka Topic · Database Operation
R = Read · W = Write · R/W = Both
All scores: integer 0–100. All confidence values: float 0.0–1.0.

1. Complete Pipeline — Step-by-Step Data Flow

Canonical Identifier Strategy:
email_id (UUIDv7) = logical identifier. Assigned by SVC-01, carried in all Kafka messages and Redis keys, used as partition key.
internal_id (BIGSERIAL) = physical DB primary key. Auto-generated on INSERT. Part of composite PK (internal_id, fetched_at) required by partitioning.
fetched_at = partition key for the emails table. Part of composite PK. Not part of logical identity.
message_id (RFC 5322) = deduplication key scoped per org (org_id, message_id). Not globally unique.
Mapping: emails.scored message includes both email_id and (internal_id, fetched_at) so SVC-08 can write to DB without an extra lookup.
Data Role Taxonomy: PostgreSQL = source of record (authoritative state). Redis = ephemeral coordination/cache (lossy, rebuildable from DB). Kafka = transport/replay (bounded retention 24h–7d). Materialized Views = read optimization (never the source of truth). Object Storage (S3/MinIO) = binary retention (attachment files, 90-day TTL).

Each numbered step shows: the service performing work, the Kafka topic it produces to (if any), and the database operations (if any). Every field in every message is listed.

Step 1 — EMAIL INGESTION SERVICE (Go) — SVC-01
Receives raw email from one of 5 adapters (see §2.1). Normalizes to internal ParsedEmail envelope. Assigns email_id (UUIDv7), org_id (from API key auth), fetched_at (now). Performs idempotent dedup: checks (org_id, message_id) in Redis set (TTL 7d) → if present, discard. On cache miss, falls back to DB unique constraint uq_emails_org_message_id.
Message schema — emails.raw:
{ email_id, org_id, fetched_at, raw_rfc822: <full email bytes base64>, source_adapter: "gmail"|"outlook"|"imap"|"api"|"custom", api_key_id }
Step 2 — EMAIL PARSER SERVICE (Go) — SVC-02
Consumes emails.raw. Parses MIME structure. Produces 5 output topics (fan-out). Extracts: all URLs from HTML (href/src/action/meta-refresh) + plain text (regex). Hashes each attachment (SHA256 + MD5 + SHA1), computes Shannon entropy, detects actual file type via magic bytes. Extracts all header authentication results. Strips HTML → clean plain text for NLP. Builds Analysis Plan listing which downstream scores to expect based on email contents. Writes initial email record + junction tables to PostgreSQL.
Output message schemas from Email Parser:

analysis.urls (6 partitions · 24h · Key: email_id):
{ email_id, org_id, urls: [{ url: string, visible_text: string, position: "body"|"header", html_context: "href"|"src"|"action"|"plain_text" }] }

analysis.headers (6 partitions · 24h · Key: email_id):
{ email_id, org_id, sender_email, sender_domain, sender_name, reply_to_email, return_path, originating_ip, x_originating_ip, auth_spf: "pass"|"fail"|"softfail"|"none", auth_dkim: "pass"|"fail"|"none", auth_dmarc: "pass"|"fail"|"none", auth_arc, mailer_agent, in_reply_to, references_list: string[], sent_timestamp: int64, content_charset, precedence, list_id, hop_count: int, received_chain: [{from, by, timestamp}], vendor_security_tags: json, headers_json: json }

analysis.attachments (3 partitions · 24h · Key: email_id):
{ email_id, org_id, attachments: [{ sha256, md5, sha1, filename, content_type, content_id, disposition: "inline"|"attachment", size_bytes: int64, entropy: float64, detected_type: string (magic-byte) }] }

analysis.text (6 partitions · 24h · Key: email_id):
{ email_id, org_id, subject, sender_name, sender_domain, plain_text: string (HTML-stripped body), body_language: string (detected), word_count: int }

analysis.plans (6 partitions · 24h · Key: email_id):
{ email_id, org_id, expected_scores: ["url","header","attachment","nlp"] (subset based on email contents), url_count: int, attachment_count: int, has_body: bool, created_at: timestamp }
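The per-attachment entropy field that SVC-02 emits is a byte-histogram Shannon entropy; a self-contained sketch (not the parser's actual implementation):

```go
package main

import (
	"fmt"
	"math"
)

// shannonEntropy returns bits per byte (0.0–8.0) over a byte histogram.
// Values near 8 suggest compressed or encrypted content; SVC-05 treats
// entropy above 7.5 as a risk signal.
func shannonEntropy(data []byte) float64 {
	if len(data) == 0 {
		return 0
	}
	var counts [256]int
	for _, b := range data {
		counts[b]++
	}
	n := float64(len(data))
	h := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	fmt.Printf("%.2f\n", shannonEntropy([]byte("aaaaaaaa"))) // 0.00: single symbol
	uniform := make([]byte, 256)
	for i := range uniform {
		uniform[i] = byte(i)
	}
	fmt.Printf("%.2f\n", shannonEntropy(uniform)) // 8.00: maximal entropy
}
```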
Step 3a — URL ANALYSIS SERVICE (Go + Python subprocess) — SVC-03
Consumes analysis.urls. For each URL in the message:
(i) Redis cache check: key url_score:{sha256(url)} TTL 6h → if hit, reuse score, skip remaining steps.
(ii) TI feed match: lookup URL domain against ti_indicators table (indexed on indicator_value, indicator_type). If exact match with risk_score ≥ 80 → emit score 95, ti_matched=true, skip ML. Also checks enriched_threats to see if this URL was observed and enriched in a prior email.
(iii) Enrichment (concurrency limit: 5 goroutines): WHOIS (registrar, creation_date, expiry_date, name_servers), SSL cert (issuer, subject, valid_from, valid_to, self-signed check), DNS (A/MX/NS/CNAME), HTTP probe (status_code, redirect_count, final_url, response_time_ms), IP geo (country, asn, asn_name, isp, city).
(iv) ML inference: Enriched features → XGBoost model via Python subprocess (JSON stdin/stdout). Timeout: 5s. On timeout: score=50.
Per-email URL score = max(individual URL scores). Each URL gets its own entry in enrichment tables.
Message schema — scores.url:
{ email_id, org_id, component: "url", score: int 0-100, url_count: int, ti_blocked_count: int, ml_scored_count: int, cache_hit_count: int, riskiest_url: string, url_details: [{ url, domain, ti_matched: bool, ti_source: string|null, domain_age_days: int|null, has_ssl: bool, ssl_issuer: string|null, redirect_count: int, ml_score: int|null, is_shortened: bool, enrichment_available: bool }], processing_time_ms: int }
Step 3b — HEADER ANALYSIS SERVICE (Go) — SVC-04
Consumes analysis.headers. Produces header_risk_score (0–100) from three analysis dimensions:
(i) Authentication verification: SPF, DKIM, DMARC, ARC result evaluation. From/Reply-To/Return-Path mismatch detection.
(ii) Sender reputation: Domain age analysis (via WHOIS), TI matches on sender domain and originating IP, typosquatting detection (Damerau-Levenshtein distance to known brands), free-provider contextual signals.
(iii) Structural anomaly detection: HTML-only messages, hidden text elements, embedded forms, encoding anomalies, excessive received hops, timestamp drift.
Note: Signal categories are architecturally defined. Specific scoring weights and thresholds are preliminary heuristics subject to empirical calibration.
All signals implemented as configurable rules in the rules table. Each signal that fires creates a rule_hits row for auditability.
Message schema — scores.header:
{ email_id, org_id, component: "header", score: int 0-100, auth_sub_score: int, reputation_sub_score: int, structural_sub_score: int, fired_rules: [{ rule_id, rule_name, rule_version, score_impact: int, match_detail: json }], signals: { spf_result, dkim_result, dmarc_result, from_reply_to_match: bool, from_return_path_match: bool, domain_age_days: int|null, typosquat_target: string|null, typosquat_distance: int|null, is_free_provider: bool, hop_count: int, time_drift_hours: float }, processing_time_ms: int }
Step 3c — ATTACHMENT ANALYSIS SERVICE (Go) — SVC-05
Consumes analysis.attachments. For each attachment:
(i) Hash lookup: attachment_library.sha256 → if is_malicious=true, score=90.
(ii) External TI check (if VirusTotal API key configured): query hash → cache in enrichment_results provider="virustotal". TTL: 24h.
(iii) Heuristics: entropy > 7.5 → +20. Extension vs MIME mismatch (e.g. .pdf with application/x-msdownload) → +30. Dangerous extensions (.exe, .scr, .bat, .ps1, .vbs, .js, .wsf, .hta, .cmd) → +25. Macro-enabled Office (.docm, .xlsm, .pptm) → +20. Double extension (.pdf.exe) → +35.
Per-email attachment score = max(individual attachment scores).
Message schema — scores.attachment:
{ email_id, org_id, component: "attachment", score: int 0-100, attachment_count: int, malicious_count: int, attachment_details: [{ sha256, filename, content_type, size_bytes, entropy: float, is_malicious: bool, ti_source: string|null, extension_mismatch: bool, has_macros: bool, is_dangerous_extension: bool, individual_score: int }], processing_time_ms: int }
Step 3d — NLP ANALYSIS SERVICE (Python FastAPI + ONNX Runtime) — SVC-06
Consumes analysis.text. Produces content_risk_score (0–100) and structured facet outputs from 4 analysis dimensions:
(i) Urgency detection: Identifies false-urgency language patterns. Output: urgency_score 0.0–1.0.
(ii) Intent classification: Classifies email goal. Multi-class: credential_harvesting, malware_delivery, bec, scam, legitimate. Output: intent_label + confidence.
(iii) Brand impersonation: Detects brand impersonation through NER + sender identity analysis. Output: impersonation_score 0.0–1.0.
(iv) Deception language: Identifies manipulative linguistic patterns. Output: deception_score 0.0–1.0.
Note: Analysis facets are architecturally defined. Model architectures (candidate: DistilBERT, TinyBERT), training methodology, and score composition method are under active research.
Individual facet scores are always preserved alongside any composite score for transparency.
Timeout: 10s. On timeout: score=50. Stateless — no DB access.
Message schema — scores.nlp:
{ email_id, org_id, component: "nlp", score: int 0-100, facets: { urgency_score: float, intent_label: "credential_harvesting"|"malware_delivery"|"bec"|"scam"|"legitimate", intent_confidence: float, impersonation_score: float, impersonated_brand: string|null, deception_score: float }, model_versions: { urgency: string, intent: string, impersonation: string, deception: string }, processing_time_ms: int }
Step 4 — SCORE AGGREGATOR SERVICE (Go + Redis state) — SVC-07
Consumes 5 topics: analysis.plans + scores.url + scores.header + scores.attachment + scores.nlp.
Maintains per-email state in Redis: Key: aggregator:{email_id} — TTL 120s.
Design: Stateless workers with shared Redis state. Events for the same email may be processed by different instances. Redis is the single source of truth for aggregation state, not instance-local memory.
Function: Gather/synchronization barrier only — does NOT perform scoring. Collects all component scores and packages them into a single payload for the Decision Engine.
Completion check: After each score received, compare received keys to expected list. All present → package and publish to emails.scored. Not all → wait.
Timeout: 30s → emit partial result with partial_analysis: true and list of missing components.
Message schema — emails.scored:
{ email_id, org_id, internal_id: bigint, fetched_at: timestamp, url_score: int|null, header_score: int, attachment_score: int|null, nlp_score: int|null, partial_analysis: bool, missing_components: string[], component_details: { url: <full scores.url message>, header: <full scores.header message>, attachment: <full scores.attachment message>|null, nlp: <full scores.nlp message>|null }, aggregation_latency_ms: int, timeout_triggered: bool }
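The completion check in Step 4 reduces to a set comparison between the plan's expected components and the scores received so far. A pure-Go sketch of the state held in the aggregator:{email_id} hash, with the Redis plumbing omitted:

```go
package main

import "fmt"

// aggState mirrors the aggregator:{email_id} Redis hash: the expected
// component list from analysis.plans, plus the scores received so far.
type aggState struct {
	Expected []string
	Received map[string]int
}

// missing returns the components still outstanding; the email is ready to
// publish to emails.scored once the slice is empty. After the 30s timeout
// the same slice becomes the missing_components field of a partial result.
func (s aggState) missing() []string {
	var out []string
	for _, c := range s.Expected {
		if _, ok := s.Received[c]; !ok {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	st := aggState{
		Expected: []string{"url", "header", "nlp"},
		Received: map[string]int{"header": 12, "url": 95},
	}
	fmt.Println(st.missing()) // [nlp]: keep waiting
	st.Received["nlp"] = 30
	fmt.Println(len(st.missing()) == 0) // true: publish emails.scored
}
```

Because the state lives in Redis rather than instance memory, any aggregator worker can run this check after consuming any score event.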
Step 5 — DECISION ENGINE / VERDICT SERVICE (Go) — SVC-08
Consumes emails.scored. Terminal processing block — responsible for final scoring, verdict, campaign management, and persistence:
(i) Final score computation: Receives individual component scores (url, header, attachment, nlp) from Aggregator. Combines into a single risk_score (0–100). Combination method is under research — must be explainable; individual component scores always preserved.
(ii) Rule engine: Load active rules from rules table (60s cache). Each firing rule adjusts score by score_impact (-100 to +100).
(iii) Verdict: Map final score to label. Preliminary thresholds (subject to calibration): 0–25 → benign | 26–50 → suspicious | 51–75 → phishing | 76–100 → phishing (high) or malware. Confidence computed separately from risk score — reflects label certainty, not severity.
(iv) Campaign management: fingerprint = SHA256(sender_domain|url_domain|subject_template|intent). UPSERT into campaigns. SimHash near-duplicate detection (per-org scoped via Redis).
(v) Campaign-informed scoring: Campaign history (avg risk_score, email count, analyst verdicts) serves as an additional signal for future emails matching the same campaign. Specific mechanism under research.
(vi) All DB writes within a single PostgreSQL transaction.
Message schema — emails.verdict:
{ email_id, org_id, verdict_label: "benign"|"suspicious"|"phishing"|"malware"|"spam"|"unknown", confidence: float 0-1, risk_score: int 0-100, header_risk_score: int, content_risk_score: int, url_risk_score: int, attachment_risk_score: int, campaign_id: int|null, campaign_fingerprint: string|null, is_new_campaign: bool, fired_rules: [{ rule_id, rule_name, score_impact }], verdict_source: "model"|"rule", model_version: string, partial_analysis: bool, processing_time_total_ms: int }
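The preliminary thresholds in (iii) map to a simple band lookup. In this sketch the top band returns a single phishing_high label; the spec leaves that band as "phishing (high) or malware" depending on component evidence, so the label choice is an assumption:

```go
package main

import "fmt"

// verdictLabel maps the final 0–100 risk score onto the preliminary
// Step 5 bands. These thresholds are explicitly subject to calibration,
// and confidence is computed separately from this severity mapping.
func verdictLabel(score int) string {
	switch {
	case score <= 25:
		return "benign"
	case score <= 50:
		return "suspicious"
	case score <= 75:
		return "phishing"
	default:
		// 76–100: "phishing (high)" or "malware" depending on evidence;
		// a single placeholder label is used here.
		return "phishing_high"
	}
}

func main() {
	for _, s := range []int{10, 40, 60, 90} {
		fmt.Println(s, verdictLabel(s))
	}
}
```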
Step 6a — NOTIFICATION SERVICE (Go) — SVC-09
Consumes emails.verdict. If risk_score ≥ threshold OR verdict_label ∈ {phishing, malware} → send alerts.
Channels: email to org admins, webhook POST (SIEM/SOAR), Slack/Teams.
Rate limit: max 1 alert per campaign_id per org per hour (Redis counter: notif:{org_id}:{campaign_id} TTL 3600s).
Step 6b — API / DASHBOARD SERVICE (Go REST + React SPA + WebSocket) — SVC-10
Consumes emails.verdict internally for WebSocket live feed.
REST API: /api/v1/emails, /campaigns, /rules (CRUD), /api-keys (CRUD), /stats (MVs), /feeds, /verdicts (analyst POST).
Auth: JWT + API keys. Scoped per org via org_id.
Background — TI FEED SYNC SERVICE (Go) — SVC-11
Independent background service, not in the main pipeline.
Runs on schedule (default: every 6h, configurable per feed). For each enabled feed: fetch → parse (CSV/JSON/STIX) → deduplicate → bulk UPSERT into ti_indicators.
After all feeds: REFRESH MATERIALIZED VIEW CONCURRENTLY for all 5 views. Update Redis TI cache.
Feeds: PhishTank, OpenPhish, URLhaus, MalwareBazaar (hashes → attachment_library), Abuse.ch ThreatFox, custom feeds.

2. Service Detail Sheets

2.1 Email Ingestion Service (SVC-01) — Adapter Specifications

Adapter — Protocol · Authentication · Delivery Mechanism · Polling Interval / Trigger · Notes

Gmail — HTTPS (Gmail API v1) · OAuth2 (offline access, scope: gmail.readonly) · Google Pub/Sub push notification → poll messages.get · Event-driven (Pub/Sub) + 5-min fallback poll · Requires a GCP project. Watch request expires every 7 days — auto-renew.
Outlook — HTTPS (Microsoft Graph API v1.0) · OAuth2 (app-level or delegated, scope: Mail.Read) · Change notification subscription (webhook) · Event-driven (webhook) + 5-min fallback poll (delta query) · Subscription max lifetime 3 days — auto-renew. Supports shared mailboxes.
IMAP — IMAP4rev1 (TLS) · Username/password or OAuth2 XOAUTH2 · IDLE command (push) or poll INBOX · IDLE (instant) + reconnect on 29-min IDLE timeout · Generic — works with any IMAP provider. Least metadata available.
API Upload — HTTPS (REST) · API key (header: X-API-Key) · POST /api/v1/scan — multipart/form-data or raw EML body · On-demand (user-initiated) · Primary interface for testing and SIEM forwarding. Rate limit: 100 req/min/org.
Custom — Implements the EmailSource interface · User-defined authentication, delivery, and trigger · Interface contract: Fetch() → []RawEmail, Acknowledge(id), HealthCheck() → error
EmailSource interface (Go):
type EmailSource interface {
  Name() string
  Fetch(ctx context.Context, since time.Time) ([]RawEmail, error)
  Acknowledge(ctx context.Context, ids []string) error
  HealthCheck(ctx context.Context) error
}
type RawEmail struct { MessageID string; RFC822 []byte; ReceivedAt time.Time; SourceMetadata json.RawMessage }

3. Kafka Topic Registry

Topic — Partitions · Retention · Key · Producer(s) · Consumer Group(s) · Avg Message Size · Expected Throughput

emails.raw — 6 · 48h · email_id · SVC-01 (Ingestion) · cg-parser · 50–500 KB · ~1 msg/s steady, burst to 50/s
analysis.urls — 6 · 24h · email_id · SVC-02 (Parser) · cg-url-analysis · 1–10 KB · = emails.raw rate
analysis.headers — 6 · 24h · email_id · SVC-02 (Parser) · cg-header-analysis · 2–5 KB · = emails.raw rate
analysis.attachments — 3 · 24h · email_id · SVC-02 (Parser) · cg-attachment-analysis · 0.5–2 KB · ~60% of emails.raw rate (not all emails have attachments)
analysis.text — 6 · 24h · email_id · SVC-02 (Parser) · cg-nlp-analysis · 1–20 KB · ~95% of emails.raw rate (nearly all have body text)
analysis.plans — 6 · 24h · email_id · SVC-02 (Parser) · cg-aggregator · 0.2 KB · = emails.raw rate
scores.url — 6 · 24h · email_id · SVC-03 (URL Analysis) · cg-aggregator · 1–5 KB · = analysis.urls rate
scores.header — 6 · 24h · email_id · SVC-04 (Header Analysis) · cg-aggregator · 1–3 KB · = analysis.headers rate
scores.attachment — 3 · 24h · email_id · SVC-05 (Attachment Analysis) · cg-aggregator · 0.5–2 KB · = analysis.attachments rate
scores.nlp — 6 · 24h · email_id · SVC-06 (NLP Analysis) · cg-aggregator · 0.5–1 KB · = analysis.text rate
emails.scored — 6 · 48h · email_id · SVC-07 (Aggregator) · cg-decision-engine · 5–30 KB (contains all component details) · = emails.raw rate
emails.verdict — 6 · 7d · email_id · SVC-08 (Decision Engine) · cg-notification, cg-dashboard · 1–3 KB · = emails.raw rate
Partition key invariant: All topics use email_id as the partition key. This guarantees that all messages related to the same email land on the same partition within each topic, enabling ordered processing per email within each consumer. Consumer group cg-aggregator consumes 5 topics — however, cross-topic co-location is NOT guaranteed (topics have different partition counts). This is by design: aggregator instances are stateless workers with shared Redis state. Any instance can process any score event for any email.

4. Database Access Matrix — Complete

R = Read · W = Write (INSERT or UPDATE) · U = UPSERT. Only services with access are listed; all other services have no access to the table.

emails — SVC-02: W · SVC-08: W (UPDATE) · SVC-10: R
email_urls — SVC-02: W · SVC-03: R · SVC-10: R
email_attachments — SVC-02: W · SVC-05: R · SVC-10: R
email_recipients — SVC-02: W · SVC-10: R
attachment_library — SVC-02: U · SVC-05: R/W · SVC-10: R · SVC-11: W
enriched_threats — SVC-02: U (bare row, see §13 design note) · SVC-03: R/W · SVC-10: R
ti_indicators — SVC-03: R · SVC-04: R · SVC-05: R · SVC-11: U
email_url_ti_matches — SVC-03: W · SVC-10: R
enrichment_results — SVC-03: U · SVC-05: R/U · SVC-10: R
enrichment_jobs — SVC-03: W · SVC-10: R
verdicts — SVC-08: R/W · SVC-10: R/W
campaigns — SVC-08: R/U · SVC-10: R
rules — SVC-04: R · SVC-08: R · SVC-10: R/W
rule_hits — SVC-04: W · SVC-08: W · SVC-10: R
feeds — SVC-10: R/W · SVC-11: R/W
organisations — SVC-01: R · SVC-09: R · SVC-10: R/W
users — SVC-09: R · SVC-10: R/W
api_keys — SVC-01: R · SVC-10: R/W
audit_log — SVC-10: W
mv_threat_summary — SVC-10: R · SVC-11: REFRESH
mv_campaign_summary — SVC-10: R · SVC-11: REFRESH
mv_feed_health — SVC-10: R · SVC-11: REFRESH
mv_rule_performance — SVC-10: R · SVC-11: REFRESH
mv_org_ingestion_summary — SVC-10: R · SVC-11: REFRESH

5. Redis Key Registry

Key Pattern — Type · TTL · Set By · Read By · Purpose

dedup:{org_id}:{message_id} — STRING ("1") · 7 days · SVC-01 · SVC-01 · Email ingestion deduplication. Existence check only.
url_score:{sha256(url)} — HASH · 6 hours · SVC-03 · SVC-03 · URL score cache. Fields: score, ti_matched, domain, cached_at. Avoids re-enrichment.
ti_domain:{domain} — HASH · 1 hour · SVC-11 · SVC-03, SVC-04 · TI domain cache. Fields: ti_indicator_id, risk_score, threat_type. Sourced from ti_indicators; avoids a DB query per URL.
ti_hash:{sha256} — HASH · 1 hour · SVC-11 · SVC-05 · TI attachment hash cache. Fields: is_malicious, risk_score, source.
aggregator:{email_id} — HASH · 120 seconds · SVC-07 · SVC-07 · Score aggregation state. Fields: expected (JSON), received (JSON), created_at.
simhash:{org_id}:{campaign_id} — STRING · 30 days · SVC-08 · SVC-08 · 64-bit SimHash for campaign near-duplicate detection. Per-org scoped to prevent cross-tenant matching.
notif:{org_id}:{campaign_id} — STRING (counter) · 3600 seconds · SVC-09 · SVC-09 · Notification rate limiter. INCR on each alert; block if count > 1.
rules_cache:{org_id} — STRING (JSON) · 60 seconds · SVC-04, SVC-08 · SVC-04, SVC-08 · Rule cache: full rule set serialized. Avoids a DB poll per email.

6. Error Handling & Timeout Policy

Service — Failure Scenario · Behavior · Retry Policy · Fallback Score

SVC-01 — Adapter fetch failure · Log error, retry on next poll cycle · Exponential backoff, max 3 retries per cycle · N/A (no score)
SVC-01 — Kafka publish failure · Retry with backoff, dead-letter after 5 attempts · 1s, 2s, 4s, 8s, 16s · N/A
SVC-02 — MIME parse failure · Log error, publish with empty URLs/attachments; body_plain = raw-text extraction attempt · No retry (deterministic failure) · N/A
SVC-03 — WHOIS/SSL/DNS timeout (per URL) · Skip enrichment for that URL; use cached data or empty enrichment · 3 retries, 2s timeout each · score = 50
SVC-03 — XGBoost subprocess timeout (5s) · Return neutral score, log model timeout · No retry (latency-critical) · score = 50
SVC-03 — XGBoost subprocess crash · Respawn process from pool, retry once · 1 retry, then score = 50 · score = 50
SVC-04 — Rule evaluation error · Skip malformed rule, log, continue with remaining rules · No retry (skip rule) · rule impact = 0
SVC-05 — VirusTotal API rate limit (429) · Use local hash lookup only, skip VT enrichment · Backoff until next rate window · local-only score
SVC-06 — NLP inference timeout (10s) · Return neutral score, flag partial_analysis · No retry (latency-critical) · score = 50
SVC-06 — Model loading failure (OOM) · Service returns 503, consumer pauses · Kubernetes restart + health check · score = 50 (via aggregator timeout)
SVC-07 — Score never arrives (component down) · 30s timeout → emit partial result; missing component treated as 50 · No retry (timeout-based) · missing = 50
SVC-07 — Redis connection failure · In-memory fallback (non-durable), log critical · Reconnect with backoff · degraded operation
SVC-08 — DB write failure (tx rollback) · Retry entire transaction · 3 retries, exponential backoff · Kafka offset not committed → reprocessed
SVC-11 — Feed fetch failure (HTTP error) · Log error, skip feed, continue with others · Retry on next 6h cycle · stale data (last successful sync)

7. Deduplication Strategy — Complete

Stage — What Is Deduped · Method · Storage · Scope

Email Ingestion (SVC-01) — Duplicate email deliveries · Check (org_id, message_id) existence · Redis set (fast path) → DB unique constraint (authoritative) · Per-org (Message-IDs are not globally unique)
URL Analysis (SVC-03) — Recently analyzed URLs · SHA256 of full URL → Redis cache lookup · Redis url_score:{hash}, TTL 6h · Global (same URL across all orgs)
Attachment Library — Identical file binaries · SHA256 unique constraint on attachment_library · PostgreSQL unique index · Global (cross-tenant — intentional for threat signal)
TI Feed Sync (SVC-11) — Duplicate threat indicators · (indicator_value, indicator_type) unique constraint on ti_indicators · PostgreSQL unique index · Global
Campaign Detection (SVC-08) — Near-duplicate emails (same campaign) · SimHash of body text, Hamming distance ≤ 3 · Redis simhash:{org_id}:{campaign_id}, TTL 30d · Per-org
Campaign Fingerprinting (SVC-08) — Exact-match campaign grouping · SHA256 of normalized (sender_domain + url_domain + subject_template + intent) · PostgreSQL campaigns.fingerprint UNIQUE · Per-org (after org_id addition)
Notification (SVC-09) — Alert fatigue: repeated alerts for the same campaign · Redis counter per (org_id, campaign_id) with 1h TTL · Redis notif:{org_id}:{campaign_id} · Per-org, per-campaign

8. Campaign Detection Algorithm

8.1 Campaign Fingerprint Computation (Exact Match)

INPUT: scored email with verdict

  sender_domain  = lower(email.sender_domain)                           // e.g. "paypa1.com"
  url_domain     = lower(domain_of(email.riskiest_url))                 // e.g. "secure-login.paypa1.com"
                   OR "" if no URLs
  subject_tpl    = template(email.subject)                              // replace names/nums with placeholders
                   e.g. "Your {BRAND} account #{NUM} needs verification"
  intent         = nlp_facets.intent_label                              // e.g. "credential_harvesting"

  fingerprint    = SHA256( sender_domain + "|" + url_domain + "|" + SHA256(subject_tpl) + "|" + intent )

DB OPERATION:
  INSERT INTO campaigns (fingerprint, org_id, threat_type, target_brand, first_seen, last_seen, risk_score)
  VALUES ($fingerprint, $org_id, $threat_type, $target_brand, NOW(), NOW(), $risk_score)
  ON CONFLICT (fingerprint) DO UPDATE SET
      last_seen = NOW(),
      risk_score = (campaigns.risk_score + $risk_score) / 2;   -- rolling average

  UPDATE emails SET campaign_id = <returned campaign_id> WHERE internal_id = $email_id AND fetched_at = $fetched_at;
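The fingerprint computation above, sketched in Go. Helper names are illustrative, and the subject template is assumed to be pre-computed by the templating step:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

func sha256hex(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// fingerprint follows §8.1: SHA256 over sender_domain, url_domain, the
// hash of the subject template, and the NLP intent, joined with "|".
// Domains are lowercased so case variants collapse to one campaign.
func fingerprint(senderDomain, urlDomain, subjectTpl, intent string) string {
	parts := []string{
		strings.ToLower(senderDomain),
		strings.ToLower(urlDomain),
		sha256hex(subjectTpl),
		intent,
	}
	return sha256hex(strings.Join(parts, "|"))
}

func main() {
	fp := fingerprint("paypa1.com", "secure-login.paypa1.com",
		"Your {BRAND} account #{NUM} needs verification", "credential_harvesting")
	fmt.Println(fp[:16]) // stable prefix: identical inputs always collide
}
```

Hashing the subject template separately keeps the outer input fixed-width regardless of subject length, matching the formula in the pseudocode.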
  

8.2 SimHash Near-Duplicate Detection (Fuzzy Match)

INPUT: email plain_text body

  simhash = compute_simhash_64bit(tokenize(plain_text))       // 64-bit SimHash
  
  FOR each campaign_id in Redis keys matching simhash:{org_id}:*:
      stored_hash = GET simhash:{org_id}:{campaign_id}
      distance = hamming_distance(simhash, stored_hash)
      IF distance <= 3:                                        // threshold: 3 bits out of 64
          → associate email with this campaign (even if fingerprint differs)
          → BREAK (take first match; campaigns ordered by last_seen DESC)
  
  IF no match found AND email.risk_score >= 50:
      → new campaign was already created by fingerprint logic above
      → SET simhash:{org_id}:{campaign_id} = simhash (TTL 30d)
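The Hamming-distance comparison at the core of 8.2 is a single XOR and popcount on the two 64-bit hashes:

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming returns the bit distance between two 64-bit SimHashes.
func hamming(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// sameCampaign applies the §8.2 threshold: distance ≤ 3 bits out of 64.
func sameCampaign(a, b uint64) bool { return hamming(a, b) <= 3 }

func main() {
	var a uint64 = 0xDEADBEEFCAFEF00D
	b := a ^ 0b101 // flip two bits
	fmt.Println(hamming(a, b), sameCampaign(a, b)) // 2 true
	c := a ^ 0xFFFF // flip sixteen bits
	fmt.Println(hamming(a, c), sameCampaign(a, c)) // 16 false
}
```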
  

9. Data Privacy & Retention Policy

Data Category — Storage Location · Retention · Purge Method · Justification

Raw email body (body_plain, body_html) — PostgreSQL emails · Configurable per org (default: 30 days) · Scheduled job: UPDATE emails SET body_plain=NULL, body_html=NULL WHERE fetched_at < NOW() - retention_interval AND org_id = $org · Contains user content / PII; not needed after analysis.
Email metadata (sender, subject, headers, scores) — PostgreSQL emails · Indefinite · Soft delete via deleted_at (user request) · Required for campaign tracking, trend analysis, dashboards.
Raw email bytes — Kafka emails.raw · 48 hours · Kafka topic retention policy (automatic) · Needed only for replay/reprocessing; auto-expires.
Analysis intermediate messages — Kafka analysis.*, scores.* · 24 hours · Kafka topic retention policy (automatic) · Transient pipeline data; not needed after verdict.
Verdict messages — Kafka emails.verdict · 7 days · Kafka topic retention policy (automatic) · Longer retention for notification retry / dashboard catch-up.
Verdicts (DB) — PostgreSQL verdicts · Indefinite (append-only) · Never deleted — compliance audit trail · Immutable record of all classification decisions.
Attachment binaries — Object store (S3 / MinIO via storage_uri) · 90 days · S3 lifecycle policy · Useful for forensic analysis; hash + metadata persist indefinitely.
Attachment hashes & metadata — PostgreSQL attachment_library · Indefinite · N/A · Hash-only, no PII; cross-tenant malware signal.
Threat intelligence (email-observed) — PostgreSQL enriched_threats · Indefinite (soft-delete stale) · deleted_at set for entries not seen in 90 days · URLs/domains observed in emails, enriched with WHOIS/SSL/geo.
Threat intelligence (feed indicators) — PostgreSQL ti_indicators · Indefinite (soft-delete stale) · deleted_at set for entries not seen in 90 days · Cheap, normalized feed data for matching; no enrichment columns.
Feed–email URL match audit — PostgreSQL email_url_ti_matches · Same as parent email_urls · CASCADE from email_urls · Junction table recording which TI feed indicator matched a URL.
Enrichment API responses — PostgreSQL enrichment_results · TTL-based (expires_at) · Staleness sweep re-fetches expired entries · Cacheable external API responses; re-fetched when stale.
Audit log — PostgreSQL audit_log · Indefinite (append-only) · Never deleted — compliance · Regulatory / compliance requirement.

10. Infrastructure Requirements

PostgreSQL 15+

Apache Kafka

Redis 7+

11. Scaling Strategy

Service — Stateless? · Scaling Method · Scale Trigger · Min / Max Instances

SVC-01 Ingestion — Yes · Horizontal (multiple adapter instances) · Ingestion lag > 1000 emails · 1 / 4
SVC-02 Parser — Yes · Horizontal (Kafka consumer group) · Consumer lag on emails.raw > 500 · 1 / 6
SVC-03 URL Analysis — Yes · Horizontal (Kafka consumer group) · Consumer lag on analysis.urls > 500 · 1 / 6
SVC-04 Header Analysis — Yes · Horizontal (Kafka consumer group) · Consumer lag on analysis.headers > 1000 · 1 / 3
SVC-05 Attachment Analysis — Yes · Horizontal (Kafka consumer group) · Consumer lag on analysis.attachments > 500 · 1 / 3
SVC-06 NLP Analysis — Yes · Horizontal (Kafka consumer group) · Consumer lag on analysis.text > 200 (heavier compute) · 1 / 4
SVC-07 Aggregator — No (Redis state) · Horizontal OK (shared Redis state) · Consumer lag on any scores.* > 500 · 1 / 3
SVC-08 Decision Engine — Yes · Horizontal (Kafka consumer group) · Consumer lag on emails.scored > 500 · 1 / 3
SVC-09 Notification — Yes · Horizontal · Rarely a bottleneck · 1 / 2
SVC-10 API/Dashboard — Yes · Horizontal (behind load balancer) · HTTP request latency p95 > 500ms · 2 / 6
SVC-11 TI Sync — Yes (scheduled) · Single instance (leader election) · N/A · 1 / 1

12. Observability & SLA Targets

Metric — Target · Measurement Point · Alert Threshold

End-to-end latency (ingestion → verdict) — < 60s p95 · emails.verdict.processing_time_total_ms · > 90s p95 over 5-min window
Email Parser throughput — > 10 emails/s sustained · Kafka consumer lag on emails.raw · Lag > 1000 for > 5 min
ML inference latency (URL XGBoost) — < 5s p99 · SVC-03 histogram · > 5s p99 (timeout territory)
ML inference latency (NLP ensemble) — < 10s p99 · SVC-06 histogram · > 10s p99 (timeout territory)
Score Aggregator completeness — > 95% full analysis (not partial) · partial_analysis: true rate · > 10% partial over 15 min
TI feed freshness — < 7 hours since last sync · feeds.last_fetched_at · > 8 hours since last successful sync
DB connection pool utilization — < 80% · pgbouncer stats · > 90% for > 2 min
Redis memory utilization — < 80% of maxmemory · Redis INFO memory · > 90%
Kafka disk usage per broker — < 70% · broker filesystem · > 85%
Error rate (any service) — < 1% of processed messages · per-service error counters · > 5% over 5 min
Tracing: OpenTelemetry SDK in all Go/Python services. Trace context propagated via Kafka headers (traceparent). Traces exported to Jaeger. email_id attached as span attribute on every operation for cross-service correlation.
Logging: zerolog (Go), structlog (Python). JSON format. Fields: service, email_id, org_id, trace_id, level, msg, error. Shipped to ELK or Loki.
Metrics: Prometheus exposition on :9090/metrics per service. Grafana dashboards.

13. Table Origin Reference — Migration File → Table

Table / View — Defined In · Columns Added In · Column Count / Notes

enriched_threats — 001_initial_schema.up.sql · 002 (+org_id), 003 (+feed_id) · 35 (email-observed URLs; enrichment data)
campaigns — 001_initial_schema.up.sql · 002 (+org_id) · 13
emails (partitioned) — 001_initial_schema.up.sql · 002 (+org_id, +uq_emails_org_message_id) · 30
verdicts — 001_initial_schema.up.sql · 002 (+created_by) · 10
email_urls — 001_initial_schema.up.sql · — · 5
attachment_library — 001_initial_schema.up.sql · 008 (+storage_uri, +updated_at) · 13
email_attachments — 001_initial_schema.up.sql · 008 (+created_at) · 10
organisations — 002_add_users_orgs.up.sql · — · 6
users — 002_add_users_orgs.up.sql · — · 10
api_keys — 002_add_users_orgs.up.sql · — · 10
audit_log — 002_add_users_orgs.up.sql · — · 9
feeds — 003_add_enrichments.up.sql · — · 9
enrichment_jobs — 003_add_enrichments.up.sql · — · 10
enrichment_results — 003_add_enrichments.up.sql · 008 (+ttl_seconds, +expires_at) · 11
rules — 003_add_enrichments.up.sql · 008 (+rule_group_id) · 11
rule_hits — 003_add_enrichments.up.sql · — · 7
email_recipients — 008_add_missing_schema.up.sql · — · 7
ti_indicators — 026_add_ti_indicators.up.sql · — · 12
email_url_ti_matches — 026_add_ti_indicators.up.sql · — · 5
current_verdicts (VIEW) — 001_initial_schema.up.sql · — · 8 (DISTINCT ON verdicts)
mv_threat_summary — 004_add_materialized_view.up.sql · — · aggregates enriched_threats
mv_campaign_summary — 004_add_materialized_view.up.sql · — · joins campaigns + emails + verdicts
mv_feed_health — 004_add_materialized_view.up.sql · — · joins feeds + enriched_threats
mv_rule_performance — 004_add_materialized_view.up.sql · — · joins rules + rule_hits
mv_org_ingestion_summary — 004_add_materialized_view.up.sql · — · joins emails + verdicts
Design note — email_urls.threat_id FK dependency:
The email_urls table (001) has a threat_id BIGINT NOT NULL REFERENCES enriched_threats(id). This means the Email Parser (SVC-02) must UPSERT a bare row into enriched_threats (with just url + domain) before it can INSERT into email_urls, to obtain the threat_id FK. The URL Analysis Service (SVC-03) later UPDATEs that same enriched_threats row with full enrichment data (WHOIS, SSL, geo, etc.). This two-phase write pattern is documented in the column-level flow below.

Design note — email_url_ti_matches junction table (026):
When SVC-03 matches a URL against ti_indicators (feed data), it records the match in email_url_ti_matches, linking email_urls.id → ti_indicators.id. This separates the feed-match audit trail from the enrichment data in enriched_threats. Feed indicators (ti_indicators) are cheap, normalized, and used for matching only. Email-observed threats (enriched_threats) carry expensive enrichment data (WHOIS/SSL/geo/ASN).

14. Column-Level Data Flow — Full Traceability

For every pipeline step: exactly which table columns are read, which are written, and how Kafka message fields map to DB columns. Migration file references in parentheses.

Step 1 — Email Ingestion Service (SVC-01)

Operation | Table (Migration) | Columns Accessed | Direction | Notes
Auth lookup api_keys (002) key_hash, org_id, scopes, revoked_at, expires_at READ Validate API key → extract org_id. Also updates last_used_at.
Ingestion quota organisations (002) id, monthly_ingestion_limit READ Enforce per-org monthly ingestion cap. Cached in Redis.
Dedup (fast) Redis dedup:{org_id}:{message_id} READ then WRITE SETNX check. If key exists → duplicate, discard. If not → set with TTL 7d.
Dedup (authoritative) emails (001+002) org_id, message_id, fetched_at READ (via constraint) Unique constraint uq_emails_org_message_id(org_id, message_id, fetched_at) from migration 002. Falls through to this only on Redis miss.
Kafka output emails.raw → DB column mapping:
The emails.raw message carries raw_rfc822 bytes. These are NOT written to any DB column at this stage. The raw bytes flow to SVC-02 (Parser) which extracts them into individual emails table columns.
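The fast-path dedup check can be sketched as follows, with an in-memory map standing in for Redis SETNX (TTL handling and the DB-constraint fallback omitted; key format per the table above):

```go
package main

import "fmt"

// dedupKey builds the Redis key used by the fast-path check:
// dedup:{org_id}:{message_id}. TTL (7d) is applied by the real service.
func dedupKey(orgID, messageID string) string {
	return "dedup:" + orgID + ":" + messageID
}

// deduper stands in for Redis SETNX semantics: firstTime records the key and
// returns true only if it was absent — "if present, discard".
type deduper struct{ seen map[string]bool }

func (d *deduper) firstTime(key string) bool {
	if d.seen[key] {
		return false // duplicate → discard
	}
	d.seen[key] = true
	return true
}

func main() {
	d := &deduper{seen: map[string]bool{}}
	k := dedupKey("org-1", "<abc@mail.example>")
	fmt.Println(d.firstTime(k), d.firstTime(k)) // true false
}
```

On a Redis miss (eviction, restart), the unique constraint uq_emails_org_message_id remains the authoritative backstop, which is why the cache may be lossy.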

Step 2 — Email Parser Service (SVC-02)

Operation | Table (Migration) | Columns Written | Source
INSERT email emails (001+002) internal_id (auto), fetched_at (from Kafka msg), org_id (from Kafka msg, 002), message_id, sender_name, sender_email, sender_domain, reply_to_email, return_path, originating_ip, auth_spf, auth_dkim, auth_dmarc, auth_arc, x_originating_ip, mailer_agent, in_reply_to, references_list, content_charset, precedence, list_id, vendor_security_tags (JSONB), subject, sent_timestamp, headers_json (JSONB), body_plain, body_html All extracted from parsed RFC 822 MIME. Maps 1:1 to ParsedEmail struct fields in pkg/models/email.go.
Score columns NOT written at this stage: header_risk_score, content_risk_score, attachment_risk_score, url_risk_score, risk_score, campaign_id, analysis_metadata — these remain at DEFAULT 0 / NULL until Decision Engine (Step 5) updates them.
UPSERT bare URL enriched_threats (001) url, domain, tld For each URL extracted from email body/headers. UPSERT by url unique constraint. Writes ONLY url+domain+tld. All enrichment columns (WHOIS, SSL, geo, etc.) left NULL — populated later by SVC-03.
INSERT URL link email_urls (001) email_id ← emails.internal_id,
email_fetched_at ← emails.fetched_at,
threat_id ← enriched_threats.id (from UPSERT above),
visible_text
One row per URL per email. threat_id FK obtained from the enriched_threats UPSERT RETURNING id.
UPSERT attachment hash attachment_library (001+008) sha256, md5, sha1, actual_extension, size_bytes, entropy UPSERT by sha256 unique constraint. is_malicious, risk_score, threat_tags, storage_uri left at defaults — updated later by SVC-05 / SVC-11.
INSERT attachment link email_attachments (001+008) email_id, email_fetched_at, attachment_id ← attachment_library.id,
filename, content_type, content_id, disposition, created_at (008)
One row per attachment per email. analysis_metadata and risk_score left NULL/0 — updated by SVC-05.
INSERT recipients email_recipients (008) email_id, email_fetched_at, org_id, address, display_name, recipient_type ('to'|'cc'|'bcc') One row per recipient (To, CC, BCC) extracted from headers.
Kafka output fields → DB column traceability:

analysis.urls fields: url → comes from parser's URL extraction → already written to enriched_threats.url; visible_text → already written to email_urls.visible_text. The Kafka message carries them to SVC-03 for scoring.

analysis.headers fields: sender_email → emails.sender_email; sender_domain → emails.sender_domain; auth_spf → emails.auth_spf; auth_dkim → emails.auth_dkim; auth_dmarc → emails.auth_dmarc; auth_arc → emails.auth_arc; originating_ip → emails.originating_ip; x_originating_ip → emails.x_originating_ip; mailer_agent → emails.mailer_agent; reply_to_email → emails.reply_to_email; return_path → emails.return_path; in_reply_to → emails.in_reply_to; references_list → emails.references_list; sent_timestamp → emails.sent_timestamp; content_charset → emails.content_charset; precedence → emails.precedence; list_id → emails.list_id; vendor_security_tags → emails.vendor_security_tags; headers_json → emails.headers_json. All already persisted in DB by the INSERT above; the Kafka message passes copies to SVC-04 for analysis so SVC-04 doesn't need to re-read the DB.

analysis.attachments fields: sha256 → attachment_library.sha256; md5 → attachment_library.md5; sha1 → attachment_library.sha1; filename → email_attachments.filename; content_type → email_attachments.content_type; size_bytes → attachment_library.size_bytes; entropy → attachment_library.entropy. Already persisted; copies sent to SVC-05.

analysis.text fields: plain_text → emails.body_plain; subject → emails.subject; sender_name → emails.sender_name; sender_domain → emails.sender_domain. Already persisted; copies sent to SVC-06. Fields body_language and word_count are computed by the parser and only exist in the Kafka message (not stored in DB).

Step 3a — URL Analysis Service (SVC-03)

Operation | Table (Migration) | Columns Accessed | Direction | Details
TI feed match ti_indicators (026) id, indicator_value, indicator_type, risk_score, threat_type, target_brand, deleted_at READ Lookup by indicator_value (unique index) or domain prefix. Only active rows (deleted_at IS NULL). Cached in Redis ti_domain:{domain}.
Prior-email URL lookup enriched_threats (001) id, url, domain, risk_score, threat_type, target_brand, deleted_at READ Check if this URL was previously observed and enriched in a prior email. Uses existing enrichment data to avoid redundant API calls.
UPDATE enrichment data enriched_threats (001) online, http_status_code, ip_address, cidr_block, asn, asn_name, isp, country, country_name, region, city, latitude, longitude, ssl_enabled, cert_issuer, cert_subject, cert_valid_from, cert_valid_to, cert_serial, tld, registrar, creation_date, expiry_date, updated_date, name_servers, page_language, page_title, last_checked, last_seen, risk_score (from ML), analysis_metadata (JSONB) WRITE (UPDATE) Updates the bare row created by SVC-02. All 28 enrichment columns populated from WHOIS, SSL, DNS, HTTP, geo API responses. Maps 1:1 to EnrichmentData struct in pkg/models/url.go.
INSERT TI match audit email_url_ti_matches (026) email_url_id ← email_urls.id, ti_indicator_id ← ti_indicators.id, matched_at WRITE One row per URL-indicator hit. Records which feed indicator matched which email URL for full audit trail.
INSERT enrichment result enrichment_results (003+008) entity_type='threat', entity_id ← enriched_threats.id, provider (e.g. 'whois','ipinfo','ssllabs'), raw_response (JSONB), malicious_votes, harmless_votes, suspicious_votes, reputation_score, ttl_seconds (008), expires_at (008, trigger-computed) UPSERT One row per provider per URL. UPSERT by (entity_type, entity_id, provider) unique constraint. Raw API responses stored for re-scoring without re-fetching.
INSERT enrichment job enrichment_jobs (003) job_type (whois|dns|asn|ip_geo|ssl_cert|url_scan), status, entity_type='threat', entity_id, attempts, max_attempts=3, last_error, started_at, completed_at WRITE One job per enrichment type per URL. Tracks retries and failures. Partial index idx_jobs_pending (003) used for worker poll.
Kafka scores.url output → data provenance:
score → computed from XGBoost ML on features extracted from enriched_threats columns (domain_age = NOW()-creation_date, ssl_enabled, registrar, country, etc.).
ti_matched → true if ti_indicators.risk_score ≥ 80 at lookup time, or enriched_threats.risk_score ≥ 80 from a prior email.
domain_age_days → NOW() - enriched_threats.creation_date.
has_ssl → enriched_threats.ssl_enabled.
ssl_issuer → enriched_threats.cert_issuer.
redirect_count → from HTTP probe, stored in enrichment_results.raw_response JSONB.
None of these score fields are written to DB at this stage — they flow via Kafka to SVC-07 (Aggregator) then SVC-08 (Decision Engine) which writes the final emails.url_risk_score.

Step 3b — Header Analysis Service (SVC-04)

Operation | Table (Migration) | Columns Read | Details
Load rules rules (003+008) id, org_id, name, version, status, logic (JSONB), score_impact, target, rule_group_id (008) WHERE status='active' AND (target='header' OR target='email') AND (org_id=? OR org_id IS NULL). Cached in Redis rules_cache:{org_id} TTL 60s.
Domain reputation ti_indicators (026) indicator_value, risk_score, threat_type, first_seen_at Lookup sender_domain (from Kafka message) against ti_indicators.indicator_value WHERE indicator_type='domain'. Used for: domain age check (first_seen_at), TI match (risk_score). Cached in Redis ti_domain:{domain}.
Operation | Table (Migration) | Columns Written | Details
Record rule fires rule_hits (003) rule_id ← rules.id, rule_version ← rules.version, entity_type='email', entity_id ← emails.internal_id, score_impact ← rules.score_impact, match_detail (JSONB: which signal fired, e.g. {"signal":"spf_fail","value":"fail"}) One INSERT per fired rule. Append-only. Links back to both the rule definition and the email it fired on.
Input traceability — Kafka analysis.headers fields → emails columns (001):
Every field in the analysis.headers Kafka message maps to a column already persisted in the emails table by SVC-02. SVC-04 receives them via Kafka to avoid a DB round-trip. The mapping is:
auth_spf → emails.auth_spf (001)  |  auth_dkim → emails.auth_dkim (001)  |  auth_dmarc → emails.auth_dmarc (001)  |  auth_arc → emails.auth_arc (001)  |  sender_domain → emails.sender_domain (001)  |  reply_to_email → emails.reply_to_email (001)  |  return_path → emails.return_path (001)  |  originating_ip → emails.originating_ip (001)  |  x_originating_ip → emails.x_originating_ip (001)  |  mailer_agent → emails.mailer_agent (001)  |  sent_timestamp → emails.sent_timestamp (001)  |  headers_json → emails.headers_json (001, JSONB — contains the received chain for hop count analysis)

Output: scores.header carries the computed header_risk_score. This value is NOT written to DB here — it flows to SVC-07→SVC-08 which writes emails.header_risk_score (001).
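The rule-evaluation and rule_hits recording described above can be sketched as follows. The simplified string predicate stands in for the JSONB logic column, and the specific rules/signals are illustrative, not from the rule catalog:

```go
package main

import "fmt"

// rule mirrors the columns SVC-04 reads: id, version, score_impact, plus a
// simplified one-signal predicate standing in for the JSONB logic column.
type rule struct {
	id, version int
	scoreImpact int
	signal      string // e.g. "spf_fail"
}

// ruleHit mirrors an append-only rule_hits row.
type ruleHit struct {
	ruleID, ruleVersion, scoreImpact int
	matchDetail                      string // JSONB payload
}

// evaluate fires every rule whose signal appears in the header signals,
// returning the rows to INSERT and the total score adjustment.
func evaluate(rules []rule, signals map[string]string) ([]ruleHit, int) {
	var hits []ruleHit
	total := 0
	for _, r := range rules {
		if v, ok := signals[r.signal]; ok {
			hits = append(hits, ruleHit{r.id, r.version, r.scoreImpact,
				fmt.Sprintf(`{"signal":%q,"value":%q}`, r.signal, v)})
			total += r.scoreImpact
		}
	}
	return hits, total
}

func main() {
	rules := []rule{{1, 3, 25, "spf_fail"}, {2, 1, 40, "dmarc_fail"}}
	hits, total := evaluate(rules, map[string]string{"spf_fail": "fail"})
	fmt.Println(len(hits), total) // 1 25
}
```

Recording rule_version alongside rule_id is what keeps the audit trail stable when a rule's logic is later edited under the same rule_group_id.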

Step 3c — Attachment Analysis Service (SVC-05)

Operation | Table (Migration) | Columns Accessed | Direction | Details
Hash lookup attachment_library (001+008) sha256, is_malicious, risk_score, threat_tags READ Lookup by sha256 (unique index). Cached in Redis ti_hash:{sha256}.
VT enrichment cache enrichment_results (003+008) entity_type='attachment', entity_id, provider='virustotal', raw_response, malicious_votes, harmless_votes, suspicious_votes, expires_at (008) READ then UPSERT Check if cached VT result exists and is not expired (expires_at > NOW()). If miss or stale → query VT API → UPSERT with ttl_seconds=86400 (008).
Update malicious flag attachment_library (001) is_malicious, risk_score, threat_tags, updated_at (008, trigger) WRITE (UPDATE) If VT returns malicious_votes > 0 or heuristics flag dangerous: update is_malicious=true, risk_score, append to threat_tags. The updated_at trigger (008) fires automatically.
Input traceability — Kafka analysis.attachments fields → DB columns:
sha256 → attachment_library.sha256 (001)  |  md5 → attachment_library.md5 (001)  |  sha1 → attachment_library.sha1 (001)  |  filename → email_attachments.filename (001)  |  content_type → email_attachments.content_type (001)  |  entropy → attachment_library.entropy (001)  |  size_bytes → attachment_library.size_bytes (001)  |  content_id → email_attachments.content_id (001)  |  disposition → email_attachments.disposition (001)

Output: scores.attachment carries the computed attachment_risk_score. Flows to SVC-07→SVC-08 which writes emails.attachment_risk_score (001).

Step 3d — NLP Analysis Service (SVC-06)

Operation | Table (Migration) | Columns Accessed | Direction | Details
No database access. Stateless inference only.
Input traceability — Kafka analysis.text fields → DB columns:
plain_text → emails.body_plain (001)  |  subject → emails.subject (001)  |  sender_name → emails.sender_name (001)  |  sender_domain → emails.sender_domain (001)
Fields body_language and word_count are parser-computed and not stored in any DB table.

Output: scores.nlp carries the computed content_risk_score. Flows to SVC-07→SVC-08 which writes emails.content_risk_score (001).

Step 4 — Score Aggregator Service (SVC-07)

Operation | Storage | Details
State tracking Redis only Key aggregator:{email_id} (TTL 120s). No PostgreSQL tables accessed. Consumes from 5 Kafka topics, produces to 1. Pure event-driven state machine.
Score provenance in emails.scored output:
url_score → from scores.url.score (SVC-03) → destined for emails.url_risk_score (001)
header_score → from scores.header.score (SVC-04) → destined for emails.header_risk_score (001)
attachment_score → from scores.attachment.score (SVC-05) → destined for emails.attachment_risk_score (001)
nlp_score → from scores.nlp.score (SVC-06) → destined for emails.content_risk_score (001)
aggregated_score → weighted combination → destined for emails.risk_score (001) after rule adjustment in SVC-08
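The aggregator's completeness tracking and weighted combination can be sketched as below. The weights are purely illustrative (the actual weighting lives in SVC-07 configuration), and this sketch tracks only the four component-score topics, using a map in place of the Redis aggregator:{email_id} state:

```go
package main

import "fmt"

// weights are illustrative only — the real values are SVC-07 configuration.
var weights = map[string]float64{"url": 0.30, "header": 0.25, "attachment": 0.25, "nlp": 0.20}

// aggregator stands in for the Redis aggregator:{email_id} state (TTL 120s
// in the real service; omitted here).
type aggregator struct {
	scores map[string]int // component → score (0–100)
}

// add records one component score and reports true once all components have
// arrived, i.e. when emails.scored can be emitted.
func (a *aggregator) add(component string, score int) bool {
	a.scores[component] = score
	return len(a.scores) == len(weights)
}

// aggregate computes the weighted combination, rounded to an integer 0–100.
func (a *aggregator) aggregate() int {
	var sum float64
	for c, s := range a.scores {
		sum += weights[c] * float64(s)
	}
	return int(sum + 0.5)
}

func main() {
	a := &aggregator{scores: map[string]int{}}
	a.add("url", 80)
	a.add("header", 40)
	a.add("attachment", 0)
	done := a.add("nlp", 60)
	fmt.Println(done, a.aggregate()) // true 46
}
```

The 120s TTL doubles as the partial-analysis timeout: if the key expires before all components arrive, SVC-08 later applies its partial_analysis confidence penalty.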

Step 5 — Decision Engine / Verdict Service (SVC-08)

Operation | Table (Migration) | Columns Accessed | Direction | SQL Pattern
Load rules rules (003+008) id, org_id, name, version, status, logic, score_impact, target, rule_group_id (008) READ SELECT * FROM rules WHERE status='active' AND (org_id=? OR org_id IS NULL)
Campaign lookup campaigns (001+002) id, fingerprint, org_id (002) READ SELECT id FROM campaigns WHERE fingerprint=?
UPDATE email scores emails (001) risk_score ← final_score (after rule adjustments),
header_risk_score ← from scores.header,
content_risk_score ← from scores.nlp,
url_risk_score ← from scores.url,
attachment_risk_score ← from scores.attachment,
campaign_id ← from campaign UPSERT below,
analysis_metadata (JSONB) ← full component breakdown
WRITE (UPDATE) UPDATE emails SET risk_score=?, header_risk_score=?, content_risk_score=?, url_risk_score=?, attachment_risk_score=?, campaign_id=?, analysis_metadata=? WHERE internal_id=? AND fetched_at=?
INSERT verdict verdicts (001+002) entity_type='email',
entity_id ← emails.internal_id,
label ← verdict_label ENUM (001),
confidence ← computed separately: base = distance from nearest threshold / 25, penalties for partial_analysis (×0.7) and rule-only verdict (×0.5),
source ← 'model' (verdict_source ENUM, 001),
model_version ← from config,
notes ← NULL (automated),
created_by (002) ← NULL (automated)
WRITE (INSERT) Append-only. Never UPDATE existing verdicts. Current verdict resolved by current_verdicts VIEW (001): DISTINCT ON (entity_type, entity_id) ORDER BY created_at DESC.
INSERT rule hits rule_hits (003) rule_id, rule_version, entity_type='email', entity_id, score_impact, match_detail (JSONB) WRITE (INSERT) One row per rule that fired. Same structure as SVC-04 rule_hits but with target='email' rules.
UPSERT campaign campaigns (001+002) fingerprint ← SHA256(sender_domain|url_domain|subject_tpl|intent),
org_id (002),
name ← auto-generated,
threat_type,
target_brand,
first_seen / last_seen,
risk_score ← rolling avg
UPSERT INSERT INTO campaigns (fingerprint, org_id, ...) ON CONFLICT (fingerprint) DO UPDATE SET last_seen=NOW(), risk_score=(campaigns.risk_score + $score)/2
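The confidence formula from the verdict row above can be sketched as follows. The threshold values are illustrative stand-ins for SVC-08 configuration; the spec fixes only the shape of the formula (distance from nearest threshold / 25, ×0.7 for partial analysis, ×0.5 for rule-only):

```go
package main

import (
	"fmt"
	"math"
)

// thresholds are illustrative verdict boundaries — the real values come from
// SVC-08 configuration.
var thresholds = []float64{25, 50, 75}

// confidence implements: base = distance from nearest threshold / 25, capped
// at 1.0, then ×0.7 for partial_analysis and ×0.5 for a rule-only verdict.
func confidence(score float64, partial, ruleOnly bool) float64 {
	nearest := math.Inf(1)
	for _, t := range thresholds {
		if d := math.Abs(score - t); d < nearest {
			nearest = d
		}
	}
	c := nearest / 25
	if c > 1 {
		c = 1
	}
	if partial {
		c *= 0.7
	}
	if ruleOnly {
		c *= 0.5
	}
	return c
}

func main() {
	fmt.Println(confidence(95, false, false)) // far from any boundary → high confidence
	fmt.Println(confidence(51, true, false))  // hugging a boundary, partial → low confidence
}
```

The intuition: a final score of 51 barely clears a boundary, so the verdict label could easily have flipped, and the stored confidence reflects that fragility.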

Step 6a — Notification Service (SVC-09)

Operation | Table (Migration) | Columns Read | Details
Org preferences organisations (002) id, name, slug, monthly_ingestion_limit Notification threshold not yet a column — requires migration (see §15 Schema Gap #1). Until migration is applied, SVC-09 uses config-based defaults: threshold=70, channels=["email"]. Migration adds notification_threshold INT DEFAULT 70 and notification_channels JSONB DEFAULT '["email"]' to organisations table.
Admin contacts users (002) id, org_id, email, display_name, role SELECT email, display_name FROM users WHERE org_id=? AND role='admin' AND deleted_at IS NULL

Step 6b — API / Dashboard Service (SVC-10)

Operation | Table (Migration) | Columns | Direction
Email list/detail | emails (001+002) | All 30 columns | READ
Email URLs | email_urls (001) JOIN enriched_threats (001) | All columns from both (URL details + threat details) | READ
Email attachments | email_attachments (001+008) JOIN attachment_library (001+008) | All columns from both | READ
Email recipients | email_recipients (008) | All 7 columns | READ
Current verdict | current_verdicts VIEW (001) | entity_type, entity_id, label, confidence, source, model_version, notes, created_at | READ
Verdict history | verdicts (001+002) | All 10 columns | READ
Campaign list/detail | mv_campaign_summary (004) | All MV columns (joins campaigns+emails+verdicts) | READ
Org dashboard stats | mv_org_ingestion_summary (004) | All MV columns (aggregates emails+verdicts) | READ
TI dashboard | mv_threat_summary (004) | All MV columns (aggregates enriched_threats) | READ
Feed health | mv_feed_health (004) | All MV columns (joins feeds+enriched_threats) | READ
Rule performance | mv_rule_performance (004) | All MV columns (joins rules+rule_hits) | READ
Rule CRUD | rules (003+008) | All 11 columns incl. rule_group_id (008) | READ + WRITE
Analyst verdict | verdicts (001+002) | INSERT: entity_type, entity_id, label, confidence, source='analyst', notes, created_by (002) → users.id | WRITE
User/org mgmt | users (002), organisations (002) | All columns | READ + WRITE
API key mgmt | api_keys (002) | All 10 columns | READ + WRITE
Feed mgmt | feeds (003) | All 9 columns | READ + WRITE
Audit logging | audit_log (002) | org_id, user_id, api_key_id, action, entity_type, entity_id, diff (JSONB), ip_address | WRITE (every mutation)
Enrichment jobs | enrichment_jobs (003) | All 10 columns | READ (monitoring)
Enrichment results | enrichment_results (003+008) | All 11 columns | READ (detail views)
Rule hits | rule_hits (003) | All 7 columns | READ (explainability)

Background — TI Feed Sync Service (SVC-11)

Operation | Table (Migration) | Columns | Direction | Details
Load feed config feeds (003) id, name, feed_type, url, last_fetched_at, enabled, reliability_weight READ SELECT * FROM feeds WHERE enabled=TRUE
Update feed status feeds (003) last_fetched_at, updated_at (trigger) WRITE UPDATE feeds SET last_fetched_at=NOW() WHERE id=?
UPSERT threat indicators ti_indicators (026) indicator_value, indicator_type (url|domain|ip|hash), threat_type, target_brand, threat_tags, feed_id ← feeds.id, source_id (original feed record ID), first_seen_at, last_seen_at, risk_score, deleted_at (NULL for active) UPSERT INSERT INTO ti_indicators (indicator_value, indicator_type, ...) ON CONFLICT (indicator_value, indicator_type) DO UPDATE SET last_seen_at=NOW(), threat_tags=array_cat(ti_indicators.threat_tags, $new_tags), risk_score=GREATEST(ti_indicators.risk_score, $score). No enrichment columns — ti_indicators is a lightweight feed-only table.
UPSERT malware hashes attachment_library (001) sha256, is_malicious=TRUE, risk_score, threat_tags UPSERT MalwareBazaar feed provides SHA256 hashes of known malware. INSERT INTO attachment_library (sha256, is_malicious, risk_score, threat_tags) ON CONFLICT (sha256) DO UPDATE SET is_malicious=TRUE, risk_score=GREATEST(...)
Refresh MVs All 5 materialized views (004) REFRESH Calls refresh_all_materialized_views() (function defined in 004). CONCURRENTLY mode, no read blocking.
Update Redis caches Redis ti_domain:{domain}, ti_hash:{sha256} WRITE Bulk-load recently-updated ti_indicators domains and attachment_library hashes into Redis for fast lookup by SVC-03/04/05.
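The ON CONFLICT merge semantics for ti_indicators can be sketched as a pure function (GREATEST on risk_score, array_cat on threat_tags; last_seen_at handling omitted, struct names illustrative):

```go
package main

import "fmt"

// indicator mirrors the ti_indicators columns touched by the conflict clause.
type indicator struct {
	value, indType string
	riskScore      int
	threatTags     []string
}

// merge reproduces the ON CONFLICT clause above: keep the higher risk_score
// (GREATEST) and concatenate threat_tags (array_cat).
func merge(existing, incoming indicator) indicator {
	out := existing
	if incoming.riskScore > out.riskScore {
		out.riskScore = incoming.riskScore
	}
	out.threatTags = append(append([]string{}, existing.threatTags...), incoming.threatTags...)
	return out
}

func main() {
	old := indicator{"evil.example", "domain", 70, []string{"phishing"}}
	inc := indicator{"evil.example", "domain", 55, []string{"credential-theft"}}
	m := merge(old, inc)
	fmt.Println(m.riskScore, m.threatTags) // 70 [phishing credential-theft]
}
```

GREATEST makes the merge monotone: an indicator's score can only ratchet upward across feed syncs, so a low-reliability feed can never downgrade a high-risk indicator.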

15. Implementation Deltas — Tracked Gaps

The following items are known differences between the architecture as specified and the current schema/implementation state. These are tracked deltas, not unresolved design questions. Each has a clear resolution path.

Gap | Affected Table | Migration Needed? | Resolution
organisations lacks notification config columns organisations (002) Yes — new migration Add notification_threshold INT DEFAULT 70, notification_channels JSONB DEFAULT '[]', retention_days INT DEFAULT 30. SVC-09 reads these; SVC-10 writes them.
email_attachments.risk_score never updated by SVC-05 email_attachments (001) No — just implementation SVC-05 should UPDATE email_attachments SET risk_score=?, analysis_metadata=? WHERE email_id=? AND attachment_id=? after scoring each attachment.
enrichment_results.suspicious_votes column present but not referenced enrichment_results (003) No SVC-03 and SVC-05 should populate this from VirusTotal last_analysis_stats.suspicious field. Already in schema, just needs implementation.
current_verdicts VIEW not referenced in spec VIEW on verdicts (001) No SVC-10 (API/Dashboard) should use current_verdicts for email detail views instead of manually DISTINCT ON. Already exists in 001.
rules.rule_group_id (008) not in spec rules (008) No SVC-10 uses rule_group_id when displaying rule version history. SVC-04/SVC-08 should include it in rule cache for audit trail. Already in schema.

END OF DOCUMENT — ARCH-SPEC-v2.2 — CyberSiren System Architecture Specification — 2026-03-15
This document is the authoritative reference for implementation. All service boundaries, message schemas, database access patterns, column-level data flows, and migration references defined herein constitute the approved architecture baseline. Items in the Schema Gap section (§15) are tracked implementation deltas.