From 33708e51e7f9c6ef4cd0f1503971b4607293a155 Mon Sep 17 00:00:00 2001 From: Waleed Latif Date: Thu, 4 Jun 2026 12:06:15 -0700 Subject: [PATCH 1/2] chore(skills): mirror model/enrichment/hosted-key/council skills into .agents/skills and expand add-model touchpoints --- .agents/skills/add-enrichment/SKILL.md | 142 +++++++++ .../skills/add-enrichment/agents/openai.yaml | 5 + .agents/skills/add-hosted-key/SKILL.md | 296 ++++++++++++++++++ .../skills/add-hosted-key/agents/openai.yaml | 5 + .agents/skills/add-model/SKILL.md | 209 +++++++++++++ .agents/skills/add-model/agents/openai.yaml | 5 + .agents/skills/council/SKILL.md | 12 + .agents/skills/validate-model/SKILL.md | 170 ++++++++++ .../skills/validate-model/agents/openai.yaml | 5 + .claude/commands/add-model.md | 56 +++- .claude/commands/validate-model.md | 4 + 11 files changed, 906 insertions(+), 3 deletions(-) create mode 100644 .agents/skills/add-enrichment/SKILL.md create mode 100644 .agents/skills/add-enrichment/agents/openai.yaml create mode 100644 .agents/skills/add-hosted-key/SKILL.md create mode 100644 .agents/skills/add-hosted-key/agents/openai.yaml create mode 100644 .agents/skills/add-model/SKILL.md create mode 100644 .agents/skills/add-model/agents/openai.yaml create mode 100644 .agents/skills/council/SKILL.md create mode 100644 .agents/skills/validate-model/SKILL.md create mode 100644 .agents/skills/validate-model/agents/openai.yaml diff --git a/.agents/skills/add-enrichment/SKILL.md b/.agents/skills/add-enrichment/SKILL.md new file mode 100644 index 00000000000..f963ce14517 --- /dev/null +++ b/.agents/skills/add-enrichment/SKILL.md @@ -0,0 +1,142 @@ +--- +name: add-enrichment +description: Add a code-defined table enrichment (registry entry) under `apps/sim/enrichments/` backed by an ordered provider cascade, ensuring every provider tool it calls has hosted-key support. Use when adding a per-row table enrichment that fills cells via existing Sim tools. +--- + +# Adding a Table Enrichment + +Enrichments are code-defined entries in `apps/sim/enrichments/` that run **directly per table row** (no workflow). Each enrichment declares inputs, outputs, and an ordered list of **providers**; the cascade runner tries providers in order and the first non-empty result fills the cell. Each provider calls one existing Sim tool via `executeTool`, which injects the workspace's BYOK key or a **hosted key** and bills usage automatically. + +Because enrichments run on Sim's hosted keys by default, **every provider tool you reference must have hosted-key support** — otherwise it can only run when the workspace brings its own key. This command makes that check a required step. + +## Overview + +| Step | What | Where | +|------|------|-------| +| 1 | Pick the data-source tool(s) for each output | `tools/{service}/` + `tools/registry.ts` | +| 2 | **Verify each tool has `hosting`; if not, run `/add-hosted-key`** | `tools/{service}/{action}.ts` | +| 3 | Write the enrichment definition | `enrichments/{name}/{name}.ts` + `index.ts` | +| 4 | Register it | `enrichments/registry.ts` | +| 5 | Verify | tsc / biome / manual run | + +## Architecture (what you're plugging into) + +- **`enrichments/types.ts`** — `EnrichmentConfig { id, name, description, icon, inputs, outputs, providers }` and `EnrichmentProvider { id, label, toolId, buildParams, mapOutput }`. Providers are **plain data** (no `@/tools` import) so the catalog stays client-safe. +- **`enrichments/providers.ts`** — `toolProvider(...)` (typed passthrough) plus shared input helpers: `str(v)`, `normalizeDomain(v)`, `firstNonEmpty(arr)`, `splitName(fullName)`. +- **`enrichments/run.ts`** — the server-only cascade runner. Calls `executeTool(provider.toolId, { ...params, _context: { workspaceId } })`, accumulates hosted-key cost, returns the first non-empty mapped result. **You do not edit this** — it works for any registry entry. +- **`enrichments/registry.ts`** — `ENRICHMENT_REGISTRY` / `ALL_ENRICHMENTS` / `getEnrichment`. Register new entries here. + +Outputs automatically become table columns; billing, the catalog/sidebar UI, the column meta-header icon, and per-row execution all work with no extra wiring. + +## Step 1: Pick the data-source tool(s) + +For each output the enrichment produces, decide which existing tool provides it. Look up the service's API and the tool in `apps/sim/tools/{service}/` (e.g. `hunter_email_finder`, `pdl_person_enrich`, `pdl_company_enrich`). Confirm: + +- The tool id is registered in `apps/sim/tools/registry.ts`. +- Its `params` accept what you can derive from table columns (read the tool's `params`). +- Its `outputs` / `transformResponse` actually expose the field you need (read the real output shape — don't assume). + +Order providers **cheapest / most-likely-to-hit first**; the cascade stops at the first non-empty result. Apollo / LinkedIn are not hosted-safe (ToS) — don't use them. + +## Step 2: Verify hosted-key support — chain to `/add-hosted-key` if missing + +**This is the required gate.** For every tool a provider calls, open `apps/sim/tools/{service}/{action}.ts` and check for a `hosting` block: + +```typescript +hosting: { + envKeyPrefix: 'SERVICE_API_KEY', + apiKeyParam: 'apiKey', + byokProviderId: 'service', + pricing: { /* ... */ }, + rateLimit: { /* ... */ }, +} +``` + +- **If `hosting` is present** — good. Note the `envKeyPrefix`; the deployment needs `{PREFIX}_COUNT` + `{PREFIX}_1..N` env vars set for the hosted key to actually resolve at runtime (ops concern, not code). If those env vars aren't set in the target environment, the provider will only run with a workspace BYOK key. +- **If `hosting` is absent** — the tool can't use a Sim-provided key, so the enrichment would silently produce blank cells on hosted Sim. **Stop and run `/add-hosted-key `** to add hosted-key support to that tool first, then come back. Do this for every provider tool that lacks it. + +Why it matters: the cascade runner only bills (and only reads `output.cost.total`) when `executeTool` injected a hosted key, which requires the tool's `hosting` config. No `hosting` → no hosted key → the enrichment depends entirely on per-workspace BYOK. + +## Step 3: Write the enrichment definition + +Create `apps/sim/enrichments/{name}/{name}.ts` and a barrel `index.ts`. Mirror the existing entries (`work-email`, `phone-number`, `company-domain`, `company-info`). + +```typescript +import { SomeIcon } from 'lucide-react' +import { filterUndefined } from '@sim/utils/object' +import { normalizeDomain, splitName, str, toolProvider } from '@/enrichments/providers' +import type { EnrichmentConfig } from '@/enrichments/types' + +export const myEnrichment: EnrichmentConfig = { + id: 'my-enrichment', + name: 'My Enrichment', + description: 'One concise sentence describing what it finds.', + icon: SomeIcon, + inputs: [ + // Person enrichments take a single canonical `fullName` (Clay-style); + // split it with splitName() for tools that need first/last. + { id: 'fullName', name: 'Full name', type: 'string', required: true }, + { id: 'companyDomain', name: 'Company domain', type: 'string' }, + ], + outputs: [{ id: 'value', name: 'value', type: 'string' }], + providers: [ + toolProvider({ + id: 'provider-a', + label: 'Provider A', + toolId: 'service_action', // must have `hosting` (Step 2) + buildParams: (inputs) => { + // Return null when there aren't enough inputs → cascade skips this provider. + const name = splitName(inputs.fullName) + const domain = normalizeDomain(inputs.companyDomain) + if (!name || !domain) return null + return { domain, first_name: name.firstName, last_name: name.lastName } + }, + mapOutput: (output) => { + // Return { [outputId]: value } on a hit, or null to fall through. + const value = str(output.value) + return value ? { value } : null + }, + }), + // ...additional fallback providers, in priority order. + ], +} +``` + +```typescript +// apps/sim/enrichments/{name}/index.ts +export { myEnrichment } from './my-enrichment' +``` + +Rules: +- Keep the file **client-safe**: import only `lucide-react`, `@sim/utils/*`, `@/enrichments/providers`, and the types. **Never import `@/tools`** here — the runner does the tool call. +- `buildParams` returns `null` when inputs are insufficient (provider skipped). `mapOutput` returns `null`/empty for a miss (falls through). Use `filterUndefined` when assembling optional tool params; coerce numbers explicitly (don't pass `''` to number outputs). +- Output `id`s are the keys `mapOutput` returns; output `name`s are the default column names (the user can rename them in the config). + +## Step 4: Register it + +In `apps/sim/enrichments/registry.ts`, import and add the entry (catalog order is registration order): + +```typescript +import { myEnrichment } from '@/enrichments/my-enrichment' + +export const ENRICHMENT_REGISTRY: EnrichmentRegistry = { + // ...existing + [myEnrichment.id]: myEnrichment, +} +``` + +## Step 5: Verify + +1. `bunx tsc --noEmit` (from `apps/sim`, `NODE_OPTIONS=--max-old-space-size=8192`) and `bunx biome check` on the changed files. +2. In a table → **+ New column → Enrichments** → pick the new enrichment, map its inputs to columns, name the output column(s), Save. Confirm it appears in the catalog with its icon/description. +3. With hosted keys (or a workspace BYOK key) configured for each provider's service, run a row and confirm the cell fills; the dev-server log shows `Enrichment hit { provider }`. A row whose providers all miss completes blank; a row where every provider errored shows an error cell. + +## Checklist + +- [ ] Each output mapped to a real tool field (verified against the tool's `params`/`outputs`) +- [ ] **Every provider tool has a `hosting` block — ran `/add-hosted-key` for any that didn't** +- [ ] Providers ordered cheapest / most-likely-first; Apollo/LinkedIn not used +- [ ] Enrichment file is client-safe (no `@/tools` import); uses `toolProvider` + shared helpers +- [ ] `buildParams` returns `null` on insufficient inputs; `mapOutput` returns `null` on a miss +- [ ] Registered in `enrichments/registry.ts` +- [ ] tsc + biome clean; created and ran the column end-to-end diff --git a/.agents/skills/add-enrichment/agents/openai.yaml b/.agents/skills/add-enrichment/agents/openai.yaml new file mode 100644 index 00000000000..f19413c0a44 --- /dev/null +++ b/.agents/skills/add-enrichment/agents/openai.yaml @@ -0,0 +1,5 @@ +interface: + display_name: "Add Enrichment" + short_description: "Build a table enrichment cascade" + brand_color: "#16A34A" + default_prompt: "Use $add-enrichment to add a code-defined Sim table enrichment backed by a provider cascade." diff --git a/.agents/skills/add-hosted-key/SKILL.md b/.agents/skills/add-hosted-key/SKILL.md new file mode 100644 index 00000000000..2cb257c2060 --- /dev/null +++ b/.agents/skills/add-hosted-key/SKILL.md @@ -0,0 +1,296 @@ +--- +name: add-hosted-key +description: Add hosted API key support to a tool so Sim provides the key (metered and billed to the workspace) when a user has not brought their own. Use when adding a `hosting` config to a tool under `apps/sim/tools/{service}/`. +--- + +# Adding Hosted Key Support to a Tool + +When a tool has hosted key support, Sim provides its own API key if the user hasn't configured one (via BYOK or env var). Usage is metered and billed to the workspace. + +## Overview + +| Step | What | Where | +|------|------|-------| +| 1 | Register BYOK provider ID | `tools/types.ts`, `app/api/workspaces/[id]/byok-keys/route.ts` | +| 2 | Research the API's pricing and rate limits | API docs / pricing page (before writing any code) | +| 3 | Add `hosting` config to the tool | `tools/{service}/{action}.ts` | +| 4 | Hide API key field when hosted | `blocks/blocks/{service}.ts` | +| 5 | Add to BYOK settings UI | BYOK settings component (`byok.tsx`) | +| 6 | Summarize pricing and throttling comparison | Output to user (after all code changes) | + +## Step 1: Register the BYOK Provider ID + +Add the new provider to the `BYOKProviderId` union in `tools/types.ts`: + +```typescript +export type BYOKProviderId = + | 'openai' + | 'anthropic' + // ...existing providers + | 'your_service' +``` + +Then add it to `VALID_PROVIDERS` in `app/api/workspaces/[id]/byok-keys/route.ts`: + +```typescript +const VALID_PROVIDERS = ['openai', 'anthropic', 'google', 'mistral', 'your_service'] as const +``` + +## Step 2: Research the API's Pricing Model and Rate Limits + +**Before writing any `getCost` or `rateLimit` code**, look up the service's official documentation for both pricing and rate limits. You need to understand: + +### Pricing + +1. **How the API charges** — per request, per credit, per token, per step, per minute, etc. +2. **Whether the API reports cost in its response** — look for fields like `creditsUsed`, `costDollars`, `tokensUsed`, or similar in the response body or headers +3. **Whether cost varies by endpoint/options** — some APIs charge more for certain features (e.g., Firecrawl charges 1 credit/page base but +4 for JSON format, +4 for enhanced mode) +4. **The dollar-per-unit rate** — what each credit/token/unit costs in dollars on our plan + +### Rate Limits + +1. **What rate limits the API enforces** — requests per minute/second, tokens per minute, concurrent requests, etc. +2. **Whether limits vary by plan tier** — free vs paid vs enterprise often have different ceilings +3. **Whether limits are per-key or per-account** — determines whether adding more hosted keys actually increases total throughput +4. **What the API returns when rate limited** — HTTP 429, `Retry-After` header, error body format, etc. +5. **Whether there are multiple dimensions** — some APIs limit both requests/min AND tokens/min independently + +Search the API's docs/pricing page (use WebSearch/WebFetch). Capture the pricing model as a comment in `getCost` so future maintainers know the source of truth. + +### Setting Our Rate Limits + +Our rate limiter (`lib/core/rate-limiter/hosted-key/`) uses a token-bucket algorithm applied **per billing actor** (workspace). It supports two modes: + +- **`per_request`** — simple; just `requestsPerMinute`. Good when the API charges flat per-request or cost doesn't vary much. +- **`custom`** — `requestsPerMinute` plus additional `dimensions` (e.g., `tokens`, `search_units`). Each dimension has its own `limitPerMinute` and an `extractUsage` function that reads actual usage from the response. Use when the API charges on a variable metric (tokens, credits) and you want to cap that metric too. + +When choosing values for `requestsPerMinute` and any dimension limits: + +- **Stay well below the API's per-key limit** — our keys are shared across all workspaces. If the API allows 60 RPM per key and we have 3 keys, the global ceiling is ~180 RPM. Set the per-workspace limit low enough (e.g., 20-60 RPM) that many workspaces can coexist without collectively hitting the API's ceiling. +- **Account for key pooling** — our round-robin distributes requests across `N` hosted keys, so the effective API-side rate per key is `(total requests) / N`. But per-workspace limits are enforced *before* key selection, so they apply regardless of key count. +- **Prefer conservative defaults** — it's easy to raise limits later but hard to claw back after users depend on high throughput. + +## Step 3: Add `hosting` Config to the Tool + +Add a `hosting` object to the tool's `ToolConfig`. This tells the execution layer how to acquire hosted keys, calculate cost, and rate-limit. + +```typescript +hosting: { + envKeyPrefix: 'YOUR_SERVICE_API_KEY', + apiKeyParam: 'apiKey', + byokProviderId: 'your_service', + pricing: { + type: 'custom', + getCost: (_params, output) => { + if (output.creditsUsed == null) { + throw new Error('Response missing creditsUsed field') + } + const creditsUsed = output.creditsUsed as number + const cost = creditsUsed * 0.001 // dollars per credit + return { cost, metadata: { creditsUsed } } + }, + }, + rateLimit: { + mode: 'per_request', + requestsPerMinute: 100, + }, +}, +``` + +### Hosted Key Env Var Convention + +Keys use a numbered naming pattern driven by a count env var: + +``` +YOUR_SERVICE_API_KEY_COUNT=3 +YOUR_SERVICE_API_KEY_1=sk-... +YOUR_SERVICE_API_KEY_2=sk-... +YOUR_SERVICE_API_KEY_3=sk-... +``` + +The `envKeyPrefix` value (`YOUR_SERVICE_API_KEY`) determines which env vars are read at runtime. Adding more keys only requires bumping the count and adding the new env var. + +### Pricing: Prefer API-Reported Cost + +Always prefer using cost data returned by the API (e.g., `creditsUsed`, `costDollars`). This is the most accurate because it accounts for variable pricing tiers, feature modifiers, and plan-level discounts. + +**When the API reports cost** — use it directly and throw if missing: + +```typescript +pricing: { + type: 'custom', + getCost: (params, output) => { + if (output.creditsUsed == null) { + throw new Error('Response missing creditsUsed field') + } + // $0.001 per credit — from https://example.com/pricing + const cost = (output.creditsUsed as number) * 0.001 + return { cost, metadata: { creditsUsed: output.creditsUsed } } + }, +}, +``` + +**When the API does NOT report cost** — compute it from params/output based on the pricing docs, but still validate the data you depend on: + +```typescript +pricing: { + type: 'custom', + getCost: (params, output) => { + if (!Array.isArray(output.searchResults)) { + throw new Error('Response missing searchResults, cannot determine cost') + } + // Serper: 1 credit for <=10 results, 2 credits for >10 — from https://serper.dev/pricing + const credits = Number(params.num) > 10 ? 2 : 1 + return { cost: credits * 0.001, metadata: { credits } } + }, +}, +``` + +**`getCost` must always throw** if it cannot determine cost. Never silently fall back to a default — this would hide billing inaccuracies. + +### Capturing Cost Data from the API + +If the API returns cost info, capture it in `transformResponse` so `getCost` can read it from the output: + +```typescript +transformResponse: async (response: Response) => { + const data = await response.json() + return { + success: true, + output: { + results: data.results, + creditsUsed: data.creditsUsed, // pass through for getCost + }, + } +}, +``` + +For async/polling tools, capture it in `postProcess` when the job completes: + +```typescript +if (jobData.status === 'completed') { + result.output = { + data: jobData.data, + creditsUsed: jobData.creditsUsed, + } +} +``` + +## Step 4: Hide the API Key Field When Hosted + +In the block config (`blocks/blocks/{service}.ts`), add `hideWhenHosted: true` to the API key subblock. This hides the field on hosted Sim since the platform provides the key: + +```typescript +{ + id: 'apiKey', + title: 'API Key', + type: 'short-input', + placeholder: 'Enter your API key', + password: true, + required: true, + hideWhenHosted: true, +}, +``` + +The visibility is controlled by `isSubBlockHidden()` in `lib/workflows/subblocks/visibility.ts`, which checks both the `isHosted` feature flag (`hideWhenHosted`) and optional env var conditions (`hideWhenEnvSet`). + +### Excluding Specific Operations from Hosted Key Support + +When a block has multiple operations but some operations should **not** use a hosted key (e.g., the underlying API is deprecated, unsupported, or too expensive), use the **duplicate apiKey subblock** pattern. This is the same pattern Exa uses for its `research` operation: + +1. **Remove the `hosting` config** from the tool definition for that operation — it must not have a `hosting` object at all. +2. **Duplicate the `apiKey` subblock** in the block config with opposing conditions: + +```typescript +// API Key — hidden when hosted for operations with hosted key support +{ + id: 'apiKey', + title: 'API Key', + type: 'short-input', + placeholder: 'Enter your API key', + password: true, + required: true, + hideWhenHosted: true, + condition: { field: 'operation', value: 'unsupported_op', not: true }, +}, +// API Key — always visible for unsupported_op (no hosted key support) +{ + id: 'apiKey', + title: 'API Key', + type: 'short-input', + placeholder: 'Enter your API key', + password: true, + required: true, + condition: { field: 'operation', value: 'unsupported_op' }, +}, +``` + +Both subblocks share the same `id: 'apiKey'`, so the same value flows to the tool. The conditions ensure only one is visible at a time. The first has `hideWhenHosted: true` and shows for all hosted operations; the second has no `hideWhenHosted` and shows only for the excluded operation — meaning users must always provide their own key for that operation. + +To exclude multiple operations, use an array: `{ field: 'operation', value: ['op_a', 'op_b'] }`. + +**Reference implementations:** +- **Exa** (`blocks/blocks/exa.ts`): `research` operation excluded from hosting — lines 309-329 +- **Google Maps** (`blocks/blocks/google_maps.ts`): `speed_limits` operation excluded from hosting (deprecated Roads API) + +## Step 5: Add to the BYOK Settings UI + +Add an entry to the `PROVIDERS` array in the BYOK settings component so users can bring their own key. You need the service icon from `components/icons.tsx`: + +```typescript +{ + id: 'your_service', + name: 'Your Service', + icon: YourServiceIcon, + description: 'What this service does', + placeholder: 'Enter your API key', +}, +``` + +## Step 6: Summarize Pricing and Throttling Comparison + +After all code changes are complete, output a detailed summary to the user covering: + +### What to include + +1. **API's pricing model** — how the service charges (per token, per credit, per request, etc.), the specific rates found in docs, and whether the API reports cost in responses. +2. **Our `getCost` approach** — how we calculate cost, what fields we depend on, and any assumptions or estimates (especially when the API doesn't report exact dollar cost). +3. **API's rate limits** — the documented limits (RPM, TPM, concurrent, etc.), which plan tier they apply to, and whether they're per-key or per-account. +4. **Our `rateLimit` config** — what we set for `requestsPerMinute` (and dimensions if custom mode), why we chose those values, and how they compare to the API's limits. +5. **Key pooling impact** — how many hosted keys we expect, and how round-robin distribution affects the effective per-key rate at the API. +6. **Gaps or risks** — anything the API charges for that we don't meter, rate limit dimensions we chose not to enforce, or pricing that may be inaccurate due to variable model/tier costs. + +### Format + +Present this as a structured summary with clear headings. Example: + +``` +### Pricing +- **API charges**: $X per 1M tokens (input), $Y per 1M tokens (output) — varies by model +- **Response reports cost?**: No — only token counts in `usage` field +- **Our getCost**: Estimates cost at $Z per 1M total tokens based on median model pricing +- **Risk**: Actual cost varies by model; our estimate may over/undercharge for cheap/expensive models + +### Throttling +- **API limits**: 300 RPM per key (paid tier), 60 RPM (free tier) +- **Per-key or per-account**: Per key — more keys = more throughput +- **Our config**: 60 RPM per workspace (per_request mode) +- **With N keys**: Effective per-key rate is (total RPM across workspaces) / N +- **Headroom**: Comfortable — even 10 active workspaces at full rate = 600 RPM / 3 keys = 200 RPM per key, under the 300 RPM API limit +``` + +This summary helps reviewers verify that the pricing and rate limiting are well-calibrated and surfaces any risks that need monitoring. + +## Checklist + +- [ ] Provider added to `BYOKProviderId` in `tools/types.ts` +- [ ] Provider added to `VALID_PROVIDERS` in the BYOK keys API route +- [ ] API pricing docs researched — understand per-unit cost and whether the API reports cost in responses +- [ ] API rate limits researched — understand RPM/TPM limits, per-key vs per-account, and plan tiers +- [ ] `hosting` config added to the tool with `envKeyPrefix`, `apiKeyParam`, `byokProviderId`, `pricing`, and `rateLimit` +- [ ] `getCost` throws if required cost data is missing from the response +- [ ] Cost data captured in `transformResponse` or `postProcess` if API provides it +- [ ] `hideWhenHosted: true` added to the API key subblock in the block config +- [ ] Provider entry added to the BYOK settings UI with icon and description +- [ ] Env vars documented: `{PREFIX}_COUNT` and `{PREFIX}_1..N` +- [ ] Pricing and throttling summary provided to reviewer diff --git a/.agents/skills/add-hosted-key/agents/openai.yaml b/.agents/skills/add-hosted-key/agents/openai.yaml new file mode 100644 index 00000000000..97dfcf458cf --- /dev/null +++ b/.agents/skills/add-hosted-key/agents/openai.yaml @@ -0,0 +1,5 @@ +interface: + display_name: "Add Hosted Key" + short_description: "Add hosted API key to a tool" + brand_color: "#CA8A04" + default_prompt: "Use $add-hosted-key to add hosted API key support (metered and billed) to a Sim tool." diff --git a/.agents/skills/add-model/SKILL.md b/.agents/skills/add-model/SKILL.md new file mode 100644 index 00000000000..f97c966e1f3 --- /dev/null +++ b/.agents/skills/add-model/SKILL.md @@ -0,0 +1,209 @@ +--- +name: add-model +description: Add a new LLM model to `apps/sim/providers/models.ts` with every pricing and capability value verified against the provider's live API docs (no hallucination), plus the repo-side touchpoints that are not data-driven — hosted-key billing, tests, and provider-code handling. Use when adding a model to an existing provider in `apps/sim/providers/models.ts`. +--- + +# Add Model Skill + +You add a new model entry to `apps/sim/providers/models.ts`. **Every numeric and capability claim MUST be derived from a live web fetch of the provider's official docs in this session.** Marketing emails, training data, and your prior knowledge are not sources of truth — they routinely hallucinate pricing, context windows, and capability lists. + +## Hard rules (do not skip) + +1. **Live-fetch or refuse.** Before writing the entry, you must successfully WebFetch the provider's official models/pricing page in this session. If you cannot reach an authoritative source for any field, **mark the field as UNVERIFIED in your report and ask the user before guessing**. Never fill in pricing or capabilities from memory. +2. **Two-source rule for pricing.** Cross-check input/output/cached pricing against at least one secondary source (OpenRouter, Artificial Analysis, CloudPrice, mem0, intuitionlabs). If sources disagree, the provider's own docs win — but flag the disagreement. +3. **Read the code before setting capability flags.** Capability flags are dead unless the provider's implementation under `apps/sim/providers/{provider}/` actually consumes them (see Consumption Matrix below). Setting a flag the provider ignores is a silent bug. +4. **Cite every fact.** Your final report must list the URL each value came from. No URL → not verified. + +## Your Task + +1. Identify provider and model id from user args +2. Live-fetch official docs + pricing page + capability/parameter pages + at least one secondary source +3. Apply the Consumption Matrix to know which capability flags are real +4. Read 2-3 sibling entries in `models.ts` and match their pattern exactly +5. Check the repo-side touchpoints that are NOT data-driven (hosted-key billing, tests, provider code) +6. Insert the entry, run `bun run lint`, print the verification report + +## Step 1: Live source-of-truth lookup + +In priority order — fetch all that exist for the provider: + +| Provider | Models index | Pricing | Reasoning/parameter caveats | +|---|---|---|---| +| OpenAI | platform.openai.com/docs/models | openai.com/api/pricing | platform.openai.com/docs/guides/reasoning | +| Anthropic | docs.anthropic.com/en/docs/about-claude/models | anthropic.com/pricing | docs.anthropic.com/en/docs/build-with-claude/extended-thinking | +| Google (Gemini) | ai.google.dev/gemini-api/docs/models | ai.google.dev/pricing | ai.google.dev/gemini-api/docs/thinking | +| xAI | docs.x.ai/developers/models | docs.x.ai/developers/models (per-model detail page) | docs.x.ai/developers/model-capabilities/text/reasoning | +| Mistral | docs.mistral.ai/getting-started/models/models_overview | mistral.ai/pricing | n/a | +| DeepSeek | api-docs.deepseek.com/quick_start/pricing | same | api-docs.deepseek.com/guides/reasoning_model | +| Groq | console.groq.com/docs/models | groq.com/pricing | n/a | +| Cerebras | inference-docs.cerebras.ai/models | cerebras.ai/pricing | n/a | + +Secondary verification (use at least one): `openrouter.ai//`, `artificialanalysis.ai/models/`, `cloudprice.net/models/-`. + +Use a precise WebFetch prompt: *"Extract for {model_id}: exact model id string, context window in tokens, input price per 1M, cached input price per 1M, output price per 1M, max output tokens, supported reasoning effort levels, accepted parameters (temperature, top_p), release date. Do not fill in fields you cannot find."* + +## Step 2: Consumption Matrix (which provider honors which capability) + +| Capability | Honored by | Effect if set elsewhere | +|---|---|---| +| `temperature` | All providers (passed through if set) | Safe but inert on always-reasoning models that reject it | +| `toolUsageControl` | All providers (provider-level, not per-model) | n/a — set on `ProviderDefinition`, not models | +| `reasoningEffort` | `openai/core.ts`, `azure-openai`, `anthropic/core.ts` (mapped to thinking), `gemini/core.ts` | **Dead on xai, deepseek, mistral, groq, cerebras, openrouter, fireworks, bedrock, vertex** unless their core consumes it — re-grep before assuming | +| `verbosity` | `openai/core.ts`, `azure-openai/index.ts` only | Dead elsewhere | +| `thinking` | `anthropic/core.ts`, `gemini/core.ts` | Dead elsewhere | +| `nativeStructuredOutputs` | `anthropic/core.ts`, `fireworks/index.ts`, `openrouter/index.ts` | Dead on openai, xai, google, vertex, bedrock, azure-openai, deepseek, mistral, groq, cerebras | +| `maxOutputTokens` | Read by UI + executor for token estimation | Always meaningful — set if provider documents a cap | +| `computerUse` | `anthropic/core.ts` | Dead elsewhere | +| `deepResearch` | UI flag for routing to deep-research SKUs | Set only on actual deep-research model IDs | +| `memory: false` | Conversation persistence opt-out | Set only when model genuinely cannot maintain history (e.g., deep-research) | + +**Always re-grep before relying on this table** — the codebase moves: + +```bash +rg "reasoningEffort|reasoning_effort" apps/sim/providers// +rg "verbosity" apps/sim/providers// +rg "request\.thinking|thinking:" apps/sim/providers// +rg "supportsNativeStructuredOutputs|nativeStructuredOutputs" apps/sim/providers// +``` + +## Step 3: Match the provider's existing entry pattern + +Open `apps/sim/providers/models.ts`, find `PROVIDER_DEFINITIONS[].models`, read 2-3 sibling entries. Match field order exactly: + +```ts +{ + id: '', + pricing: { + input: , + cachedInput: , // omit if provider doesn't offer caching + output: , + updatedAt: '', + }, + capabilities: { + // only flags the provider actually consumes — see matrix + }, + contextWindow: , + releaseDate: '', + recommended: true, // only if new flagship; ask user before swapping + speedOptimized: true, // only on smallest/fastest tier + deprecated: true, // only on retired models +} +``` + +### Reseller providers (azure-openai, azure-anthropic, vertex, bedrock, openrouter) + +Model id MUST be prefixed: `azure/`, `azure-anthropic/`, `vertex/`, `bedrock/`, `openrouter/`. Pricing usually mirrors the upstream provider but verify on the reseller's own pricing page. + +### Insertion order + +Within a family, newest first (matches existing convention: GPT-5.5 above GPT-5.4 above GPT-5.2). Across families, biggest/flagship at top of list. + +### `recommended` / `speedOptimized` + +- At most one or two `recommended: true` per provider — the current flagship(s). +- If you're adding a new flagship, ask the user before removing `recommended` from the previous flagship. Never silently flip it. +- `speedOptimized: true` only on the smallest/fastest tier (nano, flash-lite, haiku class). + +## Step 4: Repo-side touchpoints beyond the entry + +Adding the `models.ts` entry is most of the job because nearly every consumer is **data-driven** and picks the model up automatically: the ~40 query helpers in `models.ts` / `providers/utils.ts`, the public `/models` catalog (`app/(landing)/models/utils.ts` iterates `PROVIDER_DEFINITIONS`), the agent-block model dropdown, and copilot's `isKnownModelId` / `suggestModelIdsForUnknownModel` validation. The touchpoints below are the exceptions — they are **not** data-driven, so check each one. + +### Hosted = auto-billed, by provider + +`getHostedModels()` in `apps/sim/providers/models.ts` returns **every** model under `openai`, `anthropic`, and `google`: + +```ts +export function getHostedModels(): string[] { + return [ + ...getProviderModels('openai'), + ...getProviderModels('anthropic'), + ...getProviderModels('google'), + ] +} +``` + +So a model added to any of those three providers is **automatically served with Sim's rotating hosted key and billed** to the workspace via `shouldBillModelUsage()` (`providers/utils.ts`). Before you insert: + +- **If the model should be BYOK-only / never-billed**, do NOT drop it under `openai`/`anthropic`/`google` as-is — that silently enrolls it in hosted billing. Confirm hosting/billing intent with the user. (Precedent: Ollama Cloud is a deliberately separate `isReseller` provider specifically to stay BYOK-only/never-billed.) +- **If the model should be hosted**, the deployment must actually have a key for it — the provider's `{PREFIX}_COUNT` / `{PREFIX}_1..N` env vars must be set, or hosted runs fail at execution time. +- State the hosted/billing status explicitly in the verification report. + +### Tests with hardcoded model IDs + +`bun run lint` does **not** run tests. A few tests assert specific model IDs and can break or need updating when you touch a hosted or flagship model: + +- `apps/sim/providers/utils.test.ts` — asserts membership of `getHostedModels()` / `shouldBillModelUsage()` +- `apps/sim/providers/index.test.ts` and serializer tests — reference concrete model IDs + +```bash +rg "|getHostedModels|shouldBillModelUsage" apps/sim/providers/*.test.ts +``` + +If anything matches, run the affected provider tests and update assertions as needed. + +### New API behavior is NOT data-driven + +The Consumption Matrix (Step 2) tells you which capability *flags* are honored by existing provider code. But if the new model needs **net-new** request handling that the provider doesn't implement yet — a new beta header (e.g. Anthropic's `anthropic-beta` structured-outputs header in `anthropic/index.ts`), a new thinking/reasoning encoding, a Responses-API quirk — you must edit `apps/sim/providers//core.ts` / `index.ts`. Setting a flag whose behavior isn't implemented is a silent no-op. + +### Wrong family entirely? + +- **Embedding or rerank model** → it does NOT go in the `models[]` array. Use `EMBEDDING_MODEL_PRICING` / `RERANK_MODEL_PRICING` in `models.ts` instead. +- **Brand-new provider** (not just a new model under an existing one) → much larger surface: add the id to `ProviderId` in `providers/types.ts`, a registry entry in `providers/registry.ts`, a provider implementation under `providers//`, an icon in `components/icons.tsx`, and the `PROVIDER_DEFINITIONS` block. That is beyond this skill — tell the user. + +## Step 5: Write, lint + +```bash +bun run lint +``` + +Lint must pass before reporting done. **If lint fails:** read the error, fix the syntax/typing issue in the entry you just wrote (do not delete the entry — it's the work product), re-run lint, and note the fix in a "Lint adjustments" line in the verification report. Never report done with lint failing. + +## Step 6: Verification report (mandatory format) + +End with this exact structure: + +```markdown +### Verification — + +| Field | Value | Source URL | Status | +|---|---|---|---| +| `id` | `grok-4.3` | https://docs.x.ai/... | ✓ verified | +| `contextWindow` | 1,000,000 | https://docs.x.ai/... + https://openrouter.ai/... | ✓ verified (2 sources agree) | +| `input` | $1.25/M | https://docs.x.ai/... | ✓ verified | +| `cachedInput` | $0.20/M | https://cloudprice.net/... | ⚠️ single source | +| `output` | $2.50/M | https://docs.x.ai/... + https://openrouter.ai/... | ✓ verified | +| `capabilities.temperature` | `{ min: 0, max: 1 }` | matches sibling entries | — pattern-match only | +| `capabilities.reasoningEffort` | NOT SET | provider docs say API rejects it for this model | ✓ correctly omitted | +| `releaseDate` | 2026-04-30 | https://docs.x.ai/... announcement | ✓ verified | +| hosted/billing | BYOK-only (xai not in `getHostedModels`) | `providers/models.ts` | — confirmed intent | + +**Disagreements** +- _none_ OR _OpenRouter says X, provider docs say Y — used Y per provider rule_ + +**Unverified fields** +- _none_ OR _: could not find authoritative source — left as based on sibling pattern; please confirm_ +``` + +If any row is ⚠️ single-source or "unverified," **state it plainly to the user and ask whether to proceed**. Do not silently merge. + +## What to do if you cannot find a source + +Omitting a field is **not the same as verifying it**. Any field you cannot confirm from a live fetch must be **both** omitted from the entry **and** listed as ❓ UNVERIFIED in the report's "Unverified fields" section, with the URLs you attempted. Then ask the user to confirm before merging. + +- Pricing missing → do NOT guess. Omit `cachedInput`. Mark ❓ UNVERIFIED. Ask the user for the price or the docs URL. +- Context window missing → do NOT guess. Ask the user; mark ❓ UNVERIFIED. +- Release date missing → omit the field; mark ❓ UNVERIFIED in the report. +- Capability uncertain → omit the flag (safer than setting a dead/wrong one); mark ❓ UNVERIFIED so the user knows you didn't confirm it either way. + +## Anti-patterns this skill exists to prevent + +- ❌ Trusting a marketing email (xAI's grok-4.3 email claimed "3 reasoning efforts" but the API rejects `reasoning_effort` — verified by official docs only) +- ❌ Setting `nativeStructuredOutputs: true` on xai/openai/google (dead — only anthropic/fireworks/openrouter consume it) +- ❌ Setting `thinking` on non-Anthropic/non-Gemini providers +- ❌ Setting `verbosity` on anything other than OpenAI gpt-5.x +- ❌ Copying `pricing.updatedAt` from a sibling instead of using today's date +- ❌ Inventing a `cachedInput` price by dividing input by 4 (varies by provider — find an explicit number) +- ❌ Stamping `recommended: true` on the new model without removing it from the previous flagship +- ❌ Adding a BYOK-only model under `openai`/`anthropic`/`google` (silently enrolls it in hosted billing via `getHostedModels()`) +- ❌ Reporting "done" after only `bun run lint` when you touched a hosted (openai/anthropic/google) or flagship model with assertions in `providers/utils.test.ts` +- ❌ Reporting "done" with any UNVERIFIED row in the table diff --git a/.agents/skills/add-model/agents/openai.yaml b/.agents/skills/add-model/agents/openai.yaml new file mode 100644 index 00000000000..4ba3fc233e8 --- /dev/null +++ b/.agents/skills/add-model/agents/openai.yaml @@ -0,0 +1,5 @@ +interface: + display_name: "Add Model" + short_description: "Add an LLM model, specs verified" + brand_color: "#0EA5E9" + default_prompt: "Use $add-model to add a new LLM model to Sim with pricing and capabilities verified against the provider's live docs." diff --git a/.agents/skills/council/SKILL.md b/.agents/skills/council/SKILL.md new file mode 100644 index 00000000000..433be332ebd --- /dev/null +++ b/.agents/skills/council/SKILL.md @@ -0,0 +1,12 @@ +--- +name: council +description: Spawn parallel task agents to explore a given area of the codebase from multiple angles, then use their findings to answer the question or build a plan. Use when a task needs broad fan-out exploration across many files before acting. +--- + +Based on the given area of interest, please: + +1. Dig around the codebase in terms of that given area of interest, gather general information such as keywords and architecture overview. +2. Spawn off n=10 (unless specified otherwise) task agents to dig deeper into the codebase in terms of that given area of interest, some of them should be out of the box for variance. +3. Once the task agents are done, use the information to do what the user wants. + +If user is in plan mode, use the information to create the plan. diff --git a/.agents/skills/validate-model/SKILL.md b/.agents/skills/validate-model/SKILL.md new file mode 100644 index 00000000000..4a02c0ad779 --- /dev/null +++ b/.agents/skills/validate-model/SKILL.md @@ -0,0 +1,170 @@ +--- +name: validate-model +description: Validate a model entry (or every model in a provider) in `apps/sim/providers/models.ts` against the provider's live API docs, reporting pricing and capability drift, dead capability flags, hosting/billing intent, and any field that cannot be verified. Use when auditing or repairing model entries under `apps/sim/providers/models.ts`. +--- + +# Validate Model Skill + +You audit one or more model entries in `apps/sim/providers/models.ts` against the provider's official live API docs. **Hallucinated pricing and capabilities are the #1 failure mode in this file.** Every numeric and capability claim must be re-derived from a live web fetch in this session — not from memory, not from training data, not from the user's marketing email. + +## Hard rules (do not skip) + +1. **Live-fetch or report unverified.** Each field must be backed by a live WebFetch in this session. If you cannot reach an authoritative URL for a field, mark it **UNVERIFIED** in the report — do not silently confirm it from memory. +2. **Cite every fact.** Every value in the report must show the source URL it was checked against. No URL → mark UNVERIFIED. +3. **Two-source rule for pricing.** Cross-check input/output/cached against at least one secondary source (OpenRouter, Artificial Analysis, CloudPrice). If sources disagree, the provider's own docs win — flag the disagreement. +4. **Inspect provider implementation before flagging capability mismatches.** A capability flag in `models.ts` is dead unless the provider's code under `apps/sim/providers/{provider}/` consumes it (see Consumption Matrix below). Setting a flag the provider ignores is a warning, not a critical. +5. **Never auto-fix without printing the diff.** Show the user the proposed diff before applying. Get confirmation. + +## Your Task + +When invoked as `/validate-model [model-id]`: + +1. Read the target entries from `models.ts` +2. Live-fetch the provider's official models, pricing, and capability/reasoning pages + at least one secondary source for pricing +3. Inspect the provider implementation to know which flags are actually consumed +4. Run the checklist below per model +5. Report findings (critical / warning / suggestion / unverified) with every cell linked to its source URL +6. Offer to fix; on confirm, edit `models.ts` in a single pass and re-lint + +If `model-id` is omitted, validate every model in the provider. + +## Step 1: Read entries from `models.ts` + +Capture per model: `id`, full `pricing`, full `capabilities`, `contextWindow`, `releaseDate`, `recommended`, `speedOptimized`, `deprecated`. + +## Step 2: Live-fetch authoritative sources + +Use the canonical provider URL table in `add-model.md` (Step 1) as the single source of truth — fetch the models index, pricing, and reasoning/parameter caveats pages listed there for the target provider. If you update one table, update the other in the same change. + +Secondary cross-check (use at least one): OpenRouter, Artificial Analysis, CloudPrice. + +If a fetch fails (404, timeout, paywall), record the URL attempted and mark dependent fields UNVERIFIED. + +## Step 3: Build the consumption map for this provider + +Re-grep before trusting the snapshot below: + +```bash +rg "reasoningEffort|reasoning_effort" apps/sim/providers// +rg "verbosity" apps/sim/providers// +rg "request\.thinking|thinking:" apps/sim/providers// +rg "supportsNativeStructuredOutputs|nativeStructuredOutputs" apps/sim/providers// +``` + +Snapshot (verify before relying): + +| Capability | Consumed by | +|---|---| +| `reasoningEffort` | `openai/core.ts`, `azure-openai`, `anthropic/core.ts` (mapped via thinking), `gemini/core.ts` | +| `verbosity` | `openai/core.ts`, `azure-openai/index.ts` | +| `thinking` | `anthropic/core.ts`, `gemini/core.ts` | +| `nativeStructuredOutputs` | `anthropic/core.ts`, `fireworks/index.ts`, `openrouter/index.ts` | +| `computerUse` | `anthropic/core.ts` | +| `temperature` | All providers (passthrough) | + +A flag set in `models.ts` but not in the consumption list for this provider = **warning: dead flag**. + +## Step 4: Run the checklist + +For each model, evaluate every row. Statuses: ✓ matches docs, ✗ disagrees, ⚠️ single-source, ❓ UNVERIFIED (could not fetch). + +### Identity +- [ ] `id` exactly matches provider's API model identifier (case, dots, dashes, prefix for resellers) +- [ ] `releaseDate` matches launch announcement +- [ ] `deprecated: true` set if provider has announced retirement (or removed from active list) + +### Pricing (per 1M tokens, USD) +- [ ] `pricing.input` matches provider pricing page +- [ ] `pricing.output` matches provider pricing page +- [ ] `pricing.cachedInput` matches provider's documented cached/prompt-cache rate (or is correctly omitted if no caching offered) +- [ ] `pricing.updatedAt` is recent — warn if older than 60 days + +### Context & output limits +- [ ] `contextWindow` matches docs (in tokens) +- [ ] `capabilities.maxOutputTokens` matches documented output cap (or is correctly omitted if "no output limit") + +### Capabilities (each must be DOCUMENTED-AS-SUPPORTED **and** CONSUMED-BY-PROVIDER-CODE) +- [ ] `temperature` — provider accepts it for this model (reasoning-always-on models often reject) +- [ ] `reasoningEffort.values` — list matches docs; **omitted** for always-reasoning models that reject the parameter (e.g., grok-4.3, where xAI docs explicitly state `reasoning_effort` is not supported). Verify per model — some always-reasoning models (e.g., OpenAI's o-series) DO accept `reasoning_effort` and should keep the flag. +- [ ] `verbosity.values` — only on OpenAI gpt-5.x family; values match docs +- [ ] `thinking.levels` + `thinking.default` — only on Anthropic/Gemini; values match docs +- [ ] `nativeStructuredOutputs` — only on anthropic/fireworks/openrouter; provider must document Structured Outputs / JSON-mode for this model +- [ ] `toolUsageControl` — provider supports `tool_choice` semantics +- [ ] `computerUse` — provider implements computer-use loop AND model is a computer-use SKU +- [ ] `deepResearch` — only on actual deep-research SKUs +- [ ] `memory: false` — only when the model genuinely cannot maintain conversation history + +### Flags +- [ ] `recommended: true` — at most one or two per provider; should be current flagship +- [ ] `speedOptimized: true` — only on smallest/fastest tier (nano / flash-lite / haiku class) + +### Hosting / billing +- [ ] If the model is under `openai`/`anthropic`/`google`, it is automatically in `getHostedModels()` → served with Sim's rotating key and billed via `shouldBillModelUsage()`. Confirm that is the intent (a BYOK-only model parked under one of these providers is a billing bug — warning). +- [ ] If the model is hosted, the deployment is expected to have its `{PREFIX}_COUNT` / `{PREFIX}_1..N` env vars set (ops concern; note if it looks unset for a model claiming hosted support). + +## Step 5: Report (mandatory format) + +For each model, emit a table with one row per checklist item. Every row that claims ✓ must have a URL. + +```markdown +### Validation — + +| Field | Repo | Live docs | Source URL | Status | +|---|---|---|---|---| +| `input` | $1.25/M | $1.25/M | https://docs.x.ai/... | ✓ | +| `cachedInput` | $0.50/M | $0.20/M | https://cloudprice.net/... | ✗ stale (price cut not picked up) | +| `reasoningEffort` | low/medium/high | rejected by API | https://docs.x.ai/.../reasoning | ✗ inert — selecting silently no-ops | +| `contextWindow` | 1,000,000 | 1,000,000 | https://docs.x.ai/... + https://openrouter.ai/... | ✓ (2 sources) | +| `releaseDate` | 2026-04-30 | not found in scraped pages | _attempted: docs.x.ai, x.ai/news_ | ❓ UNVERIFIED | + +**Findings** +- 🔴 critical — `cachedInput` is wrong: docs say $0.20/M, repo has $0.50/M +- 🟡 warning — `reasoningEffort` is set but provider rejects it for this model (xAI docs explicitly: "reasoning_effort is not supported by grok-4.3") +- 🔵 suggestion — `pricing.updatedAt` is 90 days old; refresh +- ❓ unverified — `releaseDate` could not be confirmed from any fetched page; ask user + +**Disagreements between sources** +- _none_ OR _OpenRouter says $X, provider docs say $Y — went with provider docs_ +``` + +End each multi-model run with a summary count: `N models checked · X critical · Y warnings · Z suggestions · W unverified`. + +## Step 6: Offer to fix + +After reporting, ask: *"Want me to fix the critical and warning items? I'll print the diff first."* On yes: + +1. Print the proposed diff (do not apply yet) +2. Get user confirmation +3. Edit `models.ts` in a single pass +4. Run `bun run lint` +5. Re-run only the failed rows of the checklist on the new state + +## Severity definitions + +- 🔴 **critical** — wrong number or wrong identifier that misleads users about cost or breaks API calls. Examples: incorrect pricing, wrong model id, wrong context window, capability the API rejects. +- 🟡 **warning** — dead code or internal inconsistency. Examples: capability flag the provider ignores, multiple `recommended: true` per provider, `pricing.updatedAt` >60 days old, missing `deprecated: true` on retired model. +- 🔵 **suggestion** — style/consistency. Examples: field order, missing `speedOptimized` on a clearly smallest-tier model. +- ❓ **unverified** — could not fetch an authoritative source for this field. Surface it; never silently confirm. + +## Common bugs this skill catches + +- Pricing drift after a provider price cut (very common — providers cut quarterly) +- `reasoningEffort` set on always-reasoning models that reject the parameter (grok-4.3, o3-pro pattern) +- `nativeStructuredOutputs` set on providers that don't consume the flag (dead) +- `thinking` set on non-Anthropic/non-Gemini providers +- `verbosity` set on non-gpt-5.x models +- Wrong context window (e.g., 128k claimed vs 200k actual) +- Stale `pricing.updatedAt` +- Multiple `recommended: true` per provider after a flagship swap +- Missing `deprecated: true` on retired models (e.g., the xAI batch retiring May 15, 2026) + +## What "I cannot verify this" looks like + +If, after fetching the documented sources, a field cannot be confirmed: + +- Mark the row ❓ UNVERIFIED with the URL(s) attempted +- Surface it in the **Findings** section with severity ❓ +- Do NOT mark the validation as passed +- Ask the user for a docs URL or guidance before changing anything + +The skill is allowed to say *"I could not verify the cached input price for grok-4.3 from the official xAI docs in this session — I attempted [URLs] without finding the value. Third-party sources [URL1, URL2] both report $0.20/M. Confirm before I update."* That is correct behavior. Hallucinating a number is not. diff --git a/.agents/skills/validate-model/agents/openai.yaml b/.agents/skills/validate-model/agents/openai.yaml new file mode 100644 index 00000000000..fa58d32ca8b --- /dev/null +++ b/.agents/skills/validate-model/agents/openai.yaml @@ -0,0 +1,5 @@ +interface: + display_name: "Validate Model" + short_description: "Audit model entries vs live docs" + brand_color: "#0891B2" + default_prompt: "Use $validate-model to audit Sim model entries against the provider's live API docs." diff --git a/.claude/commands/add-model.md b/.claude/commands/add-model.md index 1fcf828537c..c52e1b451f9 100644 --- a/.claude/commands/add-model.md +++ b/.claude/commands/add-model.md @@ -20,7 +20,8 @@ You add a new model entry to `apps/sim/providers/models.ts`. **Every numeric and 2. Live-fetch official docs + pricing page + capability/parameter pages + at least one secondary source 3. Apply the Consumption Matrix to know which capability flags are real 4. Read 2-3 sibling entries in `models.ts` and match their pattern exactly -5. Insert the entry, run `bun run lint`, print the verification report +5. Check the repo-side touchpoints that are NOT data-driven (hosted-key billing, tests, provider code) +6. Insert the entry, run `bun run lint`, print the verification report ## Step 1: Live source-of-truth lookup @@ -103,7 +104,53 @@ Within a family, newest first (matches existing convention: GPT-5.5 above GPT-5. - If you're adding a new flagship, ask the user before removing `recommended` from the previous flagship. Never silently flip it. - `speedOptimized: true` only on the smallest/fastest tier (nano, flash-lite, haiku class). -## Step 4: Write, lint +## Step 4: Repo-side touchpoints beyond the entry + +Adding the `models.ts` entry is most of the job because nearly every consumer is **data-driven** and picks the model up automatically: the ~40 query helpers in `models.ts` / `providers/utils.ts`, the public `/models` catalog (`app/(landing)/models/utils.ts` iterates `PROVIDER_DEFINITIONS`), the agent-block model dropdown, and copilot's `isKnownModelId` / `suggestModelIdsForUnknownModel` validation. The touchpoints below are the exceptions — they are **not** data-driven, so check each one. + +### Hosted = auto-billed, by provider + +`getHostedModels()` in `apps/sim/providers/models.ts` returns **every** model under `openai`, `anthropic`, and `google`: + +```ts +export function getHostedModels(): string[] { + return [ + ...getProviderModels('openai'), + ...getProviderModels('anthropic'), + ...getProviderModels('google'), + ] +} +``` + +So a model added to any of those three providers is **automatically served with Sim's rotating hosted key and billed** to the workspace via `shouldBillModelUsage()` (`providers/utils.ts`). Before you insert: + +- **If the model should be BYOK-only / never-billed**, do NOT drop it under `openai`/`anthropic`/`google` as-is — that silently enrolls it in hosted billing. Confirm hosting/billing intent with the user. (Precedent: Ollama Cloud is a deliberately separate `isReseller` provider specifically to stay BYOK-only/never-billed.) +- **If the model should be hosted**, the deployment must actually have a key for it — the provider's `{PREFIX}_COUNT` / `{PREFIX}_1..N` env vars must be set, or hosted runs fail at execution time. +- State the hosted/billing status explicitly in the verification report. + +### Tests with hardcoded model IDs + +`bun run lint` does **not** run tests. A few tests assert specific model IDs and can break or need updating when you touch a hosted or flagship model: + +- `apps/sim/providers/utils.test.ts` — asserts membership of `getHostedModels()` / `shouldBillModelUsage()` +- `apps/sim/providers/index.test.ts` and serializer tests — reference concrete model IDs + +```bash +rg "|getHostedModels|shouldBillModelUsage" apps/sim/providers/*.test.ts +``` + +If anything matches, run the affected provider tests and update assertions as needed. + +### New API behavior is NOT data-driven + +The Consumption Matrix (Step 2) tells you which capability *flags* are honored by existing provider code. But if the new model needs **net-new** request handling that the provider doesn't implement yet — a new beta header (e.g. Anthropic's `anthropic-beta` structured-outputs header in `anthropic/index.ts`), a new thinking/reasoning encoding, a Responses-API quirk — you must edit `apps/sim/providers//core.ts` / `index.ts`. Setting a flag whose behavior isn't implemented is a silent no-op. + +### Wrong family entirely? + +- **Embedding or rerank model** → it does NOT go in the `models[]` array. Use `EMBEDDING_MODEL_PRICING` / `RERANK_MODEL_PRICING` in `models.ts` instead. +- **Brand-new provider** (not just a new model under an existing one) → much larger surface: add the id to `ProviderId` in `providers/types.ts`, a registry entry in `providers/registry.ts`, a provider implementation under `providers//`, an icon in `components/icons.tsx`, and the `PROVIDER_DEFINITIONS` block. That is beyond this skill — tell the user. + +## Step 5: Write, lint ```bash bun run lint @@ -111,7 +158,7 @@ bun run lint Lint must pass before reporting done. **If lint fails:** read the error, fix the syntax/typing issue in the entry you just wrote (do not delete the entry — it's the work product), re-run lint, and note the fix in a "Lint adjustments" line in the verification report. Never report done with lint failing. -## Step 5: Verification report (mandatory format) +## Step 6: Verification report (mandatory format) End with this exact structure: @@ -128,6 +175,7 @@ End with this exact structure: | `capabilities.temperature` | `{ min: 0, max: 1 }` | matches sibling entries | — pattern-match only | | `capabilities.reasoningEffort` | NOT SET | provider docs say API rejects it for this model | ✓ correctly omitted | | `releaseDate` | 2026-04-30 | https://docs.x.ai/... announcement | ✓ verified | +| hosted/billing | BYOK-only (xai not in `getHostedModels`) | `providers/models.ts` | — confirmed intent | **Disagreements** - _none_ OR _OpenRouter says X, provider docs say Y — used Y per provider rule_ @@ -156,4 +204,6 @@ Omitting a field is **not the same as verifying it**. Any field you cannot confi - ❌ Copying `pricing.updatedAt` from a sibling instead of using today's date - ❌ Inventing a `cachedInput` price by dividing input by 4 (varies by provider — find an explicit number) - ❌ Stamping `recommended: true` on the new model without removing it from the previous flagship +- ❌ Adding a BYOK-only model under `openai`/`anthropic`/`google` (silently enrolls it in hosted billing via `getHostedModels()`) +- ❌ Reporting "done" after only `bun run lint` when you touched a hosted (openai/anthropic/google) or flagship model with assertions in `providers/utils.test.ts` - ❌ Reporting "done" with any UNVERIFIED row in the table diff --git a/.claude/commands/validate-model.md b/.claude/commands/validate-model.md index 10c6aaa0b27..c376f808b28 100644 --- a/.claude/commands/validate-model.md +++ b/.claude/commands/validate-model.md @@ -98,6 +98,10 @@ For each model, evaluate every row. Statuses: ✓ matches docs, ✗ disagrees, - [ ] `recommended: true` — at most one or two per provider; should be current flagship - [ ] `speedOptimized: true` — only on smallest/fastest tier (nano / flash-lite / haiku class) +### Hosting / billing +- [ ] If the model is under `openai`/`anthropic`/`google`, it is automatically in `getHostedModels()` → served with Sim's rotating key and billed via `shouldBillModelUsage()`. Confirm that is the intent (a BYOK-only model parked under one of these providers is a billing bug — warning). +- [ ] If the model is hosted, the deployment is expected to have its `{PREFIX}_COUNT` / `{PREFIX}_1..N` env vars set (ops concern; note if it looks unset for a model claiming hosted support). + ## Step 5: Report (mandatory format) For each model, emit a table with one row per checklist item. Every row that claims ✓ must have a URL. From 324b596f30fb2950d02f89eefa024328b7092815 Mon Sep 17 00:00:00 2001 From: Waleed Latif Date: Thu, 4 Jun 2026 12:13:07 -0700 Subject: [PATCH 2/2] chore(skills): document council yaml omission and disambiguate validate-model cross-ref --- .agents/skills/council/SKILL.md | 1 + .agents/skills/validate-model/SKILL.md | 2 +- .claude/commands/validate-model.md | 2 +- 3 files changed, 3 insertions(+), 2 deletions(-) diff --git a/.agents/skills/council/SKILL.md b/.agents/skills/council/SKILL.md index 433be332ebd..0df112728be 100644 --- a/.agents/skills/council/SKILL.md +++ b/.agents/skills/council/SKILL.md @@ -1,6 +1,7 @@ --- name: council description: Spawn parallel task agents to explore a given area of the codebase from multiple angles, then use their findings to answer the question or build a plan. Use when a task needs broad fan-out exploration across many files before acting. +# No agents/openai.yaml by design: council is a meta/exploration utility (like cleanup, ship, you-might-not-need-*), not a service-integration builder, so it intentionally ships no standalone agent card. --- Based on the given area of interest, please: diff --git a/.agents/skills/validate-model/SKILL.md b/.agents/skills/validate-model/SKILL.md index 4a02c0ad779..c05b2e3527b 100644 --- a/.agents/skills/validate-model/SKILL.md +++ b/.agents/skills/validate-model/SKILL.md @@ -34,7 +34,7 @@ Capture per model: `id`, full `pricing`, full `capabilities`, `contextWindow`, ` ## Step 2: Live-fetch authoritative sources -Use the canonical provider URL table in `add-model.md` (Step 1) as the single source of truth — fetch the models index, pricing, and reasoning/parameter caveats pages listed there for the target provider. If you update one table, update the other in the same change. +Use the canonical provider URL table in the `add-model` skill (`.claude/commands/add-model.md`, or its mirror `.agents/skills/add-model/SKILL.md`), Step 1, as the single source of truth — fetch the models index, pricing, and reasoning/parameter caveats pages listed there for the target provider. If you update one table, update the other in the same change. Secondary cross-check (use at least one): OpenRouter, Artificial Analysis, CloudPrice. diff --git a/.claude/commands/validate-model.md b/.claude/commands/validate-model.md index c376f808b28..bf1d30745b6 100644 --- a/.claude/commands/validate-model.md +++ b/.claude/commands/validate-model.md @@ -34,7 +34,7 @@ Capture per model: `id`, full `pricing`, full `capabilities`, `contextWindow`, ` ## Step 2: Live-fetch authoritative sources -Use the canonical provider URL table in `add-model.md` (Step 1) as the single source of truth — fetch the models index, pricing, and reasoning/parameter caveats pages listed there for the target provider. If you update one table, update the other in the same change. +Use the canonical provider URL table in the `add-model` skill (`.claude/commands/add-model.md`, or its mirror `.agents/skills/add-model/SKILL.md`), Step 1, as the single source of truth — fetch the models index, pricing, and reasoning/parameter caveats pages listed there for the target provider. If you update one table, update the other in the same change. Secondary cross-check (use at least one): OpenRouter, Artificial Analysis, CloudPrice.