Implement eval collection and filtering mechanism

please look at
- @packages/ai/src/evals/eval.ts
- @packages/ai/src/evals/run-vitest.ts
- @packages/ai/src/bin.ts
- any other files you need

calling LLMs is expensive! i think we should give users a way to see which evals will be run by a given command, so they could do something like axiom eval **/product*.ts --collect-only and it would print out a list of files and the evals in them that would run in this case.

in run-vitest.ts we currently do enabled: !opts.debug, but that's probably not the way to go… think carefully about how you would approach this. if you see multiple options, give me pros and cons and your recommendation.

The user wants to add a --collect-only flag (or similar) to the eval command that would show which evals will be run without actually executing them.

Let me understand the current flow:

bin.ts - Sets up the CLI with commander
The eval command is loaded via loadEvalCommand(program, overrides)
run-vitest.ts - Actually runs vitest with the provided options
eval.ts - Defines the Eval() function that registers evaluations using vitest's describe()

The user mentions that currently they use enabled: !opts.debug in run-vitest.ts to control instrumentation, but they're wondering if that's the right approach for a collect-only mode.

Let me think about different approaches:

Option 1: Add a --collect-only flag that disables test execution in Vitest

Vitest has a --list or --reporter=json mode that can list tests
We could add a collectOnly option and pass it to vitest
Vitest might have built-in support for this

Option 2: Dry-run mode that loads files but doesn't execute

Similar to --debug mode
Would still register the evals but not run them
Could extract metadata from the registered tests

Option 3: Use Vitest's built-in listing capabilities

Vitest might have a flag to list tests without running them
We could leverage that

Let me check what Vitest supports... I should look for Vitest's API options.

The key considerations:

We want to show which eval files will run
We want to show the eval names/descriptions
We don't want to actually execute the expensive LLM calls
We probably still need to load the data to know how many cases there are

Pros/cons of different approaches:

Option 1: Use Vitest's list or similar option

Vitest's built-in listing provides a straightforward way to discover tests with minimal custom implementation. It offers a clean approach to showing available evaluations without running them. The challenge lies in getting precise control over the output and ensuring no expensive operations are triggered.

Option 2: Add collectOnly flag
A custom flag gives maximum flexibility in controlling test discovery and output. This approach mirrors existing testing frameworks' collect-only modes, providing a familiar pattern for users. The implementation would require careful handling to prevent unintended side effects during test loading.

The most promising direction seems to be a hybrid approach that combines Vitest's native listing capabilities with custom output formatting to meet the specific needs of showing eval details without execution. I'll explore adding a collectOnly flag to efficiently list tests without full execution. The approach involves passing this flag through the Vitest configuration and potentially creating a custom reporter to handle test collection. We need to carefully design how this flag interacts with test registration and data loading to minimize overhead while providing useful test information.

The key considerations involve determining how to intercept test registration, prevent full test execution, and potentially show test metadata without running the entire test suite. A CLI flag seems the most straightforward mechanism for triggering this behavior.

I'll analyze the codebase to understand the eval execution flow and explore options for a --collect-only feature.

packages/ai/src/cli/commands/eval.command.ts

list|collect|dry: 48 matches in packages/ai/src

Find if Vitest has a built-in option to list or collect tests without running them, similar to pytest --collect-only

vitest.dev/guide/cli

Perfect! Now let me check the Vitest API to see how to use this programmatically:

Find how to use Vitest list mode programmatically via createVitest API

vitest.dev/advanced/api/vitest

Excellent! Now I have enough information. Let me present my analysis:

Analysis: `--collect-only` Feature for Eval Command

Here are the main approaches I'd recommend:

Option 1: Use Vitest's Built-in `collect()` Method ⭐ RECOMMENDED

Implementation:

Add --collect-only flag to the eval command
Pass collectOnly: boolean to runVitest()
Instead of calling vi.start(), call vi.collect() when collectOnly is true
Format and print the collected test structure

Pros:

Leverages Vitest's native functionality
Will execute data() functions, so you can show case counts
Executes test file registration (the describe callbacks), so all eval metadata is available
Consistent with how Vitest works internally
Relatively simple implementation

Cons:

Still loads and executes data() functions (which might be slow for large datasets)
Not completely "free" - files are imported and describe blocks run
Need to disable instrumentation separately (but you already do this for debug mode)

Option 2: Use Vitest's Experimental Static Analysis

Implementation:

Use vitest.experimental_parseSpecifications() for static analysis
Parse test structure without executing any code

Pros:

Fastest option - no code execution
Zero cost for expensive data loading
No risk of side effects from test file imports

Cons:

Marked as experimental API
Cannot show dataset sizes (data() never executes)
Won't detect dynamically named tests
Less information available for users
More likely to break with Vitest updates

Option 3: Custom Reporter + Skip Mode

Implementation:

Add collectOnly flag
Use Vitest's mode.skip or similar to skip test execution
Create custom reporter behavior for collect-only mode

Pros:

More control over output format
Could potentially skip data loading with careful implementation

Cons:

More complex - requires understanding Vitest's skip mechanisms
data() functions might still execute during describe phase
More maintenance burden

My Recommendation: Option 1

Use Vitest's collect() method with these implementation details:

// In run-vitest.ts
import { createVitest } from 'vitest/node';

export const runVitest = async (dir: string, opts: {
  // ... existing opts
  collectOnly?: boolean;
}) => {
  // ... existing setup

  const vi = await createVitest('test', { /* config */ });

  if (opts.collectOnly) {
    // Collect-only mode: import the eval files and register their suites,
    // but do not run any test bodies (no task or scorer calls).
    const result = await vi.collect();
    printCollectedEvals(result);  // Custom printer (still to be written)
    await vi.close();
    return;
  }

  // Normal execution path
  await vi.start();
  // ...
};

Key decisions:

Disable instrumentation when collectOnly is true (similar to debug mode)
Don't load baselines when collecting (saves API calls)
Do execute data() functions so you can show: "ProductQuality eval: 15 cases"
Skip all scorer/task execution (Vitest handles this automatically)
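
The first two decisions can be sketched as one small pure function that turns the CLI flags into a run plan. This is only an illustration: `EvalRunFlags`, `EvalRunPlan`, and `planEvalRun` are hypothetical names that don't exist in the codebase, and the sketch assumes that baselines are needed only when evals actually execute.

```typescript
// Hypothetical flag shape; only `debug` and `collectOnly` come from the thread.
interface EvalRunFlags {
  debug?: boolean;
  collectOnly?: boolean;
}

// Hypothetical description of what a single invocation should do.
interface EvalRunPlan {
  mode: 'collect' | 'run';
  instrumentationEnabled: boolean;
  loadBaselines: boolean;
}

function planEvalRun(flags: EvalRunFlags): EvalRunPlan {
  const debug = flags.debug === true;
  const collectOnly = flags.collectOnly === true;
  return {
    mode: collectOnly ? 'collect' : 'run',
    // Extends the existing `enabled: !opts.debug` check to collect-only mode.
    instrumentationEnabled: !debug && !collectOnly,
    // Assumption: baselines are only useful when results are produced.
    loadBaselines: !collectOnly,
  };
}
```

Keeping this decision in one place would avoid scattering `opts.debug || opts.collectOnly` checks through run-vitest.ts, though whether that is worth a separate helper is a judgment call.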

This gives users a helpful preview like:

📋 Found 3 evaluations:

  ✓ product-description.eval.ts
    • Product Description Quality (12 cases)
  
  ✓ search-relevance.eval.ts  
    • Search Ranking Quality (50 cases)
    
  ✓ classification.eval.ts
    • Intent Classification (8 cases)

Total: 3 evaluations, 70 test cases
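
A minimal sketch of a printer that produces this layout, assuming the collected results have already been reduced to a plain per-file summary. The `CollectedFile` and `CollectedEval` shapes and the `formatCollectedEvals` name are hypothetical, and pluralization ("1 evaluations") is deliberately left unhandled.

```typescript
// Hypothetical summary shapes, independent of Vitest's own types.
interface CollectedEval {
  name: string;
  cases: number;
}

interface CollectedFile {
  file: string;
  evals: CollectedEval[];
}

// Builds the preview as a single string so it is easy to test and to print.
function formatCollectedEvals(files: CollectedFile[]): string {
  const evalCount = files.reduce((n, f) => n + f.evals.length, 0);
  const caseCount = files.reduce(
    (n, f) => n + f.evals.reduce((m, e) => m + e.cases, 0),
    0,
  );

  const lines: string[] = [`📋 Found ${evalCount} evaluations:`, ''];
  for (const f of files) {
    lines.push(`  ✓ ${f.file}`);
    for (const e of f.evals) {
      lines.push(`    • ${e.name} (${e.cases} cases)`);
    }
    lines.push(''); // blank line between files
  }
  lines.push(`Total: ${evalCount} evaluations, ${caseCount} test cases`);
  return lines.join('\n');
}
```

Calling `console.log(formatCollectedEvals(files))` from the collect-only branch would print the preview; returning a string rather than logging directly keeps the formatter testable.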

sounds good, and i like the printer you came up with. please do it.

The user wants me to implement the --collect-only feature using Option 1 (Vitest's collect() method) with the custom printer I suggested.

Let me plan the implementation:

1. Add --collect-only flag to the eval command in eval.command.ts
2. Pass collectOnly option to runVitest in run-vitest.ts
3. Modify run-vitest.ts to handle collect-only mode:
   - Skip instrumentation initialization when collectOnly is true
   - Call vi.collect() instead of vi.start()
   - Create a custom printer function to display the results
4. Make sure to skip baseline loading when in collect-only mode
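
Before the printer can run, the collected result has to be reduced to file, eval name, and case count. The sketch below shows one way to do that. It assumes the collected modules look like Vitest's reported-tasks objects (`moduleId`, `children.suites()`, `children.allTests()`), and that each Eval() registers one top-level suite with one test per case, which this thread has not confirmed. The structural `ModuleLike` and `SuiteLike` interfaces and the `summarizeModules` name are hypothetical.

```typescript
// Minimal structural types standing in for the collected module objects.
interface SuiteLike {
  name: string;
  children: { allTests(): Iterable<unknown> };
}

interface ModuleLike {
  moduleId: string;
  children: { suites(): Iterable<SuiteLike> };
}

interface EvalSummary {
  file: string;
  evals: { name: string; cases: number }[];
}

function summarizeModules(modules: ModuleLike[]): EvalSummary[] {
  return modules.map((mod) => ({
    // Show only the file name; fall back to the full id if there is no slash.
    file: mod.moduleId.split('/').pop() ?? mod.moduleId,
    // Assumption: one top-level suite per eval, one test per data() case.
    evals: Array.from(mod.children.suites(), (suite) => ({
      name: suite.name,
      cases: Array.from(suite.children.allTests()).length,
    })),
  }));
}
```

If eval.ts nests its suites differently, only the traversal inside `summarizeModules` should need to change.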

Let me start by creating the todo list and then implementing these changes.

Updated TODOs
