
Implement eval collection and filtering mechanism

please look at
- @packages/ai/src/evals/eval.ts
- @packages/ai/src/evals/run-vitest.ts
- @packages/ai/src/bin.ts
- any other files you need

calling LLMs is expensive! i think we should give users a way to see which evals will be run by a given command so they could do like axiom eval **/product*.ts --collect-only and then it would print out a list of files and their evals that will run in this case.

in run-vitest.ts we currently do enabled: !opts.debug, but that's probably not the way to go…

think carefully about how you would approach this. if you see multiple options, give me pros and cons and your recommendation.

The user wants to add a --collect-only flag (or similar) to the eval command that would show which evals will be run without actually executing them.

Let me understand the current flow:

  1. bin.ts - Sets up the CLI with commander
  2. The eval command is loaded via loadEvalCommand(program, overrides)
  3. run-vitest.ts - Actually runs vitest with the provided options
  4. eval.ts - Defines the Eval() function that registers evaluations using vitest's describe()

The user mentions that currently they use enabled: !opts.debug in run-vitest.ts to control instrumentation, but they're wondering if that's the right approach for a collect-only mode.

Let me think about different approaches:

Option 1: Add a --collect-only flag that disables test execution in Vitest

  • Vitest has a list command (vitest list, optionally with --json) that can list tests without running them
  • We could add a collectOnly option and pass it to vitest
  • Vitest might have built-in support for this

Option 2: Dry-run mode that loads files but doesn't execute

  • Similar to --debug mode
  • Would still register the evals but not run them
  • Could extract metadata from the registered tests

Option 3: Use Vitest's built-in listing capabilities

  • Vitest might have a flag to list tests without running them
  • We could leverage that

Let me check what Vitest supports... I should look for Vitest's API options.

The key considerations:

  1. We want to show which eval files will run
  2. We want to show the eval names/descriptions
  3. We don't want to actually execute the expensive LLM calls
  4. We probably still need to load the data to know how many cases there are

Pros/cons of different approaches:

Option 1: Use Vitest's list or similar option

Vitest's built-in listing provides a straightforward way to discover tests with minimal custom implementation. It offers a clean approach to showing available evaluations without running them. The challenge lies in getting precise control over the output and ensuring no expensive operations are triggered.

Option 2: Add collectOnly flag

A custom flag gives maximum flexibility in controlling test discovery and output. This approach mirrors existing testing frameworks' collect-only modes, providing a familiar pattern for users. The implementation would require careful handling to prevent unintended side effects during test loading.

The most promising direction seems to be a hybrid approach that combines Vitest's native listing capabilities with custom output formatting to meet the specific needs of showing eval details without execution.

I'll explore adding a collectOnly flag to efficiently list tests without full execution. The approach involves passing this flag through the Vitest configuration and potentially creating a custom reporter to handle test collection. We need to carefully design how this flag interacts with test registration and data loading to minimize overhead while providing useful test information.

The key considerations involve determining how to intercept test registration, prevent full test execution, and potentially show test metadata without running the entire test suite. A CLI flag seems the most straightforward mechanism for triggering this behavior.

I'll analyze the codebase to understand the eval execution flow and explore options for a --collect-only feature.

list|collect|dry 48 matches

Perfect! Now let me check the Vitest API to see how to use this programmatically:

Excellent! Now I have enough information. Let me present my analysis:

Analysis: --collect-only Feature for Eval Command

Here are the main approaches I'd recommend:

Option 1: Use Vitest's Built-in collect() Method (RECOMMENDED)

Implementation:

  • Add --collect-only flag to the eval command
  • Pass collectOnly: boolean to runVitest()
  • Instead of calling vi.start(), call vi.collect() when collectOnly is true
  • Format and print the collected test structure

Pros:

  • Leverages Vitest's native functionality
  • Will execute data() functions, so you can show case counts
  • Executes test file registration (the describe callbacks), so all eval metadata is available
  • Consistent with how Vitest works internally
  • Relatively simple implementation

Cons:

  • Still loads and executes data() functions (which might be slow for large datasets)
  • Not completely "free" - files are imported and describe blocks run
  • Need to disable instrumentation separately (but you already do this for debug mode)

Option 2: Use Vitest's Experimental Static Analysis

Implementation:

  • Use vitest.experimental_parseSpecifications() for static analysis
  • Parse test structure without executing any code

Pros:

  • Fastest option - no code execution
  • Zero cost for expensive data loading
  • No risk of side effects from test file imports

Cons:

  • Marked as experimental API
  • Cannot show dataset sizes (data() never executes)
  • Won't detect dynamically named tests
  • Less information available for users
  • More likely to break with Vitest updates

Option 3: Custom Reporter + Skip Mode

Implementation:

  • Add collectOnly flag
  • Use Vitest's mode.skip or similar to skip test execution
  • Create custom reporter behavior for collect-only mode

Pros:

  • More control over output format
  • Could potentially skip data loading with careful implementation

Cons:

  • More complex - requires understanding Vitest's skip mechanisms
  • data() functions might still execute during describe phase
  • More maintenance burden

My Recommendation: Option 1

Use Vitest's collect() method with these implementation details:

// In run-vitest.ts
import { createVitest } from 'vitest/node';

export const runVitest = async (dir: string, opts: {
  // ... existing opts
  collectOnly?: boolean;
}) => {
  // ... existing setup

  const vi = await createVitest('test', { /* config */ });

  if (opts.collectOnly) {
    // List mode: import the eval files and register suites, but never run test callbacks
    const result = await vi.collect();
    printCollectedEvals(result);  // Custom printer (to be written)
    await vi.close();
    return;
  }

  // Normal execution path
  await vi.start();
  // ...
};
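To make the "custom printer" step concrete, here is a minimal sketch of how the collected result could be reduced to per-file eval summaries. It assumes each Eval() call registers one top-level suite and each data case becomes one test inside it, which is not confirmed by the files above. The ModuleLike, SuiteLike and TestLike interfaces are hypothetical local types that mirror only the small part of Vitest's reported-task objects this sketch reads (moduleId, children.suites(), children.allTests()); summarizeModules is an illustrative name, not existing code.

```typescript
// Hypothetical structural types: only the fields this sketch reads.
// They are assumed to match a subset of Vitest's TestModule / TestSuite objects.
interface TestLike {
  name: string;
}
interface SuiteLike {
  name: string;
  children: { allTests(): Iterable<TestLike> };
}
interface ModuleLike {
  moduleId: string;
  children: { suites(): Iterable<SuiteLike> };
}

export interface EvalSummary {
  name: string;
  cases: number;
}
export interface FileSummary {
  file: string;
  evals: EvalSummary[];
}

// Reduce collected modules to "file -> evals -> case count".
// Assumption: one top-level suite per Eval(), one test per data case.
export function summarizeModules(modules: Iterable<ModuleLike>): FileSummary[] {
  const files: FileSummary[] = [];
  for (const mod of modules) {
    const evals: EvalSummary[] = [];
    for (const suite of mod.children.suites()) {
      // Count every test nested anywhere under this suite.
      const cases = Array.from(suite.children.allTests()).length;
      evals.push({ name: suite.name, cases });
    }
    files.push({ file: mod.moduleId, evals });
  }
  return files;
}
```

If the real registration nests suites differently, only the counting loop would need to change; the summary shape can stay the same.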

Key decisions:

  1. Disable instrumentation when collectOnly is true (similar to debug mode)
  2. Don't load baselines when collecting (saves API calls)
  3. Do execute data() functions so you can show: "ProductQuality eval: 15 cases"
  4. Skip all scorer/task execution (Vitest handles this automatically)
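
The decisions above can be sketched as one small, pure function, so the flag logic is not scattered across run-vitest.ts. This is only an illustration: planEvalRun, EvalRunFlags and EvalRunPlan are hypothetical names, and the sketch assumes debug mode keeps loading baselines as before, which the files above do not confirm.

```typescript
// Hypothetical flag and plan shapes; not part of the existing codebase.
export interface EvalRunFlags {
  debug?: boolean;
  collectOnly?: boolean;
}
export interface EvalRunPlan {
  instrument: boolean;    // initialize instrumentation?
  loadBaselines: boolean; // fetch baselines (may cost API calls)?
  executeTests: boolean;  // start() vs collect()
}

// Derive what a run should do from the CLI flags.
export function planEvalRun(flags: EvalRunFlags): EvalRunPlan {
  const debug = flags.debug === true;
  const collectOnly = flags.collectOnly === true;
  return {
    // Replaces `enabled: !opts.debug`: collect-only also disables instrumentation.
    instrument: !debug && !collectOnly,
    // Assumption: only collect-only skips baselines; debug is left unchanged.
    loadBaselines: !collectOnly,
    // Collect-only registers evals but never runs tasks or scorers.
    executeTests: !collectOnly,
  };
}
```

Keeping this as a pure function makes each flag combination easy to unit test without starting Vitest.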

This gives users a helpful preview like:

📋 Found 3 evaluations:

  ✓ product-description.eval.ts
    • Product Description Quality (12 cases)
  
  ✓ search-relevance.eval.ts  
    • Search Ranking Quality (50 cases)
    
  ✓ classification.eval.ts
    • Intent Classification (8 cases)

Total: 3 evaluations, 70 test cases
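
A minimal sketch of a formatter that produces the preview above from already-summarized data. The CollectedFile and CollectedEval shapes and the formatCollectedEvals name are assumptions for illustration; the function returns a string so the caller decides where to print it.

```typescript
// Hypothetical input shape: one entry per eval file, with its evals and case counts.
export interface CollectedEval {
  name: string;
  cases: number;
}
export interface CollectedFile {
  file: string;
  evals: CollectedEval[];
}

// "1 case" vs "2 cases"
const plural = (n: number, one: string, many: string): string =>
  `${n} ${n === 1 ? one : many}`;

// Build the collect-only preview text.
export function formatCollectedEvals(files: CollectedFile[]): string {
  const evalCount = files.reduce((sum, f) => sum + f.evals.length, 0);
  const caseCount = files.reduce(
    (sum, f) => sum + f.evals.reduce((s, e) => s + e.cases, 0),
    0,
  );

  const lines: string[] = [
    `📋 Found ${plural(evalCount, "evaluation", "evaluations")}:`,
    "",
  ];
  for (const f of files) {
    lines.push(`  ✓ ${f.file}`);
    for (const e of f.evals) {
      lines.push(`    • ${e.name} (${plural(e.cases, "case", "cases")})`);
    }
    lines.push(""); // blank line between files
  }
  lines.push(
    `Total: ${plural(evalCount, "evaluation", "evaluations")}, ` +
      plural(caseCount, "test case", "test cases"),
  );
  return lines.join("\n");
}
```

Usage would be along the lines of console.log(formatCollectedEvals(files)) inside the collect-only branch.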

sounds good, and i like the printer you came up with. please do it.

The user wants me to implement the --collect-only feature using Option 1 (Vitest's collect() method) with the custom printer I suggested.

Let me plan the implementation:

  1. Add --collect-only flag to the eval command in eval.command.ts
  2. Pass collectOnly option to runVitest in run-vitest.ts
  3. Modify run-vitest.ts to handle collect-only mode:
    • Skip instrumentation initialization when collectOnly is true
    • Call vi.collect() instead of vi.start()
    • Create a custom printer function to display the results
  4. Make sure to skip baseline loading when in collect-only mode

Let me start by creating the todo list and then implementing these changes.

Updated TODOs