Describe the feature or improvement you're requesting
Current State
HumanCliSolver is a CLI-based solver that replaces model inference with human input. The current implementation is clean and minimal and works as intended, but it offers little visibility into what the evaluation pipeline is actually doing.
Currently the solver injects the task description as a system message, then appends prior conversation state, as follows:
msgs += task_state.messages
It then flattens all messages into a single string, to which the CLI input prompt is appended:
"\n".join([f"{msg.role}: {msg.content}" for msg in msgs])
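To make the flattening step concrete, here is a self-contained sketch; `Message` is a minimal stand-in for the framework's message type, not the actual evals API:

```python
from dataclasses import dataclass

# Minimal stand-in for the framework's message type, used only to
# illustrate the role-prefixed flattening step.
@dataclass
class Message:
    role: str
    content: str

msgs = [
    Message("system", "Answer concisely."),
    Message("user", "What is 2 + 2?"),
]
prompt = "\n".join(f"{msg.role}: {msg.content}" for msg in msgs)
# prompt == "system: Answer concisely.\nuser: What is 2 + 2?"
```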
Finally it logs the interaction via record_sampling and returns the result:
record_sampling(prompt=prompt, sampled=answer, model="human")
What is Missing
This design hides important details that matter in evaluation contexts:
- There is no structured view of the task context vs. conversation history.
- The exact final prompt string is not clearly separated or emphasized.
- The sampling metadata (prompt length, answer length, model tag) is not visible.
For human baselines, debugging, or audit-heavy evaluation workflows, this lack of visibility reduces transparency and makes experiments harder to reason about.
Proposed Enhancement: Optional Explainability Mode
Introduce an optional flag:
HumanCliSolver(explain=True)
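A minimal sketch of how the flag might be threaded through the constructor; only the `explain` parameter is proposed here, and the real constructor signature may differ:

```python
class HumanCliSolver:
    # Hypothetical sketch: the actual class takes additional arguments;
    # only the proposed `explain` flag is shown.
    def __init__(self, explain: bool = False):
        self.explain = explain  # defaults to False, preserving current behavior
```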
When enabled, the solver prints structured, clearly separated sections before and after input. Instead of showing only the flattened prompt, display:
================ TASK CONTEXT (system) ================
<task_state.task_description>
================ MESSAGE HISTORY ================
[0] user: ...
[1] assistant: ...
[2] user: ...
================ FINAL PROMPT STRING (exact) ================
<exact prompt passed to input()>
================ AWAITING HUMAN INPUT ================
assistant (you):
After the human responds, show what is being recorded and returned:
================ SAMPLING RECORD ================
model: human
prompt_chars: 1243
answer_chars: 87
================ OUTPUT ================
raw_answer: ...
final_answer: ...
If postprocessing alters the output, display both versions to eliminate hidden transformations.
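The section banners above could be produced by a small helper along these lines (a sketch; the helper name, width, and title strings are illustrative):

```python
def banner(title: str, width: int = 56) -> str:
    # Center the title between runs of '=' so every banner has equal width.
    pad = max(width - len(title) - 2, 0)
    left = pad // 2
    return f"{'=' * left} {title} {'=' * (pad - left)}"

print(banner("TASK CONTEXT (system)"))
print(banner("MESSAGE HISTORY"))
```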
Suggested Implementation Sketch
```python
raw_answer = input(prompt)
final_answer = raw_answer

if self.explain:
    print("=== SAMPLING RECORD ===")
    print("model: human")
    print(f"prompt_chars: {len(prompt)}")
    print(f"answer_chars: {len(raw_answer)}")

record_sampling(
    prompt=prompt,
    sampled=final_answer,
    model="human",
)
return SolverResult(final_answer)
```
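For testing, the same flow can be exercised without a live terminal by injecting the input function; `record_fn` and `SolverResult` below are stand-ins for the framework's `record_sampling` and result type, not the real API:

```python
from dataclasses import dataclass

@dataclass
class SolverResult:  # stand-in for the framework's result type
    output: str

def human_solve(prompt, explain=False, input_fn=input,
                record_fn=lambda **kwargs: None):
    raw_answer = input_fn(prompt)
    final_answer = raw_answer  # no postprocessing in this sketch
    if explain:
        print("=== SAMPLING RECORD ===")
        print("model: human")
        print(f"prompt_chars: {len(prompt)}")
        print(f"answer_chars: {len(raw_answer)}")
    record_fn(prompt=prompt, sampled=final_answer, model="human")
    return SolverResult(final_answer)

result = human_solve("user: hi\nassistant (you): ",
                     explain=True, input_fn=lambda _: "hello")
# result.output == "hello"
```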
Final Comments
HumanCliSolver functions as a human baseline, a prompt debugging tool, and a sanity check within evaluation pipelines. Adding an optional explainability mode would improve transparency, reproducibility, and auditability—without changing default behavior—by making the solver self-documenting when deeper visibility is needed.
Additional context
No response