"Talk short. Drop grammar. Save token." This caveman register is widely promoted as a way to cut large-language-model inference cost, and a growing literature compresses either the prompt or the response.
We benchmark large language models under linguistic input and output compression. CAVEWOMAN is a two-channel evaluation protocol that scores every generation on task accuracy, realized per-item cost, and surface-text preservation against the model's unconstrained reference.
We evaluate seven models (Qwen2.5-7B, Qwen3.5, DeepSeek-R1-Distill, GPT-4o, GPT-5.4, Claude Sonnet 4.6, and Claude Haiku 4.5) on five reasoning datasets (GSM8K, BoolQ, ARC-Easy, CommonsenseQA, and MMLU-STEM). Both channels are measured on the same items at five matched reduction levels, for a total of 350 (model, channel, dataset, level) cells and more than 780,000 model generations.
Output compression cuts realized per-item cost by roughly 1.5x on average across four API models at the first reduction level, and the best cell is about 3x cheaper at matched accuracy. Input compression has the opposite effect: it raises cost on nearly every (model, dataset) cell, because models compensate for a shorter prompt with a longer response. Accuracy and reference-text agreement decouple in the same setting, where about half of correct answers no longer entail the unconstrained reference. The dissociation survives length-controlled rescoring, FDR correction, and replication under twelve independent semantic measures. We argue that compression should be evaluated channel by channel on realized cost, with answer-extraction rates audited before any accuracy gain is claimed.
CAVEWOMAN evaluates each (model, dataset) pair under two compression channels at five matched reduction levels. The same item is passed through one channel at a time, and the only thing that changes between channels is where the constraint sits.
Channels. Condition A (input compression) filters the user prompt through a deterministic spaCy POS rule at level Lk and uses a neutral system prompt. Condition B (output constraint) leaves the user prompt verbatim and replaces the system prompt with a level-specific register restriction. Levels L0 through L4 progress from unconstrained prose to a hard 15-token budget, and both channels share the same level definitions and the same per-level output cap.
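The two channels can be sketched as a small request builder. This is a minimal illustration under stated assumptions, not the benchmark's actual rule: the POS sets DROP_L1, KEEP_L2, and KEEP_L3, the NEUTRAL_SYSTEM wording, and the function names are hypothetical, and the filter takes pre-tagged (token, POS) pairs so that it runs without a language model installed. With spaCy available, the pairs could come from something like [(t.text, t.pos_) for t in nlp(text)]. The register prompts are the per-level output prompts shown on this page.

```python
from typing import Dict, List, Tuple

# Universal POS tags, as exposed by spaCy's token.pos_.
# These sets are illustrative assumptions, not the benchmark's actual rule.
DROP_L1 = {"DET", "ADP", "CCONJ", "SCONJ"}  # articles, prepositions, conjunctions
KEEP_L2 = {"NOUN", "PROPN", "VERB", "NUM"}  # nouns, main verbs, numbers
KEEP_L3 = {"NOUN", "PROPN", "NUM"}          # nouns and numbers, no verbs

NEUTRAL_SYSTEM = "You are a helpful assistant."  # hypothetical neutral system prompt

# Level-specific register restrictions used by Condition B.
REGISTER_PROMPTS = {
    0: "Reason step by step in full grammatical English sentences. "
       "Conclude with the final answer on its own line in the form Answer: <answer>.",
    1: "Telegraphic. No articles, conjunctions, or prepositions. Use nouns, main verbs, "
       "numbers, and standard symbols. End with Answer: <answer>.",
    2: "Keyword only. Use nouns and main verbs only. No grammar, no full sentences. "
       "Output as fragments or list items.",
    3: "Noun-phrase skeleton. No verbs of any kind. Nominal fragments, numbers, "
       "and standard symbols only.",
    4: "Hard 15-token budget. The entire response must be 15 tokens or fewer. "
       "Prefer the raw answer over prose.",
}

Tagged = List[Tuple[str, str]]  # (token text, POS tag) pairs


def compress_prompt(text: str, tagged: Tagged, level: int) -> str:
    """Condition A: filter the user prompt at level L0..L4."""
    if level == 0:
        return text  # unchanged at L0
    if level == 4:
        return " ".join(text.split()[:15])  # first 15 whitespace tokens
    if level == 1:
        kept = [w for w, pos in tagged if pos not in DROP_L1]
    elif level == 2:
        kept = [w for w, pos in tagged if pos in KEEP_L2]
    elif level == 3:
        kept = [w for w, pos in tagged if pos in KEEP_L3]
    else:
        raise ValueError(f"unknown level: {level}")
    return " ".join(kept)


def build_messages(channel: str, text: str, tagged: Tagged, level: int) -> Dict[str, str]:
    """Return the system and user prompts for one (channel, level) request."""
    if channel == "A":  # constraint sits on the input
        return {"system": NEUTRAL_SYSTEM, "user": compress_prompt(text, tagged, level)}
    if channel == "B":  # constraint sits on the output
        return {"system": REGISTER_PROMPTS[level], "user": text}
    raise ValueError(f"unknown channel: {channel}")
```

The only difference between the two branches of build_messages is where the constraint sits: Condition A changes the user prompt and keeps the system prompt neutral, while Condition B does the reverse.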
Scoring. Every generation is scored on three families of metrics. Task accuracy uses a strict answer-extraction rule, paired with a per-cell relaxed-rule audit so that wrong-format responses are not silently counted as wrong. Realized per-item cost is computed from input plus output tokens at each provider's posted list prices, rather than from prompt-token reduction alone. Reference-text preservation is measured by bidirectional NLI between each Lk response and the same item's L0 response, with twelve complementary semantic measures used as independent corroboration of the primary judge.
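The accuracy audit and the cost metric can be illustrated with a short sketch. Everything here is an assumption made for illustration: the strict rule (a line of the form "Answer: <answer>"), the relaxed fallback (last number in the response), the function names, and the per-million-token prices, which are placeholders rather than any provider's list prices.

```python
import re
from typing import Dict, List, Optional

# Assumed strict rule: a line that consists of "Answer: <answer>".
STRICT = re.compile(r"^Answer:[ \t]*(\S+)[ \t]*$", re.MULTILINE)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_strict(response: str) -> Optional[str]:
    """Return the last 'Answer: X' line's value, or None if the format is missing."""
    matches = STRICT.findall(response)
    return matches[-1].lstrip("$") if matches else None


def extract_relaxed(response: str) -> Optional[str]:
    """Assumed relaxed rule: fall back to the last number in the response."""
    strict = extract_strict(response)
    if strict is not None:
        return strict
    numbers = NUMBER.findall(response.replace(",", ""))
    return numbers[-1] if numbers else None


def audit(responses: List[str], gold: List[str]) -> Dict[str, float]:
    """Compare strict and relaxed accuracy and report the strict extraction rate."""
    n = len(responses)
    strict = [extract_strict(r) for r in responses]
    relaxed = [extract_relaxed(r) for r in responses]
    return {
        "strict_extraction_rate": sum(s is not None for s in strict) / n,
        "strict_accuracy": sum(s == g for s, g in zip(strict, gold)) / n,
        "relaxed_accuracy": sum(r == g for r, g in zip(relaxed, gold)) / n,
    }


def realized_cost(input_tokens: int, output_tokens: int,
                  input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Per-item cost in dollars from input plus output tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Placeholder prices (dollars per million tokens) and token counts, not measured values.
baseline = realized_cost(50, 120, 1.0, 4.0)
shorter_prompt_longer_reply = realized_cost(30, 200, 1.0, 4.0)
```

With these placeholder numbers the shorter prompt is the more expensive item, because the extra output tokens outweigh the saved input tokens. A large gap between strict_accuracy and relaxed_accuracy in a cell is the signal that format failures, not reasoning failures, are driving the strict score.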
Statistics. Per-cell aggregates are reported with 10,000-resample bootstrap 95% confidence intervals computed with a fixed seed. Across-cell comparisons use Benjamini-Hochberg FDR correction. Length-controlled NLI rescoring truncates the L0 reference to the Lk response length before scoring, and a threshold-ordering analysis confirms the channel ordering at NLI thresholds in {0.3, 0.4, 0.5, 0.6, 0.7}.
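The two statistical procedures named above can be sketched with the standard library alone. This is a minimal illustration, assuming a percentile bootstrap over per-item scores and the standard Benjamini-Hochberg step-up procedure; the function names, the default seed, and the choice of the percentile method are assumptions, not details confirmed by the benchmark.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(values: Sequence[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean, with a fixed seed."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    tail = round(n_resamples * alpha / 2)  # resamples cut from each tail
    return means[tail], means[n_resamples - tail - 1]


def benjamini_hochberg(pvals: Sequence[float], q: float = 0.05) -> List[bool]:
    """Return a reject flag per hypothesis, controlling the FDR at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k whose p-value is at or below (k / m) * q.
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            max_rank = rank
    # Step-up: reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            reject[i] = True
    return reject
```

For a cell with per-item correctness scores of 0 or 1, bootstrap_ci(scores) gives the interval around that cell's accuracy, and benjamini_hochberg takes one p-value per across-cell comparison. Because the procedure is step-up, a small p-value can be rejected even when it misses its own rank threshold, as long as a larger-ranked p-value meets its threshold.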
The examples below show, for each level, its input prompt, its output prompt, and an example response.

L0.
Input prompt: Janet's ducks lay 16 eggs per day. She eats three for breakfast, bakes muffins with four, and sells the rest at the farmers' market for $2 per egg. How much does she make daily?
Input rule: Unchanged at L0.
Output prompt: Reason step by step in full grammatical English sentences. Conclude with the final answer on its own line in the form Answer: <answer>.
Example response: Janet has 16 eggs each day. She uses 3 for breakfast and 4 for muffins, so 7 total. That leaves 16 - 7 = 9 eggs to sell. Each sells for $2, so 9 * 2 = 18. Answer: 18

L1.
Input prompt: Janet's ducks lay 16 eggs day. She eats three breakfast, bakes muffins four, sells rest farmers' market $2 egg. How much does she make daily?
Input rule: Function words dropped: articles, prepositions, conjunctions.
Output prompt: Telegraphic. No articles, conjunctions, or prepositions. Use nouns, main verbs, numbers, and standard symbols. End with Answer: <answer>.
Example response: Eggs 16 day. Breakfast 3. Muffins 4. Used 7. Remainder 16 - 7 = 9. Sell 9 * 2 = 18. Answer: 18

L2.
Input prompt: Janet ducks lay 16 eggs day eats three breakfast bakes muffins four sells rest farmers market $2 egg make daily
Input rule: Nouns, main verbs, and numbers only.
Output prompt: Keyword only. Use nouns and main verbs only. No grammar, no full sentences. Output as fragments or list items.
Example response: Eggs: 16 Eat: 3, bake: 4 Used: 7 Remaining: 9 Price: $2 Total: $18 Answer: 18

L3.
Input prompt: Janet ducks 16 eggs day three breakfast muffins four rest farmers market $2 egg daily
Input rule: Nouns and numbers only. No verbs.
Output prompt: Noun-phrase skeleton. No verbs of any kind. Nominal fragments, numbers, and standard symbols only.
Example response: Eggs per day: 16 Breakfast use: 3 Muffin use: 4 Total use: 7 Remainder: 9 Price per egg: $2 Daily revenue: $18 Answer: 18

L4.
Input prompt: Janet's ducks lay 16 eggs per day. She eats three for breakfast, bakes muffins with
Input rule: First 15 whitespace tokens.
Output prompt: Hard 15-token budget. The entire response must be 15 tokens or fewer. Prefer the raw answer over prose.
Example response: Answer: 18
Prompt compression and response constraint are different operations with different cost signatures, even when applied at matched reduction levels. CAVEWOMAN treats them as the two channels of a benchmark and reports realized per-item cost rather than prompt-token reduction. Output compression consistently saves money across our API panel, while input compression typically does not, because models compensate for a shorter prompt with a longer response.
Accuracy is not a sufficient stand-in for semantic preservation. Under either channel at the first reduction level, correct answers can come from responses whose surface text no longer entails the unconstrained reference, and the gap survives length-controlled rescoring and replication under twelve independent semantic measures. Reporting accuracy alone, without extraction-rate auditing, can overstate the effectiveness of a compressor.
CAVEWOMAN is released as an open benchmark with canonical prompts, per-cell summary statistics, the validation suite, paper figures, and the scripts that produced each table. The intended use is to evaluate new compressors and new models channel by channel on realized cost.
We thank MIT Engaging for providing the GPU compute used to run the local-model sweep. We also acknowledge the maintainers of the open-source tools the work depends on, including spaCy, Hugging Face Transformers, and sentence-transformers.
@inproceedings{cavewoman2026,
  title  = {CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression},
  author = {Adeyemi, Morayo Danielle and
            Rossi, Ryan and
            Dernoncourt, Franck},
  year   = {2026},
  url    = {https://github.com/danielle34/cavewoman}
}