UX magic in code editors
2025-05-02
Tool Calls vs. Parsing Custom Response Tokens
Bolt.new is blazing fast for prototyping web apps. It also feels fast because I can watch files stream over the network as part of regular chat messages. Bolt uses a special token, boltAction, parsed directly from the LLM response, which lets its web container create files synchronously as the chat response streams in. This differs from conventional tool calling, which typically requires multiple network round trips to stream a file, and it is a big part of why Bolt's token-based method feels more responsive.
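I haven't read Bolt's parser, but the mechanism can be sketched roughly like this: scan the streamed response for action tags and materialize each file as soon as its closing tag arrives. The tag attributes and the createFile callback below are assumptions for illustration, not Bolt's actual implementation.

```ts
// Rough sketch: scan a streamed LLM response for <boltAction> blocks and
// create files as they complete. Attribute names and the createFile helper
// are illustrative assumptions, not Bolt's real code.
const ACTION_RE =
  /<boltAction type="file" filePath="([^"]+)">([\s\S]*?)<\/boltAction>/g;

async function streamAndApply(
  chunks: AsyncIterable<string>,
  createFile: (path: string, content: string) => Promise<void>,
) {
  let buffer = "";
  for await (const chunk of chunks) {
    buffer += chunk; // the chat UI can render this text immediately
    let match: RegExpExecArray | null;
    while ((match = ACTION_RE.exec(buffer)) !== null) {
      const [whole, filePath, content] = match;
      await createFile(filePath, content); // the file appears mid-stream
      buffer = buffer.replace(whole, ""); // avoid re-applying the same action
      ACTION_RE.lastIndex = 0;
    }
  }
}
```

Because the file write happens inside the same streaming loop that renders the chat message, there is no extra request between "the model described the file" and "the file exists."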
This technique isn't unique to Bolt. Although I haven't confirmed it yet, I'm fairly certain Cursor employs a similar approach to render code blocks directly from LLM responses. However, Cursor separates this "rendering" from its "Apply" feature, which uses another model to actually update files. This has the benefit of saving tokens on file content that has not changed.
Both editors parse code blocks from LLM responses, but they differ in how they handle file updates: one updates files directly inside a <boltAction> tag, while the other delegates updates to a separate tool, each choosing trade-offs suited to its UX goals.
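For contrast, here is roughly what the conventional tool-calling route looks like: the model emits a structured call, the client executes it, and the result goes back in a follow-up request. The write_file tool below is a made-up example, not any editor's real schema.

```ts
// Sketch of the conventional alternative: declare a file-writing tool and
// let the model call it. The write_file schema here is illustrative only.
const tools = [
  {
    type: "function" as const,
    function: {
      name: "write_file",
      description: "Create or overwrite a file in the workspace",
      parameters: {
        type: "object",
        properties: {
          filePath: { type: "string" },
          content: { type: "string" },
        },
        required: ["filePath", "content"],
      },
    },
  },
];
```

Each file edit becomes a structured tool call plus a follow-up request carrying the tool result, which is where the extra latency comes from compared with parsing actions straight out of the response stream.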
Prompt Design and Token Management
Cursor has open-sourced Priompt, addressing a problem I've found increasingly important as AI applications become more stateful, especially with agents and protocols like MCP.
Before diving into Priompt, let me outline the pain points it addresses—issues I've encountered personally and experienced in other apps:
- Token Budget Management: Crafting prompts isn't trivial, especially when components like tools, system messages, memory, history, and user messages are user-driven. Chat apps like SillyTavern, already impressive in managing token memory, handle token budgets imperatively, and I've used similar imperative approaches myself (a minimal sketch of that style follows this list).
- Deciding What to Trim: When your chat inevitably approaches the token budget, how do you decide what to remove? Naively deleting the oldest messages until you meet the budget is straightforward but suboptimal. What if you could specify a priority for each prompt component, clearly defining how they should be trimmed to meet your budget?
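Here is that imperative sketch: drop the oldest non-system messages until the prompt fits. The Message shape and the countTokens helper are placeholders for whatever your app actually uses (e.g. a tiktoken wrapper), not any particular library's API.

```ts
// Minimal sketch of the imperative style: delete the oldest chat messages
// until the prompt fits the budget. countTokens is a stand-in tokenizer.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

declare function countTokens(text: string): number;

function trimToBudget(history: Message[], budget: number): Message[] {
  const trimmed = [...history];
  const total = () =>
    trimmed.reduce((sum, m) => sum + countTokens(m.content), 0);
  // Naive policy: drop the oldest non-system message first.
  while (total() > budget && trimmed.length > 1) {
    const idx = trimmed.findIndex((m) => m.role !== "system");
    if (idx === -1) break;
    trimmed.splice(idx, 1);
  }
  return trimmed;
}
```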
Priompt leverages JSX, which is XML-like, allowing us to reuse the familiar "rendering" metaphor to declare what the final prompt should look like given a token budget. I can't write this better than the original thread from 2024, which I discovered a year later as I began experiencing this problem firsthand. I believe this declarative approach is foundational for crafting prompts that serialize app state for use cases like sentence auto-completion or file editing, each of which benefits from different context windows and performance characteristics.
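The sketch below is loosely adapted from the examples in Priompt's README; I'm writing the import path and prop names (p, prel) from memory, so treat them as assumptions rather than the library's exact API. The idea is that each part of the prompt declares its priority, and the renderer keeps the highest-priority pieces that fit the budget.

```tsx
// Loosely adapted from Priompt's README; import path and prop names are
// from memory and should be treated as assumptions.
import { render, SystemMessage, UserMessage } from "priompt";

function ChatPrompt(props: { history: string[]; question: string }) {
  return (
    <>
      <SystemMessage p={1000}>You are a helpful coding assistant.</SystemMessage>
      {props.history.map((msg, i) => (
        // Older messages get lower relative priority, so they trim first.
        <scope prel={i - props.history.length}>
          <UserMessage>{msg}</UserMessage>
        </scope>
      ))}
      <UserMessage p={900}>{props.question}</UserMessage>
    </>
  );
}

async function buildPrompt(history: string[], question: string) {
  return render(ChatPrompt({ history, question }), {
    tokenLimit: 4096, // swap in a different budget per model
    tokenizer: "cl100k_base",
  });
}
```

The trimming policy now lives in the template itself rather than in ad-hoc loops scattered around the app, which is exactly the shift from imperative to declarative that the list above is getting at.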
The ability to define prompt templates that adapt to various token budgets also gives you more knobs for evals and makes it possible to run the same use cases across models with different context windows, significantly improving how effectively we serialize app state into LLM calls.
I look forward to seeing these techniques applied beyond code editors, so that users like me can focus on describing intent when chatting with an AI instead of manually "serializing" app state into plain English and hoping for optimal results.