In standard LLM interactions, the ChatModel operates synchronously; the application sends a prompt and waits until the entire response is generated before receiving any data. This can lead to a sluggish user experience, especially with long responses.
The StreamingChatModel addresses this by streaming the response. It sends back fragments of the response (tokens) as they are generated by the model. This allows the application to display progress to the user in real time.
StreamingChatModel vs ChatModel
- Responsiveness: Streaming provides immediate feedback, whereas synchronous models have high latency before the response appears.
- Handling: ChatModel returns a ChatResponse object directly, whereas StreamingChatModel requires a StreamingChatResponseHandler to receive callbacks such as onPartialResponse and onCompleteResponse (a blocking sketch follows this list for comparison).
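For contrast, here is a minimal blocking sketch. The class name BlockingChatExample is illustrative, and it assumes the same local Ollama model used in the streaming example later in this article; nothing can be shown to the user until chat(...) returns the complete ChatResponse.
package com.logicbig.example;

import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.request.ChatRequest;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.ollama.OllamaChatModel;

public class BlockingChatExample {

    public static void main(String[] args) {
        // Synchronous model: the call below returns only after the whole response is generated
        ChatModel model = OllamaChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("phi3:mini-128k")
                .build();

        ChatResponse response = model.chat(ChatRequest.builder()
                .messages(UserMessage.from("Write a very short poem about Java concurrency."))
                .build());

        // Nothing could be displayed to the user before this point
        System.out.println(response.aiMessage().text());
    }
}
With a long answer, the user stares at a blank console for the entire generation time; the streaming variant shown below removes that wait.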
Java source
Definition of StreamingChatModel (Version: 1.10.0)
package dev.langchain4j.model.chat;
public interface StreamingChatModel {
    default void chat(ChatRequest chatRequest, StreamingChatResponseHandler handler);
    default void doChat(ChatRequest chatRequest, StreamingChatResponseHandler handler);
    default ChatRequestParameters defaultRequestParameters();
    default List<ChatModelListener> listeners();
    default ModelProvider provider();
    default void chat(String userMessage, StreamingChatResponseHandler handler);
    default void chat(List<ChatMessage> messages, StreamingChatResponseHandler handler);
    default Set<Capability> supportedCapabilities();
}
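The chat(ChatRequest, handler) variant above accepts a fully built request instead of a bare string. The following is a minimal sketch of that overload; the class name ChatRequestOverloadSketch is illustrative, and the model setup mirrors the complete example later in this article.
package com.logicbig.example;

import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.StreamingChatModel;
import dev.langchain4j.model.chat.request.ChatRequest;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import dev.langchain4j.model.ollama.OllamaStreamingChatModel;

import java.util.concurrent.CountDownLatch;

public class ChatRequestOverloadSketch {

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);

        StreamingChatModel model = OllamaStreamingChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("phi3:mini-128k")
                .build();

        // Build the request explicitly instead of passing a plain String
        ChatRequest request = ChatRequest.builder()
                .messages(UserMessage.from("Name two benefits of streaming LLM responses."))
                .build();

        model.chat(request, new StreamingChatResponseHandler() {
            @Override
            public void onPartialResponse(String partialResponse) {
                System.out.print(partialResponse);
            }

            @Override
            public void onCompleteResponse(ChatResponse completeResponse) {
                System.out.println("\n[finish reason: " + completeResponse.finishReason() + "]");
                done.countDown();
            }

            @Override
            public void onError(Throwable error) {
                error.printStackTrace();
                done.countDown();
            }
        });

        done.await(); // keep the JVM alive until streaming finishes
    }
}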
Definition of StreamingChatResponseHandler (Version: 1.10.0)
package dev.langchain4j.model.chat.response;
public interface StreamingChatResponseHandler {
    default void onPartialResponse(String partialResponse);
    default void onPartialResponse(PartialResponse partialResponse, PartialResponseContext context);
    default void onPartialThinking(PartialThinking partialThinking);
    default void onPartialThinking(PartialThinking partialThinking, PartialThinkingContext context);
    default void onPartialToolCall(PartialToolCall partialToolCall);
    default void onPartialToolCall(PartialToolCall partialToolCall, PartialToolCallContext context);
    default void onCompleteToolCall(CompleteToolCall completeToolCall);
    void onCompleteResponse(ChatResponse completeResponse);
    void onError(Throwable error);
}
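Note that only onCompleteResponse and onError are abstract; the partial callbacks are declared default. As a minimal sketch, the hypothetical AccumulatingHandler below collects the streamed fragments and, once the stream completes, reports the final message and the token usage (where the provider supplies it).
package com.logicbig.example;

import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;

// Hypothetical helper: accumulates partial responses as they arrive
public class AccumulatingHandler implements StreamingChatResponseHandler {
    private final StringBuilder buffer = new StringBuilder();

    @Override
    public void onPartialResponse(String partialResponse) {
        buffer.append(partialResponse); // called once per streamed fragment
    }

    @Override
    public void onCompleteResponse(ChatResponse completeResponse) {
        // The complete response carries the final AiMessage plus metadata
        System.out.println("Accumulated length: " + buffer.length());
        System.out.println("Final message     : " + completeResponse.aiMessage().text());
        System.out.println("Token usage       : " + completeResponse.tokenUsage());
    }

    @Override
    public void onError(Throwable error) {
        error.printStackTrace();
    }
}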
Use cases
- Improving Perceived Latency: Reducing "Time to First Token" by displaying text to the user as it is generated rather than waiting for the entire block.
- Real-time UI Delivery: Pushing updates to web or mobile frontends via Server-Sent Events (SSE) or WebSockets for a "typing" effect.
- Early Content Moderation: Analyzing the incoming stream for "stop words" or prohibited content and cancelling the request immediately if a violation is detected.
- Cost Optimization: Manually triggering a cancel() if the user interrupts the generation or if the logic determines the model has already provided the necessary answer.
- Progressive Parsing: Starting to render UI components (like tables or markdown headers) or trigger backend logic as soon as specific markers appear in the stream.
- Performance Tracking: Capturing granular metrics such as Tokens Per Second and the exact timestamp of the first response byte for system monitoring (see the sketch after this list).
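As a concrete illustration of the Performance Tracking item above, the sketch below records time to first token and a rough partial-responses-per-second rate. TimingHandler, its fields, and the printed labels are hypothetical; only the StreamingChatResponseHandler callbacks come from LangChain4j.
package com.logicbig.example;

import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;

// Hypothetical handler that tracks streaming performance metrics
public class TimingHandler implements StreamingChatResponseHandler {
    private final long startNanos = System.nanoTime();
    private long firstTokenNanos = -1;
    private int partialCount;

    @Override
    public void onPartialResponse(String partialResponse) {
        if (firstTokenNanos < 0) {
            firstTokenNanos = System.nanoTime();
            // Time to first token, measured from when this handler was constructed
            System.out.printf("Time to first token: %d ms%n",
                    (firstTokenNanos - startNanos) / 1_000_000);
        }
        partialCount++;
    }

    @Override
    public void onCompleteResponse(ChatResponse completeResponse) {
        if (partialCount > 0) {
            double seconds = (System.nanoTime() - firstTokenNanos) / 1_000_000_000.0;
            // Rough throughput: partial responses approximate, but are not exactly, model tokens
            System.out.printf("%d partial responses (~%.1f per second)%n",
                    partialCount, partialCount / Math.max(seconds, 0.001));
        }
    }

    @Override
    public void onError(Throwable error) {
        error.printStackTrace();
    }
}
An instance of such a handler can be passed to any StreamingChatModel.chat(...) call, either on its own or delegating to the application's real handler.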
Example
In the following example, we use Ollama with the phi3:mini-128k model. We implement StreamingChatResponseHandler to print tokens to the console as they arrive.
package com.logicbig.example;

import dev.langchain4j.model.chat.StreamingChatModel;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import dev.langchain4j.model.ollama.OllamaStreamingChatModel;

import java.util.concurrent.CountDownLatch;

public class StreamingChatExample {

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);

        StreamingChatModel model = OllamaStreamingChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("phi3:mini-128k")
                .numCtx(4096)
                .temperature(0.7)
                .build();

        System.out.println("Starting stream...\n");

        model.chat("Write a very short poem about Java concurrency.",
                new StreamingChatResponseHandler() {
                    @Override
                    public void onPartialResponse(String token) {
                        // This is called every time a new token is generated
                        System.out.print(token);
                    }

                    @Override
                    public void onCompleteResponse(ChatResponse response) {
                        System.out.println("\n\nDone!");
                        done.countDown();
                    }

                    @Override
                    public void onError(Throwable error) {
                        error.printStackTrace();
                        done.countDown();
                    }
                });

        // Keeping the main thread alive for the async response
        done.await();
    }
}
Output
Starting stream...

In the realm of threads that dance, Java's Concurrency takes its chance.
Synchronized blocks guard their space, Deadlock avoidance sets pace.
Executors handle tasks with ease, Parallelism to seize and please.
Atomic ops prevent data loss, Thread safety holds us as our boss.
Concurrency in Java' endless quest, To optimize processing at its best.
Done!
A screen recording captured in IntelliJ (not reproduced here) shows the token-by-token streaming in real time.
Conclusion
By observing the console output, you will notice that the poem appears word-by-word (or token-by-token) rather than appearing all at once after a long delay. The "Done!" message printed within the onCompleteResponse method confirms that the streaming process finished successfully. This approach is essential for building modern, interactive AI chat interfaces where low perceived latency is critical.
Example Project
Dependencies and Technologies Used:
- langchain4j 1.10.0 (Build LLM-powered applications in Java: chatbots, agents, RAG, and much more)
- langchain4j-ollama 1.10.0 (LangChain4j :: Integration :: Ollama)
- JDK 17
- Maven 3.9.11