When working with StreamingChatModel, there are scenarios where you might want to stop the LLM from generating further text before it finishes normally. This could be due to security filters, length constraints, or detecting specific keywords in the output.
As we saw in the last tutorial, we need to provide an implementation of StreamingChatResponseHandler. The following method of the interface allows you to cancel an LLM request before it completes:
default void onPartialResponse(PartialResponse partialResponse,
                               PartialResponseContext context) {}
PartialResponseContext provides access to a StreamingHandle via the following method:
public StreamingHandle streamingHandle()
The StreamingHandle interface includes a cancel() method, which can be used to terminate the request immediately:
void cancel()
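Putting these pieces together, cancellation from inside the handler is a single call chain on the context parameter (shown here in isolation; a complete handler appears in the example below):

// Inside onPartialResponse: obtain the handle from the callback
// context and terminate the in-flight request.
context.streamingHandle().cancel();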
Use Cases
- Content Moderation: Terminating the stream if the model begins generating restricted content.
- Early Exit: Stopping a search or list generation once a specific item is found.
- Resource Management: Reducing token usage and cost by stopping unnecessary output (see the sketch after this list).
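For instance, a length constraint can be enforced by accumulating the size of the partial responses and cancelling once a budget is exceeded. The following is a minimal sketch built only from the interface methods shown above; BudgetedHandler and MAX_CHARS are illustrative names, and the 200-character budget is arbitrary:

import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.chat.response.PartialResponse;
import dev.langchain4j.model.chat.response.PartialResponseContext;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;

// A handler that enforces a rough output budget: it accumulates the
// length of the partial responses received so far and cancels the
// stream once the limit is exceeded.
public class BudgetedHandler implements StreamingChatResponseHandler {

    private static final int MAX_CHARS = 200; // arbitrary budget for this sketch
    private int received = 0;

    @Override
    public void onPartialResponse(PartialResponse partialResponse,
                                  PartialResponseContext context) {
        received += partialResponse.text().length();
        if (received > MAX_CHARS) {
            context.streamingHandle().cancel(); // stop paying for further tokens
        }
    }

    @Override
    public void onCompleteResponse(ChatResponse response) {
        // nothing to do in this sketch
    }

    @Override
    public void onError(Throwable error) {
        // a cancelled stream may surface here, depending on the provider
    }
}

An instance of this handler can be passed to model.chat(...) in place of the anonymous handler used in the full example below.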
Example
In the following example, we ask the model to provide prime numbers. We monitor the incoming tokens in the onPartialResponse method and invoke cancel() as soon as a specific number is detected.
package com.logicbig.example;

import dev.langchain4j.model.chat.StreamingChatModel;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.chat.response.PartialResponse;
import dev.langchain4j.model.chat.response.PartialResponseContext;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import dev.langchain4j.model.ollama.OllamaStreamingChatModel;

import java.util.concurrent.CountDownLatch;

public class StreamingCancelExample {

    public static void main(String[] args) throws InterruptedException {
        // Keeps the main thread alive until the stream is cancelled,
        // completes, or fails.
        CountDownLatch latch = new CountDownLatch(1);

        StreamingChatModel model =
                OllamaStreamingChatModel.builder()
                                        .baseUrl("http://localhost:11434")
                                        .modelName("phi3:mini-128k")
                                        .numCtx(4096)
                                        .temperature(0.7)
                                        .build();

        System.out.println("Streaming started...");
        model.chat("What are the prime numbers between 1 and 13? Only return numbers.",
                new StreamingChatResponseHandler() {
                    @Override
                    public void onPartialResponse(PartialResponse partialResponse,
                                                  PartialResponseContext context) {
                        String text = partialResponse.text();
                        System.out.print(text);
                        // Cancel the request as soon as the token "7" appears.
                        if (text.contains("7")) {
                            System.out.println("\n[Condition met. Cancelling...]");
                            context.streamingHandle().cancel();
                            latch.countDown();
                        }
                    }

                    @Override
                    public void onCompleteResponse(ChatResponse response) {
                        latch.countDown();
                    }

                    @Override
                    public void onError(Throwable error) {
                        System.out.println("\nStream stopped.");
                        latch.countDown();
                    }
                });
        latch.await();
    }
}
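Note that latch.countDown() is invoked on every exit path: after cancelling, on normal completion, and on error. This guarantees that latch.await() in main() never blocks indefinitely, however the stream ends.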
Output
Streaming started...
The prime numbers between 1 and 13 are: 2, 3, 5, 7
[Condition met. Cancelling...]
Conclusion
By utilizing the StreamingHandle within onPartialResponse, you can proactively terminate an LLM request once specific conditions are met. This pattern keeps your LangChain4j integration efficient, saving resources and reducing latency by calling cancel() as soon as the required information has been received or a specific condition is detected.
Example Project
Dependencies and Technologies Used:
- langchain4j 1.10.0 (Build LLM-powered applications in Java: chatbots, agents, RAG, and much more)
- langchain4j-ollama 1.10.0 (LangChain4j :: Integration :: Ollama)
- JDK 17
- Maven 3.9.11