Large Model Streaming Call Specification (SSE)

With the widespread adoption of large language models, calling them efficiently has become a key concern. When a large model generates a long piece of text, the traditional request-response pattern suffers from high response latency and a poor user experience. Streaming is an important way to address this problem.

This article introduces a streaming call specification for large models based on the Server-Sent Events (SSE) protocol, and gives server-side and client-side call examples built with Spring Boot.

1. Why choose SSE?

When you chat with a large model, the model usually generates content word by word. With a traditional HTTP request, the server has to wait until the model has generated all of the content before responding to the client, which results in high latency. With SSE, the server can push each piece of content as soon as it is generated, which greatly improves interactivity and the user experience.

Advantages of SSE:

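One-way connection: the server pushes data actively, and the client receives it automatically;
Built on plain HTTP, with native browser support (note that the browser's built-in EventSource API only issues GET requests, so a POST interface is usually consumed with fetch or a client library);
Simple to implement and well suited to streaming text output.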

2. Streaming call interface specification (based on SSE)

Request method

Method: POST
Content-Type: application/json
Accept: text/event-stream

Request Example
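The request example itself is not preserved here. As a minimal sketch, assuming the body carries the prompt and stream fields that the client example in section 4 sends, the request could be built with the JDK's java.net.http API as follows. The class name StreamRequestExample and the URL are illustrative assumptions, not part of the specification.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class StreamRequestExample {

    // Assumed body: the "prompt" and "stream" fields used by the client example in section 4.
    static final String BODY =
            "{\"prompt\": \"Introduce the Romance of the Three Kingdoms\", \"stream\": true}";

    /** Builds (but does not send) a POST request with the headers listed above. */
    static HttpRequest build(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .header("Accept", "text/event-stream")
                .POST(HttpRequest.BodyPublishers.ofString(BODY))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://localhost:8080/chat/stream");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

To actually send it, the request can be passed to java.net.http.HttpClient with a line-oriented body handler such as HttpResponse.BodyHandlers.ofLines(), so that lines are consumed as they arrive.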

Response Format (SSE Stream)

Each line starts with "data: " followed by a JSON string;
The last line, "data: [DONE]", indicates that the stream has ended;
The client parses the "content" field of each received message and displays it.
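The framing rules above can be sketched in plain Java. This is a simplified illustration rather than a full SSE parser: it assumes every event fits on a single data line, as the format above describes, and it leaves decoding of the JSON payload (for example, extracting the content field) to a JSON library. The class name SseDataReader is hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class SseDataReader {

    private static final String PREFIX = "data:";
    private static final String DONE = "[DONE]";

    /** Returns the payload of each "data:" line, stopping at the [DONE] marker. */
    static List<String> readPayloads(Iterator<String> lines) {
        List<String> payloads = new ArrayList<>();
        while (lines.hasNext()) {
            String line = lines.next();
            if (!line.startsWith(PREFIX)) {
                continue; // skip blank separator lines and any non-data fields
            }
            String payload = line.substring(PREFIX.length());
            if (payload.startsWith(" ")) {
                payload = payload.substring(1); // SSE allows one optional space after the colon
            }
            if (DONE.equals(payload)) {
                break; // end-of-stream marker defined by this specification
            }
            payloads.add(payload);
        }
        return payloads;
    }

    public static void main(String[] args) {
        // Sample stream in the format described above
        String stream = "data: {\"id\":\"msg_001\",\"content\":\"Hello\"}\n\n"
                + "data: {\"id\":\"msg_001\",\"content\":\" world\"}\n\n"
                + "data: [DONE]\n\n";
        readPayloads(stream.lines().iterator()).forEach(System.out::println);
    }
}
```

Taking an Iterator of lines keeps the sketch independent of the transport: a BufferedReader over the response body can supply one through reader.lines().iterator().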

3. Spring Boot server example

Below is an example implementation of an SSE streaming interface based on Spring Boot.

1. Controller layer

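import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.servlet.mvc.method.annotation.SseEmitter;

@RestController
@RequestMapping("/chat")
public class ChatController {

    // Assumption: a worker pool runs the generation so the request thread returns immediately
    private final ExecutorService executor = Executors.newCachedThreadPool();
    private final ObjectMapper objectMapper = new ObjectMapper();

    @PostMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public SseEmitter streamChat(@RequestBody ChatRequest request) {
        SseEmitter emitter = new SseEmitter(0L); // No timeout

        executor.execute(() -> {
            try {
                // Simulate a response generated sentence by sentence
                // (getPrompt() is assumed to be the getter on ChatRequest)
                List<String> responses = mockModelResponse(request.getPrompt());

                for (String sentence : responses) {
                    Map<String, String> data = new HashMap<>();
                    data.put("id", "msg_001");
                    data.put("content", sentence);

                    // SseEmitter adds the "data:" prefix and the blank line itself
                    emitter.send(SseEmitter.event()
                            .data(objectMapper.writeValueAsString(data)));

                    Thread.sleep(500); // Simulate generation delay
                }

                // Send only the marker; a literal "data: " prefix here would be duplicated
                emitter.send(SseEmitter.event().data("[DONE]"));
                emitter.complete();
            } catch (Exception e) {
                emitter.completeWithError(e);
            }
        });

        return emitter;
    }

    private List<String> mockModelResponse(String prompt) {
        return List.of(
                "Romance of the Three Kingdoms is one of the four great classics in ancient China.",
                "It tells the story of the heroes' separatist rule in the late Eastern Han Dynasty.",
                "The main characters include Liu Bei, Guan Yu, Zhang Fei, Cao Cao, Sun Quan, etc."
        );
    }
}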

2. Request class definition
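The class definition is not preserved here. As a hedged sketch, a request class compatible with the controller's ChatRequest parameter might look like the following. The two fields are inferred from the prompt and stream keys sent by the client example in section 4; the original class may have contained other fields.

```java
/** Hypothetical request body; field names are inferred from the client example, not from the original class. */
class ChatRequest {

    private String prompt;   // text sent to the model
    private boolean stream;  // whether the caller wants a streamed response

    // No-argument constructor so a JSON mapper such as Jackson can instantiate the class
    ChatRequest() {
    }

    public String getPrompt() {
        return prompt;
    }

    public void setPrompt(String prompt) {
        this.prompt = prompt;
    }

    public boolean isStream() {
        return stream;
    }

    public void setStream(boolean stream) {
        this.stream = stream;
    }
}
```

In a real project the class would normally be declared public in its own file, in a package visible to the controller.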

4. Client call example (Java)

Client streaming with Spring WebFlux:

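import java.util.Map;

import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;

WebClient client = WebClient.create();

client.post()
    .uri("http://localhost:8080/chat/stream")
    .header(HttpHeaders.ACCEPT, MediaType.TEXT_EVENT_STREAM_VALUE)
    .bodyValue(Map.of("prompt", "Introduce the Romance of the Three Kingdoms", "stream", true))
    .retrieve()
    .bodyToFlux(String.class)      // each element is the data payload of one SSE event
    .doOnNext(System.out::println) // prints every payload, including the final [DONE]
    .blockLast();                  // block until the stream completes (demo only)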

5. Summary and Suggestions

SSE-based streaming calls to large models can significantly reduce the time users wait for the first output and improve the user experience. Keep the following points in mind when using them:

SSE is suited to text output. For audio, images, and other such content, WebSocket is recommended;
On the server side, handle exceptions and release resources properly;
Clients need to process and concatenate the received fragments in real time.