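I originally wanted to try the OpenAI Realtime API for speech-to-speech conversation, but it turns out the realtime interface also works text-to-text, so I tried implementing that in Emacs first: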
(defun chunyang-llm-websocket ()
  "Chat with the OpenAI Realtime API (text only) over a WebSocket."
  (interactive)
  (require 'websocket)
  ;; (setq websocket-debug t)
  (websocket-open
   "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17"
   :custom-header-alist
   `(("Authorization" . ,(format "Bearer %s" (getenv "OPENAI_API_KEY")))
     ("OpenAI-Beta" . "realtime=v1"))
   :on-message
   (lambda (ws frame)
     ;; Parse the frame as a JSON event and dispatch on its type.
     (let-alist (json-parse-string
                 (websocket-frame-text frame)
                 :object-type 'alist)
       (pcase .type
         ;; 1. Ask a question (earlier turns are included as context automatically)
         ((or "session.created" "response.done")
          (let ((prompt (read-string "Prompt (Empty to quit): ")))
            (if (not (string= prompt ""))
                (websocket-send-text
                 ws
                 (json-serialize
                  `((type . "conversation.item.create")
                    (item (type . "message")
                          (role . "user")
                          (content . [((type . "input_text") (text . ,prompt))])))))
              (websocket-close ws))))
         ;; 2. Request a response
         ("conversation.item.created"
          (websocket-send-text
           ws
           (json-serialize '((type . "response.create") (response (modalities . ["text"]))))))
         ;; 3. Print the result
         ("response.text.delta"
          (with-current-buffer (get-buffer-create "*chunyang-llm-websocket*")
            (display-buffer (current-buffer))
            (goto-char (point-max))
            (insert .delta))))))))
The WebSocket interface remembers the earlier chat history automatically.
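One way to see that memory at work: the payload sent on each turn carries only the newest prompt, never the earlier turns. A minimal sketch, assuming Emacs 27+ with native JSON support; `my-realtime-user-message` is a hypothetical helper name, not part of the code above or of any library:

```elisp
;; Hypothetical helper: build the JSON event for a single user turn.
;; Only the latest prompt is serialized; no earlier messages are attached,
;; since the conversation so far is kept on the server side of the session.
(defun my-realtime-user-message (prompt)
  "Return a `conversation.item.create' event for PROMPT as a JSON string."
  (json-serialize
   `((type . "conversation.item.create")
     (item . ((type . "message")
              (role . "user")
              (content . [((type . "input_text") (text . ,prompt))]))))))
```

Evaluating `(my-realtime-user-message "hello")` gives the string that would be passed to `websocket-send-text`, which makes it easy to inspect what actually goes over the wire.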
As for streaming output, I previously used plz-event-source to handle HTTP SSE; judging by the code, WebSocket now feels a bit simpler.
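Part of the simplicity, as far as I can tell, is that the handler above treats each text frame as one complete JSON event, so there are no `data:` lines to reassemble as with SSE. An illustrative sketch of just the streaming step (the function name `my-realtime-event-delta` is made up, and it assumes the frame text has already been extracted with `websocket-frame-text`):

```elisp
(require 'let-alist)

;; Hypothetical helper: pull the streamed text out of one event.
;; JSON-TEXT is the raw text of a single frame.
(defun my-realtime-event-delta (json-text)
  "Return the text delta in JSON-TEXT, or nil for any other event type."
  (let-alist (json-parse-string json-text :object-type 'alist)
    (when (equal .type "response.text.delta")
      .delta)))
```

A caller would insert any non-nil result at the end of the output buffer and ignore everything else.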
gpt-4o-realtime costs twice as much as gpt-4o.
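The coolest part of the OpenAI Realtime API is that you can interrupt the other side at any time, like on a real phone call, and it detects automatically when you have finished speaking. I'll try that when I get the chance; it is probably easier to implement with WebRTC in the browser.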