Trying out voice conversations with GPT in Emacs

I spent some time wiring voice-based GPT access into Emacs: ask a question by speaking, then have the answer read aloud. The approach:

  1. Record audio with ffmpeg
  2. Transcribe the recording with gpt-4o-transcribe
  3. Answer with gpt-4o
  4. Convert the answer text to speech with gpt-4o-mini-tts
  5. Play the audio with ffplay

The full code is below. I have tested it on a Mac; on other platforms you will probably need to change -f avfoundation.

(defun chunyang-llm-ask ()
  "Ask GPT by voice.

1. Record audio with ffmpeg
2. Transcribe the recording with gpt-4o-transcribe
3. Answer with gpt-4o
4. Convert the answer text to speech with gpt-4o-mini-tts
5. Play the audio with ffplay"
  (interactive)
  (let ((question (chunyang-llm--audio-to-text
		   (chunyang-llm--record-audio))))
    (message "You: %s" question)
    (let ((answer (chunyang-llm--responses "gpt-4o" question)))
      (message "GPT: %s" answer)
      (chunyang-llm--play-audio (chunyang-llm--text-to-audio answer)))))

(defun chunyang-llm--responses (model input &optional instructions)
  (let ((data (plz 'post
		"https://api.openai.com/v1/responses"
		;; "http://localhost:4444"
		:headers `(("Authorization" . ,(format "Bearer %s" (chunyang-llm--openai-token)))
			   ("Content-Type" . "application/json"))
		:body (json-encode
		       `((model . ,model)
			 ,@(and instructions (list (cons 'instructions instructions)))
			 (input . ,input)))
		:as #'json-read
		:connect-timeout 15)))
    ;; The answer text lives at output[0].content[0].text in the response
    (alist-get 'text (aref (alist-get 'content (aref (alist-get 'output data) 0)) 0))))

(defun chunyang-llm--openai-token ()
  (auth-source-pick-first-password :host "api.openai.com" :user "apikey"))
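
The token lookup above goes through auth-source, so it assumes a matching secret is already on file, e.g. a line like the following in ~/.authinfo or ~/.authinfo.gpg (sk-... is a placeholder for your actual key):

```
machine api.openai.com login apikey password sk-...
```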

(defun chunyang-llm--record-audio ()
  (let* ((output-file (concat (make-temp-file "audio-") ".wav"))
	 (process
	  (start-process "ffmpeg" (generate-new-buffer " *ffmpeg*")
			 "ffmpeg"
			 "-f" "avfoundation"
			 "-i" ":0"
			 "-y"
			 output-file)))
    (read-key "Recording audio. Press any key to finish.")
    ;; SIGINT (rather than `kill-process''s SIGKILL) lets ffmpeg
    ;; finish writing the WAV file cleanly
    (interrupt-process process)
    (sit-for .1)
    output-file))
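
The -f avfoundation / -i ":0" pair is macOS-specific. On Linux, a sketch of the same recorder might capture from PulseAudio instead (an untested assumption on my part; device names vary by setup):

```
(let ((output-file (concat (make-temp-file "audio-") ".wav")))
  (start-process "ffmpeg" (generate-new-buffer " *ffmpeg*")
		 "ffmpeg"
		 "-f" "pulse"   ; ALSA users: "-f" "alsa"
		 "-i" "default"
		 "-y"
		 output-file))
```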

(defun chunyang-llm--audio-to-text (audio-file)
  (with-temp-buffer
    (call-process
     "curl" nil t nil
     "https://api.openai.com/v1/audio/transcriptions"
     "-s"
     "--fail"
     "-H" (format "Authorization: Bearer %s" (chunyang-llm--openai-token))
     "-H" "Content-Type: multipart/form-data"
     "-F" (format "file=@%s" audio-file)
     "-F" "model=gpt-4o-transcribe")
    ;; {"text":"Hello, this is it."}
    (gethash "text" (json-parse-string (buffer-string)))))

(defun chunyang-llm--text-to-audio (string)
  ;; `make-temp-file' creates the file it names; delete that file and
  ;; use its ".wav" sibling, which must not exist before plz writes it.
  (let* ((temp-file (make-temp-file "audio-"))
	 (output-file (concat temp-file ".wav")))
    (delete-file temp-file)
    (plz 'post "https://api.openai.com/v1/audio/speech"
      :headers `(("Authorization" . ,(format "Bearer %s" (chunyang-llm--openai-token)))
		 ("Content-Type" . "application/json"))
      :body (json-serialize
	     `((model . "gpt-4o-mini-tts")
	       (input . ,string)
	       (voice . "alloy")
	       (response_format . "wav")))
      :as `(file ,output-file))
    output-file))

(defun chunyang-llm--play-audio (audio-file)
  (start-process "ffplay" (generate-new-buffer " *ffplay*")
		 "ffplay" "-nodisp" "-autoexit" audio-file))
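
If you would rather not depend on ffplay, macOS ships afplay, which could stand in as the player; a variant of the function above (my suggestion, not part of the original setup):

```
(defun chunyang-llm--play-audio (audio-file)
  (start-process "afplay" (generate-new-buffer " *afplay*")
		 "afplay" audio-file))
```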

How is the response time?

On macOS you need to grant Emacs microphone access and allow ffmpeg to use the input device; see MacOS Configuration · natrys/whisper.el Wiki · GitHub for reference.

Fairly slow: a single question amounts to three requests, none of which supports streaming. Especially when the answer is long, you have to wait for everything to finish before playback starts, although you can watch the text appear in the minibuffer.

I ran into this at first too: Emacs never showed the permission prompt. I searched around without finding a fix, and switching to emacs-plus finally solved it:

brew uninstall --cask emacs

brew tap d12frosted/emacs-plus
# Third-party formula with no prebuilt bottle; it compiles locally, which is slow
brew install emacs-plus

The approach above is fairly general and works with all kinds of models. A more direct route is a model with native audio support, such as gpt-4o-audio-preview, which handles everything in one call. I haven't tried it yet, though; I'm a little worried about the cost, since it returns the audio binary directly.

Speech to text, text understanding, then back to speech: the response time is too long.

Ideally GPT would understand speech natively and answer in speech, preferably answering while it is still processing.

Speech-to-text is really only suited to things like voice remote control or voice notes.

I tried calling gpt-4o-audio-preview, a model with native speech support, directly. It is quite a bit faster, although it does not support streaming the audio output either.

(defun chunyang-llm-ask-v2 ()
  "Call the gpt-4o-audio-preview model: ask by voice, answer by voice."
  (interactive)
  (require 'plz)
  (let ((input-audio (chunyang-llm--record-audio))
	;; or gpt-4o-mini-audio-preview
	(model "gpt-4o-audio-preview"))
    (let ((data (plz 'post
		  "https://api.openai.com/v1/chat/completions"
		  :headers `(("Authorization" . ,(format "Bearer %s" (chunyang-llm--openai-token)))
			     ("Content-Type" . "application/json"))
		  :body (with-current-buffer (generate-new-buffer " *chunyang-llm-json*")
			  (insert
			   (json-encode
			    `((model . ,model)
			      (modalities . ["text" "audio"])
			      (audio . ((voice . "alloy") (format . "wav")))
			      (messages . [((role . "user")
					    (content . [((type . "input_audio")
							 (input_audio . ((data . ,(with-temp-buffer
										    (insert-file-contents-literally input-audio)
										    (base64-encode-region (point-min) (point-max))
										    (buffer-string)))
									 (format . "wav"))))]))]))))
			  (current-buffer))
		  :as #'json-read
		  :connect-timeout 30)))
      (let-alist (aref (alist-get 'choices data) 0)
	(let ((output-audio (concat (make-temp-file "audio-") ".wav")))
	  (with-temp-buffer
	    (set-buffer-multibyte nil)
	    (insert .message.audio.data)
	    (base64-decode-region (point-min) (point-max))
	    (write-region nil nil output-audio))
	  (message "GPT: %s" .message.audio.transcript)
	  (chunyang-llm--play-audio output-audio))))))