feat(tts): implement tts #965

Merged

Changes from all commits (26 commits)
- 336e5c9 wip: implement tts (qbc2016)
- 97bc2eb implement in print (qbc2016)
- e7186b8 remove tts in agentbase (qbc2016)
- 0842d9c modify (qbc2016)
- 4b5de17 update (qbc2016)
- 3897bbf add kwargs in call (qbc2016)
- 4840f79 remove format in DashScopeRealtimeTTSModel (qbc2016)
- 12f40c4 Remove old audio blocks before adding the final one (qbc2016)
- 39d0899 modify according to comments (qbc2016)
- f03d137 add literal voice (qbc2016)
- 81f7fea add readme (qbc2016)
- 77388b4 update (qbc2016)
- 4889a6c refactor(tts): refactor the tts model (DavdGao)
- 7153317 fix (DavdGao)
- b0cb4da add english tutorial for tts (qbc2016)
- 860d1fd update (qbc2016)
- 95f7e23 support stream (qbc2016)
- 2003755 add tts tests (qbc2016)
- 43784e2 docs(tts): add Chinese version and fix some typos (#5) (DavdGao)
- b525dcb add chinese tutorial (qbc2016)
- 20f17c4 bug fix (qbc2016)
- c2b64d2 close (qbc2016)
- f52321a Use `speech` argument in the print method for the "Separation of Conc… (DavdGao)
- bf8b5c1 Merge remote-tracking branch 'agentscope/main' into bc/tts (DavdGao)
- dd5225c finish (DavdGao)
- b8148b5 fix error in unittests (DavdGao)
# -*- coding: utf-8 -*-
"""
.. _tts:

TTS
====================

AgentScope provides a unified interface for Text-to-Speech (TTS) models across multiple API providers.
This tutorial demonstrates how to use TTS models in AgentScope.

AgentScope supports the following TTS APIs:

.. list-table:: Built-in TTS Models
    :header-rows: 1

    * - API
      - Class
      - Streaming Input
      - Non-Streaming Input
      - Streaming Output
      - Non-Streaming Output
    * - DashScope Realtime API
      - ``DashScopeRealtimeTTSModel``
      - ✅
      - ✅
      - ✅
      - ✅
    * - DashScope API
      - ``DashScopeTTSModel``
      - ❌
      - ✅
      - ✅
      - ✅
    * - OpenAI API
      - ``OpenAITTSModel``
      - ❌
      - ✅
      - ✅
      - ✅
    * - Gemini API
      - ``GeminiTTSModel``
      - ❌
      - ✅
      - ✅
      - ✅

.. note:: The streaming input and output of AgentScope TTS models are all accumulative, i.e., each chunk carries all the content produced so far rather than only the delta.

**Choosing the Right Model:**

- **Use non-realtime TTS** when you have complete text ready (e.g., pre-written responses or complete LLM outputs).
- **Use realtime TTS** when text is generated progressively (e.g., streaming LLM responses) for lower latency.

"""

import asyncio
import os

from agentscope.agent import ReActAgent, UserAgent
from agentscope.formatter import DashScopeChatFormatter
from agentscope.message import Msg
from agentscope.model import DashScopeChatModel
from agentscope.tts import (
    DashScopeRealtimeTTSModel,
    DashScopeTTSModel,
)

# %%
# Non-Realtime TTS
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Non-realtime TTS models process complete text inputs and are the simplest
# to use: you can directly call their ``synthesize()`` method.
#
# Taking the DashScope TTS model as an example:

async def example_non_realtime_tts() -> None:
    """A basic example of using non-realtime TTS models."""
    # Example with DashScope TTS
    tts_model = DashScopeTTSModel(
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        model_name="qwen3-tts-flash",
        voice="Cherry",
        stream=False,  # Non-streaming output
    )

    msg = Msg(
        name="assistant",
        content="Hello, this is DashScope TTS.",
        role="assistant",
    )

    # Directly synthesize without connecting
    tts_response = await tts_model.synthesize(msg)

    # tts_response.content contains an audio block with base64-encoded audio data
    print(
        "The length of audio data:",
        len(tts_response.content[0]["source"]["data"]),
    )


asyncio.run(example_non_realtime_tts())
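To play or save the result, the base64 payload must be decoded back to raw bytes. A minimal sketch, assuming the audio-block layout shown above (``content[0]["source"]["data"]`` holding base64 audio); ``save_audio_block`` and the file name are made-up placeholders, and the written bytes are whatever container/codec the provider returned:

```python
import base64


def save_audio_block(audio_block: dict, path: str) -> int:
    """Decode a base64 audio block and write the raw bytes to `path`.

    Returns the number of bytes written.
    """
    raw = base64.b64decode(audio_block["source"]["data"])
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)


# A hypothetical audio block mimicking tts_response.content[0]
block = {"source": {"data": base64.b64encode(b"\x00\x01\x02").decode()}}
print(save_audio_block(block, "output.audio"))  # -> 3
```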

# %%
# **Streaming Output for Lower Latency:**
#
# When ``stream=True``, the model returns audio chunks progressively, allowing
# you to start playback before synthesis completes. This reduces perceived latency.


async def example_non_realtime_tts_streaming() -> None:
    """An example of using non-realtime TTS models with streaming output."""
    # Example with DashScope TTS with streaming output
    tts_model = DashScopeTTSModel(
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        model_name="qwen3-tts-flash",
        voice="Cherry",
        stream=True,  # Enable streaming output
    )

    msg = Msg(
        name="assistant",
        content="Hello, this is DashScope TTS with streaming output.",
        role="assistant",
    )

    # Synthesize and receive an async generator for streaming output
    async for tts_response in await tts_model.synthesize(msg):
        # Process each audio chunk as it arrives
        print(
            "Received audio chunk of length:",
            len(tts_response.content[0]["source"]["data"]),
        )


asyncio.run(example_non_realtime_tts_streaming())
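The consumption pattern can be sketched without any provider by faking the async generator. Everything here is a made-up stand-in (``fake_tts_stream`` is not an AgentScope API); it only demonstrates iterating accumulative chunks with ``async for`` and keeping the latest one:

```python
import asyncio
import base64


async def fake_tts_stream(text: str):
    """Stand-in for a streaming synthesize() call: yields accumulative
    base64 chunks (the 'audio' here is just the encoded text)."""
    encoded = base64.b64encode(text.encode()).decode()
    for end in range(8, len(encoded) + 8, 8):
        await asyncio.sleep(0)  # stand-in for network latency
        yield encoded[:end]


async def consume() -> str:
    last = ""
    async for chunk in fake_tts_stream("Hello, streaming TTS."):
        # Accumulative output: the latest chunk supersedes earlier ones
        last = chunk
    return base64.b64decode(last).decode()


print(asyncio.run(consume()))  # -> Hello, streaming TTS.
```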


# %%
# Realtime TTS
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Realtime TTS models are designed for scenarios where text is generated
# incrementally, such as streaming LLM responses. They achieve the lowest
# possible latency by starting audio synthesis before the complete text is ready.
#
# **Key Concepts:**
#
# - **Stateful Processing**: Realtime TTS maintains state for a single streaming
#   session, identified by ``msg.id``. Only one streaming session can be active
#   at a time.
# - **Two Methods**:
#
#   - ``push(msg)``: Non-blocking method that submits text chunks and returns
#     immediately. It may return partial audio if available.
#   - ``synthesize(msg)``: Blocking method that finalizes the session and returns
#     all remaining audio. When ``stream=True``, it returns an async generator.
#
# .. code-block:: python
#
#     async def example_realtime_tts_streaming():
#         tts_model = DashScopeRealtimeTTSModel(
#             api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
#             model_name="qwen3-tts-flash-realtime",
#             voice="Cherry",
#             stream=False,
#         )
#
#         # The realtime TTS model receives accumulative text chunks
#         res = await tts_model.push(msg_chunk_1)  # non-blocking
#         res = await tts_model.push(msg_chunk_2)  # non-blocking
#         ...
#         res = await tts_model.synthesize(final_msg)  # blocking, get all remaining audio
#
# When ``stream=True`` is set during initialization, the ``synthesize()`` method
# returns an async generator of ``TTSResponse`` objects, allowing you to process
# audio chunks as they arrive.
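The push/synthesize lifecycle above can be modeled as a small state machine. ``ToySessionTTS`` below is purely illustrative and is not the AgentScope ``TTSModelBase`` API; it only captures the single-active-session rule (keyed by a message id) and the accumulative-input convention:

```python
class ToySessionTTS:
    """Toy model of the push/synthesize lifecycle (illustrative only)."""

    def __init__(self) -> None:
        self._session_id: str | None = None
        self._text = ""

    def push(self, msg_id: str, accumulated_text: str) -> None:
        """Non-blocking submit of the text accumulated so far."""
        if self._session_id is None:
            self._session_id = msg_id  # open a new session
        elif self._session_id != msg_id:
            raise RuntimeError("only one streaming session may be active")
        self._text = accumulated_text  # accumulative: replace, don't append

    def synthesize(self, msg_id: str, final_text: str) -> str:
        """Finalize the session and return the 'audio' (here: the text)."""
        self.push(msg_id, final_text)
        audio, self._session_id, self._text = self._text, None, ""
        return audio


tts = ToySessionTTS()
tts.push("msg-1", "Hello")
tts.push("msg-1", "Hello, realtime")
print(tts.synthesize("msg-1", "Hello, realtime TTS."))  # -> Hello, realtime TTS.
```

Finalizing resets the state, so a new session (a new ``msg_id``) may start afterwards.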
#
#
# Integrating with ReActAgent
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# AgentScope agents can automatically synthesize their responses to speech
# when provided with a TTS model. This works seamlessly with both realtime
# and non-realtime TTS models.
#
# **How It Works:**
#
# 1. The agent generates a text response (potentially streamed from an LLM).
# 2. The TTS model synthesizes the text to audio automatically.
# 3. The synthesized audio is attached to the ``speech`` field of the ``Msg`` object.
# 4. The audio is played during the agent's ``self.print()`` method.


async def example_agent_with_tts() -> None:
    """An example of using TTS with ReActAgent."""
    # Create an agent with TTS enabled
    agent = ReActAgent(
        name="Assistant",
        sys_prompt="You are a helpful assistant.",
        model=DashScopeChatModel(
            api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
            model_name="qwen-max",
            stream=True,
        ),
        formatter=DashScopeChatFormatter(),
        # Enable TTS
        tts_model=DashScopeRealtimeTTSModel(
            api_key=os.getenv("DASHSCOPE_API_KEY"),
            model_name="qwen3-tts-flash-realtime",
            voice="Cherry",
        ),
    )
    user = UserAgent("User")

    # Build a conversation just like normal
    msg = None
    while True:
        msg = await agent(msg)
        msg = await user(msg)
        if msg.get_text_content() == "exit":
            break


# %%
# Customizing TTS Model
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# You can create custom TTS implementations by inheriting from ``TTSModelBase``,
# which provides a flexible interface for both realtime and non-realtime TTS
# models. The attribute ``supports_streaming_input`` indicates whether the TTS
# model is realtime or not.
#
# For realtime TTS models, you need to implement the ``connect``, ``close``,
# ``push``, and ``synthesize`` methods to handle the session lifecycle and
# streaming input.
#
# For non-realtime TTS models, you only need to implement the ``synthesize``
# method.
#
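A hedged sketch of the non-realtime case follows. Since the exact ``TTSModelBase`` signature is not reproduced here, ``StandInTTSModelBase`` is a self-contained stand-in that only mimics the shape described above (a ``supports_streaming_input`` attribute plus a ``synthesize`` method); the real base class in ``agentscope.tts`` may differ:

```python
import asyncio
import base64


class StandInTTSModelBase:
    """Stand-in for agentscope.tts.TTSModelBase (illustrative only)."""

    supports_streaming_input: bool = False

    async def synthesize(self, text: str) -> dict:
        raise NotImplementedError


class EchoTTSModel(StandInTTSModelBase):
    """A non-realtime 'TTS' model that returns the text bytes as fake audio."""

    supports_streaming_input = False  # non-realtime: no streaming input

    async def synthesize(self, text: str) -> dict:
        # A real implementation would call a TTS API here and return an
        # audio block; we just base64-encode the input text instead.
        data = base64.b64encode(text.encode()).decode()
        return {"type": "audio", "source": {"type": "base64", "data": data}}


block = asyncio.run(EchoTTSModel().synthesize("hello"))
print(block["source"]["data"])  # -> aGVsbG8= (base64 of b"hello")
```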
# Further Reading
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# - :ref:`agent` - Learn more about agents in AgentScope
# - :ref:`message` - Understand the message format in AgentScope
# - API Reference: :class:`agentscope.tts.TTSModelBase`