2025, Nov 02 19:00
Fix discord.sinks import errors in discord.py: use py-cord[voice] MP3Sink for reliable Discord voice recording and translation bots
Stuck on ModuleNotFoundError: discord.sinks in discord.py? Learn why it happens, how to switch to py-cord[voice] for MP3Sink recording, plus alternatives.
Building a voice translator for Discord looks straightforward until you hit the wall of audio capture. A typical symptom is an import error around discord.sinks while using discord.py, and that stops the whole pipeline before ASR, MT and TTS even get a chance to run. Here is how this mismatch happens and what to do about it without rewriting your entire logic.
Problem recap
The goal was to capture voice in a channel, transcribe it with Whisper, translate it with NLLB, and play back synthesized speech. The implementation relied on MP3Sink from discord.sinks, but with discord.py installed the program failed at import time with a ModuleNotFoundError.
ModuleNotFoundError: No module named 'discord.sinks'
Minimal example that triggers the issue
The following code demonstrates the exact shape of the problem. It attempts to import MP3Sink from discord.sinks and start recording after connecting to a voice channel.
import os
import asyncio
import logging
from io import BytesIO

import discord
from discord.ext import commands
from discord.sinks import MP3Sink
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
    AutoTokenizer,
    AutoModelForSeq2SeqLM
)
from gtts import gTTS

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

BOT_TOKEN = "---------------------"
VOICE_CHAN_ID = 937370496989806607
TEXT_CHAN_ID = 937370496989806605
WHISPER_NAME = "openai/whisper-large-v2"
NLLB_NAME = "facebook/nllb-200-distilled-600M"
SRC_LANG = "eng_Latn"
DST_LANG = "rus_Cyrl"

class ChannelTranslator(commands.Bot):
    def __init__(self, **kwargs):
        intents = discord.Intents.all()
        super().__init__(command_prefix='!', intents=intents)
        self.vc = None
        self.asr_processor = None
        self.asr_model = None
        self.mt_tokenizer = None
        self.mt_model = None
        self.buf_queue = asyncio.Queue()

    async def setup_hook(self):
        await self._load_nlp()
        await self._join_voice()
        self.loop.create_task(self._audio_loop())

    async def _load_nlp(self):
        log.info("Loading models...")
        self.asr_processor = WhisperProcessor.from_pretrained(WHISPER_NAME)
        self.asr_model = WhisperForConditionalGeneration.from_pretrained(WHISPER_NAME).to('cuda')
        self.mt_tokenizer = AutoTokenizer.from_pretrained(NLLB_NAME)
        self.mt_model = AutoModelForSeq2SeqLM.from_pretrained(NLLB_NAME).to('cuda')
        log.info("Models ready.")

    async def _join_voice(self):
        ch = self.get_channel(VOICE_CHAN_ID)
        if ch:
            self.vc = await ch.connect()
            self.vc.start_recording(
                MP3Sink(),
                self._on_chunk,
                self.loop
            )
            log.info("Connected and recording started.")
        else:
            log.warning("Voice channel not found!")

    async def _on_chunk(self, sink, payloads, *args):
        for uid, pkt in payloads.items():
            if uid != self.user.id:
                await self.buf_queue.put(pkt.file.read())

    async def _audio_loop(self):
        while True:
            pcm = await self.buf_queue.get()
            try:
                text = await self._asr(pcm)
                if text.strip():
                    await self._translate_and_say(text)
            except Exception as ex:
                log.error(f"Audio handling error: {ex}")

    async def _asr(self, wav_bytes):
        from transformers import pipeline
        tmp = "temp_input.wav"
        with open(tmp, "wb") as fh:
            fh.write(wav_bytes)
        pipe = pipeline(
            "automatic-speech-recognition",
            model=self.asr_model,
            tokenizer=self.asr_processor,
            device=0
        )
        out = pipe(tmp)
        os.remove(tmp)
        return out['text']

    def _mt(self, text):
        self.mt_tokenizer.src_lang = SRC_LANG
        enc = self.mt_tokenizer(text, return_tensors="pt").to('cuda')
        bos = self.mt_tokenizer.lang_code_to_id[DST_LANG]
        ids = self.mt_model.generate(**enc, forced_bos_token_id=bos)
        return self.mt_tokenizer.batch_decode(ids, skip_special_tokens=True)[0]

    def _tts(self, text, lang='ru'):
        tts = gTTS(text=text, lang=lang)
        buf = BytesIO()
        tts.write_to_fp(buf)
        buf.seek(0)
        return buf

    async def _translate_and_say(self, text):
        phrase = self._mt(text)
        log.info(f"Translated: {phrase}")
        stream = self._tts(phrase)
        await self._speak(stream)

    async def _speak(self, stream):
        src = discord.FFmpegPCMAudio(stream, pipe=True)
        if self.vc.is_playing():
            self.vc.stop()
        self.vc.play(src)

    async def on_ready(self):
        log.info(f"Logged in as {self.user}")

bot = ChannelTranslator()
bot.run(BOT_TOKEN)
Why the error happens
The import fails because discord.sinks is not part of discord.py. The sinks API, including MP3Sink, is available in pycord. Installing discord.py and then importing from discord.sinks produces the exact ModuleNotFoundError shown above. This is not a misconfiguration of your code; it is a library mismatch.
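A complicating detail is that discord.py and py-cord both install the same top-level discord module, so the import error alone does not tell you which distribution you actually have. A quick standard-library check can disambiguate; this is a sketch, and the helper name installed_discord_dist is my own, not part of either library:

```python
from importlib import metadata

def installed_discord_dist():
    """Return (distribution_name, version) for whichever Discord library
    provides the `discord` module, or None if neither is installed."""
    for dist in ("py-cord", "discord.py"):
        try:
            return dist, metadata.version(dist)
        except metadata.PackageNotFoundError:
            continue
    return None

print(installed_discord_dist())
```

Run this in the same environment as the bot: if it reports discord.py, the discord.sinks import is guaranteed to fail.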
Solution
Use Pycord if you want to rely on discord.sinks. The package to install is py-cord, not pycord, and you need the voice extras. Because discord.py and py-cord both ship the same top-level discord module, uninstall discord.py first so the two packages do not overwrite each other, and quote the extras so your shell does not try to expand the brackets. After installing py-cord[voice], the same import works as intended and the rest of the pipeline can proceed.
pip uninstall -y discord.py
pip install "py-cord[voice]"
If you want to stay with discord.py, third-party extensions such as discord-ext-audiorec and discord-ext-voice-recv add audio-receiving capabilities. They may require further adjustments to your code, and some are no longer actively maintained. The choice comes down to first-party sink support in Pycord versus adapting external modules to discord.py.
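Whichever route you take, it helps to fail fast at startup with an actionable message instead of letting the bare traceback surface mid-initialization. A minimal sketch; the require_sinks helper and its error text are my own, not part of either library:

```python
def require_sinks():
    """Import MP3Sink, translating the confusing ModuleNotFoundError
    into a hint about the discord.py / py-cord library mismatch."""
    try:
        # discord.sinks ships with py-cord, not with discord.py
        from discord.sinks import MP3Sink
        return MP3Sink
    except ModuleNotFoundError as exc:
        raise RuntimeError(
            "discord.sinks is unavailable: you are probably running "
            "discord.py. Install py-cord[voice] (and uninstall "
            "discord.py first) to use MP3Sink."
        ) from exc
```

Calling require_sinks() once before connecting to voice turns a cryptic import failure into a one-line explanation of what to install.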
Fixed example with pycord
The core logic below is the same as in the problematic snippet. The crucial change is that it is meant to run under py-cord[voice], which actually provides discord.sinks.MP3Sink.
import os
import asyncio
import logging
from io import BytesIO

import discord
from discord.ext import commands
from discord.sinks import MP3Sink  # available because this runs under py-cord[voice]
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    pipeline
)
from gtts import gTTS

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

API_TOKEN = "---------------------"
VC_ID = 937370496989806607
TC_ID = 937370496989806605
ASR_MODEL_ID = "openai/whisper-large-v2"
MT_MODEL_ID = "facebook/nllb-200-distilled-600M"
LANG_IN = "eng_Latn"
LANG_OUT = "rus_Cyrl"

class AudioRelayBot(commands.Bot):
    def __init__(self, **kwargs):
        intents = discord.Intents.all()
        super().__init__(command_prefix='!', intents=intents)
        self.vclient = None
        self.whisper_tok = None
        self.whisper_net = None
        self.nllb_tok = None
        self.nllb_net = None
        self.queue_audio = asyncio.Queue()

    async def setup_hook(self):
        await self._boot_models()
        # Note: the channel cache may still be empty this early; if
        # get_channel returns None, move _attach_voice into on_ready.
        await self._attach_voice()
        self.loop.create_task(self._worker())

    async def _boot_models(self):
        log.info("Loading models...")
        self.whisper_tok = WhisperProcessor.from_pretrained(ASR_MODEL_ID)
        self.whisper_net = WhisperForConditionalGeneration.from_pretrained(ASR_MODEL_ID).to('cuda')
        self.nllb_tok = AutoTokenizer.from_pretrained(MT_MODEL_ID)
        self.nllb_net = AutoModelForSeq2SeqLM.from_pretrained(MT_MODEL_ID).to('cuda')
        log.info("Models loaded.")

    async def _attach_voice(self):
        vc = self.get_channel(VC_ID)
        if vc:
            self.vclient = await vc.connect()
            # py-cord invokes the callback as callback(sink, *args) when
            # recording stops; the recorded audio lives on the sink itself.
            self.vclient.start_recording(
                MP3Sink(),
                self._on_voice
            )
            log.info("Connected and started recording.")
        else:
            log.warning("Voice channel not found!")

    async def _on_voice(self, sink, *args):
        # sink.audio_data maps user ids to AudioData objects with a .file
        for uid, blob in sink.audio_data.items():
            if uid != self.user.id:
                await self.queue_audio.put(blob.file.read())

    async def _worker(self):
        while True:
            buff = await self.queue_audio.get()
            try:
                phrase = await self._do_asr(buff)
                if phrase.strip():
                    await self._do_translate_and_speak(phrase)
            except Exception as err:
                log.error(f"Processing error: {err}")

    async def _do_asr(self, raw_bytes):
        # MP3Sink produces MP3 data, so the temp file extension should
        # match; the pipeline decodes it via ffmpeg either way.
        tmp_name = "temp_input.mp3"
        with open(tmp_name, "wb") as fx:
            fx.write(raw_bytes)
        recog = pipeline(
            "automatic-speech-recognition",
            model=self.whisper_net,
            tokenizer=self.whisper_tok.tokenizer,
            feature_extractor=self.whisper_tok.feature_extractor,
            device=0
        )
        res = recog(tmp_name)
        os.remove(tmp_name)
        return res['text']

    def _do_translate(self, txt):
        self.nllb_tok.src_lang = LANG_IN
        tb = self.nllb_tok(txt, return_tensors="pt").to('cuda')
        # convert_tokens_to_ids works across transformers versions;
        # the lang_code_to_id mapping is deprecated in newer releases.
        bos_id = self.nllb_tok.convert_tokens_to_ids(LANG_OUT)
        out_ids = self.nllb_net.generate(**tb, forced_bos_token_id=bos_id)
        return self.nllb_tok.batch_decode(out_ids, skip_special_tokens=True)[0]

    def _do_tts(self, txt, lang='ru'):
        t = gTTS(text=txt, lang=lang)
        bio = BytesIO()
        t.write_to_fp(bio)
        bio.seek(0)
        return bio

    async def _do_translate_and_speak(self, txt):
        translated = self._do_translate(txt)
        log.info(f"Translation: {translated}")
        speech = self._do_tts(translated)
        await self._play_audio(speech)

    async def _play_audio(self, fp):
        stream = discord.FFmpegPCMAudio(fp, pipe=True)
        if self.vclient.is_playing():
            self.vclient.stop()
        self.vclient.play(stream)

    async def on_ready(self):
        log.info(f"Bot online as {self.user}")

app = AudioRelayBot()
app.run(API_TOKEN)
About the “Command not found” log
After switching libraries, a separate issue can surface: a log message saying Command "join" is not found. The runtime log shows the bot connecting and then rejecting the join command, which points to the command never being registered, not to the audio stack. A plain method inside a Bot subclass is not picked up as a command automatically; it has to be registered explicitly, for example by decorating a standalone function with @commands.command() and passing it to self.add_command(), or by defining the commands in a commands.Cog and registering it with add_cog(). Either way, this is a different concern from sinks and should be addressed on its own.
Why this matters
Voice receive, sinks, and recording are not universally available across Discord Python libraries. Relying on an API that your installed package doesn’t ship guarantees runtime failures that look like missing modules. Clarifying which ecosystem you target—discord.py with third‑party extensions or pycord with built‑in sinks—saves hours of debugging and avoids accidental coupling to undocumented behaviors.
Takeaways
If you need discord.sinks and MP3Sink, install py-cord with voice extras and continue using the same recording logic. If you prefer staying on discord.py, consider external modules like discord-ext-audiorec or discord-ext-voice-recv, keeping in mind they may need more code changes and can be outdated. Treat command registration issues separately from audio capture, and always keep full error messages; they often contain the single line that reveals the exact source of the problem.
The article is based on a question from StackOverflow by Кирилл Скляров and an answer by furas.