
Text-To-Speech models


A running list of models and services I want to look into further (to be covered in later posts)

Yonsei University & NAVER: LiteTTS

LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks (Huu-Kim Nguyen, Kihyuk Jeong, Seyun Um, Min-Jae Hwang, Eunwoo Song, Hong-Goo Kang), ISCA Archive, www.isca-speech.org

Combines lightweight versions of Feed-Forward Transformers and HiFi-GAN
Adds a separate domain transfer encoder that learns to associate the text information with the prosody embedding

Microsoft DelightfulTTS 2

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders (arxiv.org)

Fully end-to-end approach
Uses the codes of a vector-quantized (VQ) audio encoder as the intermediate representation, instead of mel-spectrograms
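The basic operation behind a VQ audio encoder is snapping each continuous encoder frame to its nearest entry in a codebook and keeping only that entry's index. Below is a minimal pure-Python sketch, assuming a tiny hand-written 2-D codebook and Euclidean distance; in DelightfulTTS 2 the codebook is learned and the vectors are far higher-dimensional, and the function names here are hypothetical.

```python
import math

def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest (Euclidean) to frame."""
    return min(range(len(codebook)), key=lambda i: math.dist(frame, codebook[i]))

def quantize(frames, codebook):
    """Map each continuous frame to a discrete code index."""
    return [nearest_code(f, codebook) for f in frames]

def dequantize(codes, codebook):
    """Look the indices back up to get the quantized vectors a decoder would consume."""
    return [codebook[c] for c in codes]

# Hypothetical 4-entry codebook of 2-D vectors (a real one is learned).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
frames = [(0.1, -0.2), (0.9, 0.2), (0.4, 0.8), (0.7, 0.6)]

codes = quantize(frames, codebook)
print(codes)  # → [0, 1, 2, 3]
print(dequantize(codes, codebook))
```

Only the integer indices need to be passed between stages, which is what makes the intermediate representation discrete.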

NVIDIA NeMo

GitHub - NVIDIA/NeMo: NeMo: a toolkit for conversational AI (github.com)

Treats the human voice like a musical instrument, precisely controlling the pitch, duration, and intensity of the synthesized voice frame by frame
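In duration-based TTS models such as FastPitch (one of the models NeMo ships), frame-level control comes from predicting a duration in frames for each phoneme and expanding phoneme-level features to frame level, where pitch can then be edited frame by frame. The pure-Python sketch below assumes hypothetical per-phoneme durations and pitch values; the function names are illustrative and are not NeMo's API.

```python
def length_regulate(values, durations):
    """Repeat each phoneme-level value durations[i] times to get frame-level values."""
    if len(values) != len(durations):
        raise ValueError("values and durations must have the same length")
    frames = []
    for value, dur in zip(values, durations):
        frames.extend([value] * dur)
    return frames

def control(durations, pitch_hz, duration_scale=1.0, pitch_shift_hz=0.0):
    """Stretch durations (speaking rate) and shift pitch, then expand to frames."""
    # Round the scaled durations, keeping at least one frame per phoneme.
    scaled = [max(1, round(d * duration_scale)) for d in durations]
    frame_pitch = length_regulate(pitch_hz, scaled)
    # Every frame now carries its own pitch value that can be edited independently.
    return [p + pitch_shift_hz for p in frame_pitch]

# Hypothetical 3-phoneme utterance: durations in frames, pitch in Hz.
durations = [2, 3, 1]
pitch_hz = [120.0, 150.0, 110.0]

print(length_regulate(pitch_hz, durations))
# → [120.0, 120.0, 150.0, 150.0, 150.0, 110.0]
print(control(durations, pitch_hz, duration_scale=2.0, pitch_shift_hz=20.0))
```

Doubling `duration_scale` slows the speech to 12 frames, and `pitch_shift_hz` raises every frame by 20 Hz.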

Google AudioLM

AudioLM: a Language Modeling Approach to Audio Generation (arxiv.org)

"AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space."

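A language model that generates realistic speech or piano music from an audio prompt
Given only a short clip of about 3 seconds, it continues the audio on its own
For speech, it produces natural sentences while preserving the speaker's intonation and speaking style (speech continuation); for music, it extends the melody naturally (music continuation).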
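Once audio is a token sequence, continuation is ordinary next-token prediction: feed the prompt's tokens in, then repeatedly append a likely next token. The sketch below swaps AudioLM's Transformer stages (and its semantic/acoustic token hierarchy) for a toy bigram count model with greedy decoding, and the token IDs are made up; it only illustrates the "continue the prompt" loop.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count next-token frequencies for every token in the training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def continue_prompt(prompt, counts, n_new):
    """Greedily extend a token prompt, as a continuation model extends audio."""
    tokens = list(prompt)
    for _ in range(n_new):
        options = counts.get(tokens[-1])
        if not options:
            break  # no continuation was ever observed after this token
        # most_common(1) returns the highest-count token; ties go to the first seen
        tokens.append(options.most_common(1)[0][0])
    return tokens

# Made-up "audio token" sequences standing in for a tokenized training corpus.
corpus = [[5, 7, 2, 9, 5, 7, 2, 9], [5, 7, 3, 9, 5, 7, 2]]
model = train_bigram(corpus)
print(continue_prompt([5, 7], model, n_new=4))  # → [5, 7, 2, 9, 5, 7]
```

A real system samples from a learned distribution over a large vocabulary and then decodes the tokens back to a waveform; the greedy bigram here just keeps the example deterministic.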

Google MusicLM

MusicLM: Generating Music From Text (Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank), google-research.github.io

Essentially a music version of AudioLM
Trained on 280,000 hours of music, it generates complex, richly layered music

Google SingSong

SingSong: Generating musical accompaniments from singing, Online Supplement (Chris Donahue, Antoine Caillon, Adam Roberts, Ethan Manilow, Philippe Esling, Andrea Agostinelli, Mauro Verzetti, Ian Simon, Olivier Pietquin, Neil Zeghidour, et al.), storage.googleapis.com

A model that generates instrumental accompaniment to match input singing

KT My AI Voice

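KT AI Voice Studio: create your own AI voice, text-to-speech (TTS) site (aivoicestudio.ai)

Offers AI voices with vivid emotion in five languages, plus My AI Voice built from your own voice, for producing YouTube videos, audiobooks, announcements, docent narration, and other content.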

A personalized TTS service that can build a voice from a recording of at most 30 sentences (about 5 minutes)

Microsoft VALL-E

VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, et al.), valle-demo.github.io

Can synthesize speech in any voice given only a 3-second sample of that voice
Pipeline: phonemes → discrete codes → waveform
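VALL-E's discrete codes come from a neural audio codec (EnCodec) that uses residual vector quantization: a first codebook quantizes each frame, a second codebook quantizes the leftover error, and so on, so every frame becomes a short stack of code indices. The sketch below uses scalar "frames" and tiny hand-picked codebooks, all hypothetical; the real codec quantizes learned high-dimensional embeddings. The `prompt_code_count` defaults assume the setup described in the VALL-E paper (8 codebooks at 75 frames per second).

```python
def nearest(value, book):
    """Index of the codebook entry closest to value."""
    return min(range(len(book)), key=lambda i: abs(value - book[i]))

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes what the previous stages left over."""
    codes, residual = [], x
    for book in codebooks:
        idx = nearest(residual, book)
        codes.append(idx)
        residual -= book[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Sum the chosen entry from every stage to reconstruct the value."""
    return sum(book[i] for i, book in zip(codes, codebooks))

def prompt_code_count(seconds, frames_per_second=75, n_codebooks=8):
    """Number of discrete codes representing an audio prompt of the given length."""
    return round(seconds * frames_per_second) * n_codebooks

# Hypothetical 3-stage codebooks with progressively finer steps.
CODEBOOKS = [[-1.0, 0.0, 1.0], [-0.5, 0.0, 0.5], [-0.25, 0.0, 0.25]]

frames = [1.3, -0.6, 0.1]
codes = [rvq_encode(f, CODEBOOKS) for f in frames]
print(codes)                              # → [[2, 2, 0], [0, 2, 1], [1, 1, 1]]
print(rvq_decode(codes[0], CODEBOOKS))    # → 1.25
print(prompt_code_count(3))               # → 1800
```

Under those assumptions, a 3-second prompt becomes 3 × 75 × 8 = 1,800 codes, which the language model conditions on before emitting codes for the new text.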
