Unit-based Audiovisual Translation for Korean
Text-free Direct Speech Translation with Synchronized Lip Movement
NetfLips is a project that takes an English video as input and generates a Korean-translated video in which the speech and lip movements are synchronized.
- 🎯 Unit-based Translation: Models speech and visual information directly as a shared unit representation, with no intermediate text representation
- Speech & Visual Sync: Aligns audio and video at the unit level in a shared feature space for robust translation
- 🇰🇷 Korean Fine-tuning: Fine-tuning to add Korean, which the base model did not previously support
- 🎬 Natural Synthesis: Natural-sounding speech synthesis and lip-sync generation
#Unit-based Audiovisual Translation #Text-free Direct Speech Translation #Lip Sync #Speech Translation
Demo Link
NetfLips consists of a three-stage pipeline:
- FLAC restoration (wav)
- Feature extraction (Mel Spectrogram)
- K-means classification
- Conversion to an integer sequence
- Base Model: AV2AV (Choi, J., et al., 2024)
- Translation: English unit → Korean unit
- Framework: Unit sequence training based on the Fairseq toolkit
- Backbone: Uses the large-scale pretrained model mBART
- Unit → Audio conversion
- Uses Korean units & speaker embeddings
- Speech Resynthesis
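The K-means classification and integer-sequence steps listed above can be illustrated with a minimal sketch. It assumes that per-frame feature vectors (for example, Mel spectrogram frames) and trained K-means centroids are already available; the function names, toy centroids, and toy frames below are hypothetical and are not taken from the NetfLips code.

```python
import math
from typing import List, Sequence


def nearest_unit(frame: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    best_idx, best_dist = 0, math.inf
    for idx, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(frame, centroid))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def frames_to_units(
    frames: Sequence[Sequence[float]], centroids: Sequence[Sequence[float]]
) -> List[int]:
    """Map every feature frame to its cluster index, giving an integer unit sequence."""
    return [nearest_unit(frame, centroids) for frame in frames]


# Toy 2-D example (assumption: real features have many more dimensions,
# and real codebooks have many more centroids).
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [1.1, 0.8]]

units = frames_to_units(frames, centroids)
print(units)  # [0, 1, 2, 1]
```

Each frame is replaced by the index of its nearest centroid, so an utterance becomes a sequence of integers that a sequence-to-sequence model can treat like tokens.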
This project was trained on the following datasets:
| Dataset | Description | Size |
|---|---|---|
| Zeroth Korean ASR | Korean speech recognition data | 12,245 sentences |
| AIHub Ko-X Interpretation and Translation Speech | Korean-English (US) parallel speech data | 169,488 sentences |
- Required packages and environment setup (to be updated)
- Installation (to be updated)
- Usage example code (to be updated)
- Command-line usage (to be updated)

NetfLips/
├── # to be updated
└── README.md
- Restore FLAC files and convert them to wav
- Mel spectrogram-based feature extraction
- Unit classification via K-means clustering
- mBART-based sequence-to-sequence training
- Uses the Fairseq toolkit
- Unit-to-unit translation optimization
- Speech resynthesis from Korean units
- Natural speech generation using speaker embeddings
- Generation of lip-synced video
- AV2AV: Audio-Visual to Audio-Visual translation model
- Reference: Choi, J., et al., 2024
- Fine-tuning to address the lack of Korean support
- Uses parallel Korean-English speech data
- Unit-level translation training
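For unit-level translation training with Fairseq, parallel unit sequences are typically stored as plain-text files with one utterance per line and whitespace-separated tokens. The sketch below shows one way to write such files. The file naming (`train.en`, `train.ko`), the helper names, and the toy unit sequences are assumptions for illustration and do not describe the project's actual data layout.

```python
import os
import tempfile
from typing import Iterable, List, Sequence, Tuple


def units_to_line(units: Sequence[int]) -> str:
    """Render an integer unit sequence as one whitespace-separated line."""
    return " ".join(str(u) for u in units)


def write_parallel_corpus(
    pairs: Iterable[Tuple[Sequence[int], Sequence[int]]],
    out_dir: str,
    split: str = "train",
    src_lang: str = "en",
    tgt_lang: str = "ko",
) -> Tuple[str, str]:
    """Write line-aligned source/target unit files, e.g. train.en and train.ko."""
    os.makedirs(out_dir, exist_ok=True)
    src_path = os.path.join(out_dir, f"{split}.{src_lang}")
    tgt_path = os.path.join(out_dir, f"{split}.{tgt_lang}")
    with open(src_path, "w", encoding="utf-8") as src_f, \
            open(tgt_path, "w", encoding="utf-8") as tgt_f:
        for src_units, tgt_units in pairs:
            # Line i of the source file must correspond to line i of the target file.
            src_f.write(units_to_line(src_units) + "\n")
            tgt_f.write(units_to_line(tgt_units) + "\n")
    return src_path, tgt_path


# Toy (English units, Korean units) pairs; the values are made up.
pairs: List[Tuple[List[int], List[int]]] = [
    ([12, 12, 7, 93], [5, 41, 41]),
    ([3, 88], [19, 19, 2, 60]),
]

out_dir = tempfile.mkdtemp()
src_path, tgt_path = write_parallel_corpus(pairs, out_dir)

with open(src_path, encoding="utf-8") as f:
    src_lines = f.read().splitlines()
with open(tgt_path, encoding="utf-8") as f:
    tgt_lines = f.read().splitlines()
print(src_lines, tgt_lines)
```

Files in this layout can then be binarized with `fairseq-preprocess` (using `--source-lang`, `--target-lang`, and `--trainpref`), which treats each unit index as a vocabulary token.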
| Name | batch |
|---|---|
| μ₯μ§μ | 6th |
| μ μ§ν | 6th |
| μ κ·μ² | 8th |
| μ΄κ°μ° | 8th |
@misc{netflips2024,
title={NetfLips: Unit-based Audiovisual Translation for Korean},
author={μ₯μ§μ, μ μ§ν, μ κ·μ² , μ΄κ°μ°},
year={2024}
}
- Choi, J., et al. (2024). AV2AV: Audio-Visual to Audio-Visual Translation
This project is distributed under the MIT License. See the LICENSE file for details.
This repository is built upon AV2AV and Fairseq. We thank the authors for open-sourcing these projects.