[2025-2] Textless Direct Audio-Visual Speech Translation

Prometheus-AI-3team/NetfLips

🎬 NetfLips

Unit-based Audiovisual Translation for Korean
Text-free Direct Speech Translation with Synchronized Lip Movement



📋 Overview

NetfLips is a project that takes an English video as input and generates a Korean-translated video in which the speech and lip movements are synchronized.

✨ Key Features

  • 🎯 Unit-based Translation: models speech and visual information directly as a shared unit representation, with no intermediate text representation
  • 🔊 Speech & Visual Sync: aligns speech and video at the unit level in a shared feature space for robust translation
  • 🇰🇷 Korean Fine-tuning: fine-tuning to add Korean capability, which was not previously supported
  • 💬 Natural Synthesis: natural speech synthesis and lip-sync generation

🎯 Keywords

#Unit-based Audiovisual Translation #Text-free Direct Speech Translation #Lip Sync #Speech Translation


🎥 Demo

🌐 Demo Link

🏗️ Architecture

NetfLips consists of a three-stage pipeline:

1️⃣ Unit Extraction

  • FLAC restoration (wav)
  • Feature extraction (Mel Spectrogram)
  • K-means classification
  • Conversion to an integer sequence
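The last two steps above (K-means classification and conversion to an integer sequence) can be sketched as a nearest-centroid lookup. This is a minimal illustration, not the project's code: the function names, the toy 2-D features, and the three centroids are all assumptions, and real frames would be mel-spectrogram vectors assigned to a much larger codebook.

```python
import math

def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(frame, centroids[k]))

def frames_to_units(frames, centroids):
    """Map each feature frame to the integer ID of its nearest K-means centroid."""
    return [nearest_centroid(f, centroids) for f in frames]

# Toy 2-D features standing in for mel-spectrogram frames (hypothetical values).
centroids = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (3.7, 0.2)]

units = frames_to_units(frames, centroids)
print(units)  # [0, 1, 1, 2]
```

Each utterance then becomes a plain list of integers, which is what the translation stage consumes.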

2️⃣ Unit Translation

  • Base Model: AV2AV (Choi, J., et al., 2024)
  • Translation: English unit → Korean unit
  • Framework: unit sequence training based on the Fairseq toolkit
  • Backbone: uses mBART, a large-scale pretrained model
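Sequence-to-sequence toolkits generally read parallel data as line-aligned text, so one plausible way to feed unit sequences to such a toolkit is to render each sequence as a line of whitespace-separated integer tokens. The sketch below shows only that formatting step. The helper names and the sample unit IDs are hypothetical, and the repository does not yet document its actual data layout.

```python
def units_to_line(units):
    """Render an integer unit sequence as one whitespace-separated token line."""
    return " ".join(str(u) for u in units)

def line_to_units(line):
    """Parse a token line back into a list of integer units."""
    return [int(tok) for tok in line.split()]

def build_parallel_lines(pairs):
    """Turn (english_units, korean_units) pairs into line-aligned source/target lists."""
    src_lines, tgt_lines = [], []
    for en_units, ko_units in pairs:
        if not en_units or not ko_units:
            continue  # drop pairs where either side is empty
        src_lines.append(units_to_line(en_units))
        tgt_lines.append(units_to_line(ko_units))
    return src_lines, tgt_lines

# Hypothetical unit IDs; the second pair is skipped because its source is empty.
pairs = [([12, 7, 7, 301], [45, 45, 9]), ([], [3, 4]), ([88, 2], [17, 260, 5])]
src, tgt = build_parallel_lines(pairs)
print(src)
print(tgt)
```

Line i of the source list and line i of the target list always describe the same utterance, which is the property a parallel corpus needs.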

3️⃣ AV Generation

  • Unit → Audio conversion
  • Uses Korean units and speaker embeddings
  • Speech Resynthesis

📊 Dataset

This project was trained on the following datasets:

| Dataset | Description | Size |
| Zeroth Korean ASR | Korean speech recognition data | 12,245 sentences |
| AIHub Ko-X Interpretation/Translation Speech (Ko-X 통번역 음성) | Korean-English (US) parallel speech data | 169,488 sentences |

🚀 Getting Started

Prerequisites

# Required packages and environment setup (to be updated)

Installation

# Installation instructions (to be updated)

💻 Usage

Quick Start

# Usage example code (to be updated)

Advanced Usage

# Command-line usage (to be updated)

📁 Project Structure

NetfLips/
├── # to be updated
└── README.md

🔬 Methodology

Data Preprocessing

  • Restore FLAC files and convert them to wav
  • Mel Spectrogram-based feature extraction
  • Unit classification via K-means clustering
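As background for the Mel Spectrogram step, the sketch below computes the band edges of a mel filterbank using the common HTK-style formula m = 2595 · log10(1 + f / 700). It illustrates the frequency warping only. The number of bands and the frequency range are hypothetical, since the README does not state the settings used, and a full mel spectrogram would also need a short-time Fourier transform and the triangular filters themselves.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_mels, f_min, f_max):
    """Return n_mels + 2 frequencies (Hz) evenly spaced on the mel scale.

    Each consecutive triple (left, center, right) defines one triangular filter.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

# Hypothetical settings: 8 bands between 0 Hz and 8 kHz.
edges = mel_band_edges(n_mels=8, f_min=0.0, f_max=8000.0)
print([round(f) for f in edges])
```

The edges are dense at low frequencies and sparse at high ones, which is why mel features give more resolution to the range where speech carries most of its information.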

Model Training

  • mBART-based sequence-to-sequence training
  • Uses the Fairseq toolkit
  • Optimization of unit-to-unit translation

Audio-Visual Generation

  • Speech resynthesis from Korean units
  • Natural speech generation using speaker embeddings
  • Generation of video with synchronized lip movement

🛠️ Technical Details

Base Model

  • AV2AV: Audio-Visual to Audio-Visual translation model
  • Reference: Choi, J., et al., 2024

Fine-tuning Strategy

  • Fine-tuning to address the base model's lack of Korean support
  • Uses parallel Korean-English speech data
  • Unit-level translation training

👥 Team Members from Prometheus (AI club)

| Name | Batch |
| 장지수 | 6th |
| 유지혜 | 6th |
| 신규철 | 8th |
| 이가연 | 8th |

📝 Citation

@misc{netflips2024,
  title={NetfLips: Unit-based Audiovisual Translation for Korean},
  author={장지수 and 유지혜 and 신규철 and 이가연},
  year={2024}
}

References

  • Choi, J., et al. (2024). AV2AV: Audio-Visual to Audio-Visual Translation

License

이 ν”„λ‘œμ νŠΈλŠ” MIT λΌμ΄μ„ μŠ€ ν•˜μ— λ°°ν¬λ©λ‹ˆλ‹€. μžμ„Έν•œ λ‚΄μš©μ€ LICENSE νŒŒμΌμ„ μ°Έμ‘°ν•˜μ„Έμš”.


Acknowledgments

This repository is built upon AV2AV and Fairseq. We thank the authors for open-sourcing these projects.
