Skip to content

Make participant encoder/decoder access thread-safe#3606

Open
woutd wants to merge 5 commits intomeetecho:masterfrom
woutd:audiobridge-multithread-fixes
Open

Make participant encoder/decoder access thread-safe#3606
woutd wants to merge 5 commits intomeetecho:masterfrom
woutd:audiobridge-multithread-fixes

Conversation

@woutd
Copy link

@woutd woutd commented Nov 12, 2025

Replaces gbooleans used for encoding/decoding status with mutexes to ensure all access to the participant encoder and decoder is thread-safe. Additionally, the reset status gboolean is now accessed atomically.

Currently only tested with participants using Opus encoder/decoder.

This needs to be carefully reviewed for performance issues and logic errors.

See also #3587

Replaces gbooleans used for encoding/decoding status with mutexes to ensure all access to the participant encoder and decoder is thread-safe.
Additionally, the reset status gboolean is now accessed atomically.
@januscla
Copy link

Thanks for your contribution, @woutd! Please make sure you sign our CLA, as it's a required step before we can merge this.

@lminiero
Copy link
Member

Thanks, I'll have a look when I can and get back to you!

Copy link
Member

@lminiero lminiero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a first quick review, so you can find some notes and comments inline, thanks!

janus_audiobridge_plainrtp_media plainrtp_media;
janus_mutex pmutex;
janus_mutex encoding_mutex; /* Encoding mutex to lock encoder instance */
janus_mutex decoding_mutex; /* Decoding mutex to lock decoder instance */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we could use a single mutex for both, considering that the same participant thread takes care of both encoding and decoding, and the only thing we need to ensure is that the instances exist, but that's something we can discuss at a later stage.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering the same. I also thought to maybe use the participant pmutex.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to use a different mutex than pmutex, as we don't want other operations to interfere with transcoding, but we can revisit this later.

bpkt = (janus_audiobridge_buffer_packet *)jbp.data;
if(!g_atomic_int_compare_and_exchange(&participant->decoding, 0, 1)) {
janus_mutex_lock(&participant->decoding_mutex);
if(!participant->decoder) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is broken. If this participant is using G.711 there won't be a decoder instance. This means that lock and check should only occur for Opus participants, and not G.711 participants (for which we use static tables for encoding/decoding that don't require locks).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. For sure the participant->decoder == NULL is wrong for G.711.

I used participant->decoding_mutex because there was a g_atomic_int_compare_and_exchange check on decoding at this point.

I will look into this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the lock so it only is used when doing Opus decoding. This means it will no longer do a "decoding check" for G.711 but I do not think this changes much compared to the original logic were participant->decoding is briefly set to 1 and back to 0 in janus_audiobridge_hangup_media_internal.

if(g_atomic_int_get(&participant->active) && (participant->codec == JANUS_AUDIOCODEC_PCMA ||
participant->codec == JANUS_AUDIOCODEC_PCMU) && g_atomic_int_compare_and_exchange(&participant->encoding, 0, 1)) {
participant->codec == JANUS_AUDIOCODEC_PCMU)) {
janus_mutex_lock(&participant->encoding_mutex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: the lock should not be used when we're encoding with G.711, but only when we're using Opus.

Copy link
Author

@woutd woutd Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason as above, I used a lock because of the original g_atomic_int_compare_and_exchange on participant->encoding. I will remove it.

Copy link
Member

@lminiero lminiero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies if this took a while, I checked your changes and, apart from the considerations we made on whether it makes more sense to keep two distinct mutexes for encoding/encoding, or use just one, it looks fine to me. I haven't made any test yet, but I think it's ok to remove the draft status for now, and let people play with it to provide feedback. Did you keep on testing this in your own setup, by the way? Anything worth noting or has it worked as expected so far?

janus_audiobridge_plainrtp_media plainrtp_media;
janus_mutex pmutex;
janus_mutex encoding_mutex; /* Encoding mutex to lock encoder instance */
janus_mutex decoding_mutex; /* Decoding mutex to lock decoder instance */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to use a different mutex than pmutex, as we don't want other operations to interfere with transcoding, but we can revisit this later.

@lminiero lminiero marked this pull request as ready for review December 2, 2025 07:37
@woutd
Copy link
Author

woutd commented Dec 2, 2025

No problem @lminiero, thanks for picking this up.
We are running this on development servers since and haven't had any issues so far.

@spscream
Copy link
Contributor

spscream commented Dec 6, 2025

yesterday we've got deadlock on audiobridge sessions_mutex, some of threads are waiting for sessions_mutex

Thread 3469 (LWP 2036810):
#0  0x00007ff99aa17158 in __syscall6 (a6=<optimized out>, a5=<optimized out>, a4=<optimized out>, a3=<optimized out>, a2=<optimized out>, a1=<optimized out>, n=<optimized out>) at ./arch/x86
_64/syscall_arch.h:59
#1  syscall (n=<optimized out>) at src/misc/syscall.c:20
#2  0x00007ff99a6ae81b in ?? () from /usr/lib/libglib-2.0.so.0
#3  0x00007ff999538fd1 in janus_audiobridge_hangup_media (handle=0x7ff9776d8740) at plugins/janus_audiobridge.c:6432
#4  0x00005563b98078d0 in janus_ice_outgoing_traffic_handle (handle=0x7ff9955867f0, pkt=0x5563b9885840 <janus_ice_hangup_peerconnection>) at ice.c:4631
#5  0x00005563b980ad41 in janus_ice_outgoing_traffic_dispatch (source=0x7ff997931e10, callback=<optimized out>, user_data=<optimized out>) at ice.c:528
#6  0x00007ff99a64b255 in ?? () from /usr/lib/libglib-2.0.so.0
#7  0x00007ff99a64e387 in ?? () from /usr/lib/libglib-2.0.so.0
#8  0x00007ff99a64ec77 in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#9  0x00005563b97fc2c8 in janus_ice_handle_thread (data=0x7ff9955867f0) at ice.c:1354
#10 0x00007ff99a67e5f0 in ?? () from /usr/lib/libglib-2.0.so.0
#11 0x00007ff99aa3a34f in start (p=0x7ff955a218e8) at src/thread/pthread_create.c:207
#12 0x00007ff99aa3c965 in __clone () at src/thread/x86_64/clone.s:22
Backtrace stopped: frame did not save the PC

one thread, which holding sessions_mutex are waiting for participant->decoding_mutex:

Thread 3360 (LWP 2034780):
#0  0x00007ff99aa17158 in __syscall6 (a6=<optimized out>, a5=<optimized out>, a4=<optimized out>, a3=<optimized out>, a2=<optimized out>, a1=<optimized out>, n=<optimized out>) at ./arch/x86
_64/syscall_arch.h:59
#1  syscall (n=<optimized out>) at src/misc/syscall.c:20
#2  0x00007ff99a6ae81b in ?? () from /usr/lib/libglib-2.0.so.0
#3  0x00007ff99953865e in janus_audiobridge_hangup_media_internal (handle=handle@entry=0x7ff9606b7bd0) at plugins/janus_audiobridge.c:6517
#4  0x00007ff999538fd9 in janus_audiobridge_hangup_media (handle=0x7ff9606b7bd0) at plugins/janus_audiobridge.c:6433
#5  0x00005563b98078d0 in janus_ice_outgoing_traffic_handle (handle=0x7ff9968589b0, pkt=0x5563b9885840 <janus_ice_hangup_peerconnection>) at ice.c:4631
#6  0x00005563b980ad41 in janus_ice_outgoing_traffic_dispatch (source=0x7ff997b8de90, callback=<optimized out>, user_data=<optimized out>) at ice.c:528
#7  0x00007ff99a64b255 in ?? () from /usr/lib/libglib-2.0.so.0
#8  0x00007ff99a64e387 in ?? () from /usr/lib/libglib-2.0.so.0
#9  0x00007ff99a64ec77 in g_main_loop_run () from /usr/lib/libglib-2.0.so.0
#10 0x00005563b97fc2c8 in janus_ice_handle_thread (data=0x7ff9968589b0) at ice.c:1354
#11 0x00007ff99a67e5f0 in ?? () from /usr/lib/libglib-2.0.so.0
#12 0x00007ff99aa3a34f in start (p=0x7ff957ed58e8) at src/thread/pthread_create.c:207
#13 0x00007ff99aa3c965 in __clone () at src/thread/x86_64/clone.s:22
Backtrace stopped: frame did not save the PC

full log is here:
gdb_0512.log

upd: found a deadlock with decoding_mutex, wrote a review comment on pr

…talking in janus_audiobridge_participant_thread.

This can result in a deadlock because participant->decoding_mutex and audiobridge->mutex are locked in reverse order in janus_audiobridge_hangup_media_internal.
@spscream
Copy link
Contributor

we are getting some issues on this branch with ab.

Under load we got many "Participant queue-in contains too many packets, clearing now (count=5)" errors in logs and users are complaining about sound quality. On version without this pick we don't have any issues(only crashes occures from time to time).

And cpu usage are growing in time of the problems.

image

@woutd
Copy link
Author

woutd commented Jan 29, 2026

Thanks for the info @spscream

How many users were active during this peak?
Without the extra locks for thread safety you can handle the same amount of users without a noticeable extra CPU load?

I can understand the load is higher with the locks, but would expect most of the load comes from the OPUS encoding itself.

@lminiero Are there tools available to simulate load/users on the audio bridge?

@lminiero
Copy link
Member

@lminiero Are there tools available to simulate load/users on the audio bridge?

I guess that any WebRTC traffic simulator will do, as long as you take care of the Janus signalling. We created one for our own needs, but that is not open source, I'm afraid, as we use it regularly to provide stress testing campaigns as part of our consulting offerings.

As an alternative you can use plain RTP participants, as RTP traffic is easier to generate in bulk. Still up to you to orchestrate the Janus signalling to have those users join the AudioBridge. Besides, using plain RTP participants will indeed trigger the same things in the AudioBridge (media path is the same), but has a different threading model than WebRTC, and so the CPU load may be skewed as a consequence.

@spscream
Copy link
Contributor

spscream commented Jan 30, 2026

How many users were active during this peak?

~150 ab publishers during peaks. Cpu usage doesn't grow evenly with participants count. Rather, it's sporadic, occurring at random points in time, and doesn't correlate with the number of participants.

@woutd
Copy link
Author

woutd commented Feb 17, 2026

I simulated over 1000 audio bridge participants and cannot see any noticeable difference in CPU usage with or without the multi thread fixes.

I can also see the "Participant queue-in contains too many packets, clearing now (count=5)" log when reaching the server limit. But I also see this log without the fixes around the same number of participants.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants