Make participant encoder/decoder access thread-safe#3606
Make participant encoder/decoder access thread-safe#3606woutd wants to merge 5 commits intomeetecho:masterfrom
Conversation
Replaces gbooleans used for encoding/decoding status with mutexes to ensure all access to the participant encoder and decoder is thread-safe. Additionally, the reset status gboolean is now accessed atomically.
|
Thanks, I'll have a look when I can and get back to you! |
lminiero
left a comment
There was a problem hiding this comment.
I've done a first quick review, so you can find some notes and comments inline, thanks!
| janus_audiobridge_plainrtp_media plainrtp_media; | ||
| janus_mutex pmutex; | ||
| janus_mutex encoding_mutex; /* Encoding mutex to lock encoder instance */ | ||
| janus_mutex decoding_mutex; /* Decoding mutex to lock decoder instance */ |
There was a problem hiding this comment.
I'm wondering if we could use a single mutex for both, considering that the same participant thread takes care of both encoding and decoding, and the only thing we need to ensure is that the instances exist, but that's something we can discuss at a later stage.
There was a problem hiding this comment.
I was wondering the same. I also thought to maybe use the participant pmutex.
There was a problem hiding this comment.
I think it makes more sense to use a different mutex than pmutex, as we don't want other operations to interfere with transcoding, but we can revisit this later.
src/plugins/janus_audiobridge.c
Outdated
| bpkt = (janus_audiobridge_buffer_packet *)jbp.data; | ||
| if(!g_atomic_int_compare_and_exchange(&participant->decoding, 0, 1)) { | ||
| janus_mutex_lock(&participant->decoding_mutex); | ||
| if(!participant->decoder) { |
There was a problem hiding this comment.
I think this is broken. If this participant is using G.711 there won't be a decoder instance. This means that lock and check should only occur for Opus participants, and not G.711 participants (for which we use static tables for encoding/decoding that don't require locks).
There was a problem hiding this comment.
You are right. For sure the participant->decoder == NULL is wrong for G.711.
I used participant->decoding_mutex because there was a g_atomic_int_compare_and_exchange check on decoding at this point.
I will look into this.
There was a problem hiding this comment.
I moved the lock so it only is used when doing Opus decoding. This means it will no longer do a "decoding check" for G.711 but I do not think this changes much compared to the original logic were participant->decoding is briefly set to 1 and back to 0 in janus_audiobridge_hangup_media_internal.
src/plugins/janus_audiobridge.c
Outdated
| if(g_atomic_int_get(&participant->active) && (participant->codec == JANUS_AUDIOCODEC_PCMA || | ||
| participant->codec == JANUS_AUDIOCODEC_PCMU) && g_atomic_int_compare_and_exchange(&participant->encoding, 0, 1)) { | ||
| participant->codec == JANUS_AUDIOCODEC_PCMU)) { | ||
| janus_mutex_lock(&participant->encoding_mutex); |
There was a problem hiding this comment.
Same here: the lock should not be used when we're encoding with G.711, but only when we're using Opus.
There was a problem hiding this comment.
Same reason as above, I used a lock because of the original g_atomic_int_compare_and_exchange on participant->encoding. I will remove it.
lminiero
left a comment
There was a problem hiding this comment.
Apologies if this took a while, I checked your changes and, apart from the considerations we made on whether it makes more sense to keep two distinct mutexes for encoding/encoding, or use just one, it looks fine to me. I haven't made any test yet, but I think it's ok to remove the draft status for now, and let people play with it to provide feedback. Did you keep on testing this in your own setup, by the way? Anything worth noting or has it worked as expected so far?
| janus_audiobridge_plainrtp_media plainrtp_media; | ||
| janus_mutex pmutex; | ||
| janus_mutex encoding_mutex; /* Encoding mutex to lock encoder instance */ | ||
| janus_mutex decoding_mutex; /* Decoding mutex to lock decoder instance */ |
There was a problem hiding this comment.
I think it makes more sense to use a different mutex than pmutex, as we don't want other operations to interfere with transcoding, but we can revisit this later.
|
No problem @lminiero, thanks for picking this up. |
|
yesterday we've got deadlock on audiobridge sessions_mutex, some of threads are waiting for sessions_mutex one thread, which holding sessions_mutex are waiting for participant->decoding_mutex: full log is here: upd: found a deadlock with decoding_mutex, wrote a review comment on pr |
…talking in janus_audiobridge_participant_thread. This can result in a deadlock because participant->decoding_mutex and audiobridge->mutex are locked in reverse order in janus_audiobridge_hangup_media_internal.
|
Thanks for the info @spscream How many users were active during this peak? I can understand the load is higher with the locks, but would expect most of the load comes from the OPUS encoding itself. @lminiero Are there tools available to simulate load/users on the audio bridge? |
I guess that any WebRTC traffic simulator will do, as long as you take care of the Janus signalling. We created one for our own needs, but that is not open source, I'm afraid, as we use it regularly to provide stress testing campaigns as part of our consulting offerings. As an alternative you can use plain RTP participants, as RTP traffic is easier to generate in bulk. Still up to you to orchestrate the Janus signalling to have those users join the AudioBridge. Besides, using plain RTP participants will indeed trigger the same things in the AudioBridge (media path is the same), but has a different threading model than WebRTC, and so the CPU load may be skewed as a consequence. |
~150 ab publishers during peaks. Cpu usage doesn't grow evenly with participants count. Rather, it's sporadic, occurring at random points in time, and doesn't correlate with the number of participants. |
|
I simulated over 1000 audio bridge participants and cannot see any noticeable difference in CPU usage with or without the multi thread fixes. I can also see the "Participant queue-in contains too many packets, clearing now (count=5)" log when reaching the server limit. But I also see this log without the fixes around the same number of participants. |

Replaces gbooleans used for encoding/decoding status with mutexes to ensure all access to the participant encoder and decoder is thread-safe. Additionally, the reset status gboolean is now accessed atomically.
Currently only tested with participants using Opus encoder/decoder.
This needs to be carefully reviewed for performance issues and logic errors.
See also #3587