
The audio/video out-of-sync problem when soft-encoding audio and merging/recording with ffmpeg


Recently, because of interference problems on the 3.5mm headphone jack, the previous audio access method was abandoned and a network audio stream had to be used again. I ran into some problems along the way, so here is a record of them.
1. Access to network audio streams
Audio stream sources vary; in my case the audio is broadcast over UDP to port 10001, so the audio data can be seen simply by listening on port 10001 of the local machine.

What you need to know in advance is the original sample rate and bit depth of this network audio stream. Here I already know that the sample rate is 61000 Hz and the bit depth is 16.

At this point we can play the audio directly with the ffplay program that ships with ffmpeg. The command is as follows:
.\ffplay.exe -f s16le -ar 61000 -ac 2 udp://0.0.0.0:10001

2. Merging the audio stream and the video stream
Read the network audio stream with ffmpeg, then merge it with the rtsp video stream.
This step is briefly explained:
First, it is similar to reading an rtsp stream: ffmpeg opens the audio stream directly by URL.
As shown below:

if ((ret = ffmpeg.avformat_open_input(&formatContext, _audioUrl, _inputAudioFormat, &options)) < 0)
{
    ffmpeg.av_dict_free(&options);
    ffmpeg.avformat_close_input(&formatContext);
    throw new Exception($"Could not open input path: {_audioUrl} (error '{FFmpegHelper.av_err2str(ret)}')");
}

It should be noted that, because this is a raw network audio stream, it carries no bit-depth or sample-rate information, so we need to specify them in advance.

AVCodecParameters* codecpar = formatContext->streams[_audioIndex]->codecpar;

if (SampleRate > 0)
{
    // The raw s16le stream carries no header, so fill in the known parameters by hand.
    codecpar->sample_rate = 61000;  // known sample rate of this stream
    codecpar->channels = ffmpeg.av_get_channel_layout_nb_channels((ulong)_channelLayout);
}
_inputAudioFormat = ffmpeg.av_find_input_format("s16le");
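
For orientation, below is a minimal sketch of the whole input-setup sequence, assuming the FFmpeg.AutoGen bindings and the same field names used above. Passing sample_rate/channels as demuxer options is an alternative of mine rather than what this article does (the article patches codecpar), and the exact option key names can differ between ffmpeg versions.

unsafe void OpenRawAudioInput()
{
    // The stream has no container, so tell ffmpeg the raw format explicitly: s16le PCM.
    _inputAudioFormat = ffmpeg.av_find_input_format("s16le");

    // Alternative to patching codecpar afterwards: pass the known parameters as
    // demuxer options (option names are an assumption and may vary by version).
    AVDictionary* options = null;
    ffmpeg.av_dict_set(&options, "sample_rate", "61000", 0);
    ffmpeg.av_dict_set(&options, "channels", "2", 0);

    AVFormatContext* formatContext = ffmpeg.avformat_alloc_context();
    int ret = ffmpeg.avformat_open_input(&formatContext, _audioUrl, _inputAudioFormat, &options);
    if (ret < 0)
        throw new Exception($"Could not open input path: {_audioUrl}");

    // Probe the input and locate the audio stream index used later for codecpar.
    ffmpeg.avformat_find_stream_info(formatContext, null);
    _audioIndex = ffmpeg.av_find_best_stream(formatContext, AVMediaType.AVMEDIA_TYPE_AUDIO, -1, -1, null, 0);
    ffmpeg.av_dict_free(&options);
}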

After the initialization and probing above have opened both the network audio stream and the video stream, what remains is the merge.
Merging essentially means taking the audioPackets and videoPackets that were read, decoding each packet into a frame, converting the frame to the output format, re-encoding it into a packet, and finally interleaving the packets into the same _formatContext for saving. (A rather plain-spoken description, 23333)
ffmpeg.av_interleaved_write_frame(_formatContext, outputPacket);
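
As a rough illustration of that pipeline (not the article's exact code; the context fields _audioFormatContext, _audioDecoderContext and _audioEncoderContext are assumptions, while _audioStream and _formatContext follow the snippets in this article), the audio leg of the loop could look like this:

// Hedged sketch of the audio leg of the merge: packet -> frame -> re-encode -> interleave.
unsafe void PumpAudio()
{
    AVPacket* inPacket = ffmpeg.av_packet_alloc();
    AVFrame* frame = ffmpeg.av_frame_alloc();
    AVPacket* outPacket = ffmpeg.av_packet_alloc();

    while (ffmpeg.av_read_frame(_audioFormatContext, inPacket) >= 0)
    {
        // Decode the raw PCM packet into an AVFrame.
        ffmpeg.avcodec_send_packet(_audioDecoderContext, inPacket);
        while (ffmpeg.avcodec_receive_frame(_audioDecoderContext, frame) >= 0)
        {
            // Re-encode the frame for the output container.
            ffmpeg.avcodec_send_frame(_audioEncoderContext, frame);
            while (ffmpeg.avcodec_receive_packet(_audioEncoderContext, outPacket) >= 0)
            {
                outPacket->stream_index = _audioStream->index;
                // Convert timestamps from the encoder time base to the stream time base.
                ffmpeg.av_packet_rescale_ts(outPacket, _audioEncoderContext->time_base, _audioStream->time_base);
                // Interleave audio and video packets into the same output context.
                ffmpeg.av_interleaved_write_frame(_formatContext, outPacket);
                ffmpeg.av_packet_unref(outPacket);
            }
            ffmpeg.av_frame_unref(frame);
        }
        ffmpeg.av_packet_unref(inPacket);
    }
}

Real code would also handle EAGAIN/EOF return codes, and if the output encoder (AAC, for example) expects a different sample format or a fixed frame size, a SwrContext resample step and an audio FIFO would sit between decode and encode; both are omitted here to keep the sketch short.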

3. The problem of audio and video out of synchronization
After implementing the above, I found that in the saved file the audio and video were out of sync.
The phenomenon is:

① During real-time monitoring, the sound fed back through the headphones plays almost in real time: the sound lags slightly behind the actual action, and the video preview lags even further behind the sound.
② When playing back the recording, the sound comes before the action, i.e. the audio is earlier than the video.

Here ① is normal, because the audio is played back through wave and that inevitably takes some time.
But ② is not right: this clearly means the video and audio are not synchronized.
There are several possible causes of audio/video desynchronization:

① Network delay: audio and video data may be affected by network delay during transmission, so they arrive at the receiver at different times. This can be caused by network congestion, unstable transmission paths, and so on.
② Codec delay: encoding and decoding the audio and video may introduce a certain delay, so the playback times drift out of sync. Different codec algorithms and parameter settings can affect this delay.
③ Media synchronization mechanism: the audio and video data in an RTP stream are usually transmitted separately, and the receiver has to play them back in sync based on timestamps and other information. If this synchronization mechanism is implemented incorrectly or is missing, the audio and video will be out of sync.

In fact there are many possible causes, but in the end they all come down to synchronization.

The phenomenon above made me infer directly that the video entered the muxer later, so I suspected that video packets were being read later than audio packets. With that in mind, let's verify it:
The problem can be observed simply by printing, at the points where the first video packet and the first audio packet are written, the time elapsed since the stream was opened. Here I used _stopwatch for accurate timing.
The result did verify the idea: the audio offset is about 35 ms, which can be ignored, while the video lags the audio by about 400-500 ms.
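
For reference, the measurement itself can be as simple as the following sketch (the method and flag names are mine, not the article's):

// _stopwatch is assumed to be started at the moment the streams are opened.
private System.Diagnostics.Stopwatch _stopwatch = System.Diagnostics.Stopwatch.StartNew();
private bool _firstAudioLogged, _firstVideoLogged;

// Log how long after stream start the first packet of each type gets written.
private void LogFirstPacketWrite(bool isAudio)
{
    if (isAudio && !_firstAudioLogged)
    {
        _firstAudioLogged = true;
        Console.WriteLine($"first audio packet written at {_stopwatch.ElapsedMilliseconds} ms");
    }
    else if (!isAudio && !_firstVideoLogged)
    {
        _firstVideoLogged = true;
        Console.WriteLine($"first video packet written at {_stopwatch.ElapsedMilliseconds} ms");
    }
}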

With that confirmed, we can now synchronize the audio and video.

4. Synchronization of audio and video

First of all, the synchronization described here targets the audio/video desynchronization caused by probe-time deviation and codec delay.

The principle is very simple: shift the pts and dts of the video and audio by the offset between the time the first frame of each stream is written and the time the stream was opened, so that the two streams are aligned.

First, from the example above, _stopwatch.ElapsedMilliseconds gives us the offsets between the moment each stream writes its first frame and the moment the stream was opened: here about 35 ms for audio and 400 ms for video. We then shift the pts and dts of both audio and video to the left by their respective offsets, so that the heads of both streams are aligned at the start.
It should be noted here that we need to operate on the outputPacket after the time-base conversion, otherwise the shift will not take effect.
ffmpeg.av_packet_rescale_ts(outputPacket, _audioCodecContext->time_base, _audioStream->time_base);

After the time-base conversion above, taking audio as the example, we compute _sampleStartPts (the audio offset). Because ElapsedMilliseconds is in milliseconds, we divide by 1000 to get seconds and then convert to time-base units (multiplying by the time-base denominator, since my numerator is 1). That gives the actual amount by which pts needs to be shifted.
_sampleStartPts = _stopwatch.ElapsedMilliseconds * _audioStream->time_base.den / 1000;
(Special reminder: my time-base numerator here is 1, so I use the denominator directly. If the numerator is not 1, you must include it in the conversion, as sketched below.)
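If the time-base numerator is not 1, a more general way (my suggestion, not from the article) is to let ffmpeg do the millisecond-to-time-base conversion:

// Convert the measured offset from milliseconds (time base 1/1000) to the stream time base.
// av_rescale_q handles any num/den combination, so the numerator no longer matters.
AVRational msTimeBase = new AVRational { num = 1, den = 1000 };
_sampleStartPts = ffmpeg.av_rescale_q(_stopwatch.ElapsedMilliseconds, msTimeBase, _audioStream->time_base);
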
After getting the ptsOffset (_sampleStartPts), we simply shift the pts and dts of every output packet.

outputPacket->pts -= _sampleStartPts;
outputPacket->dts -= _sampleStartPts;

Because we are aligning the heads of the streams here, the offset is subtracted~
Finally, apply the same operation to the video stream, and the video and audio will be synchronized~
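
Putting the pieces together, the packet-write path for both streams could look roughly like this (a sketch; _videoStartPts is a hypothetical field of mine mirroring _sampleStartPts for the video stream, and the packet is assumed to have already been rescaled to the stream time base):

// Shift every outgoing packet so that the first audio packet and the first video packet
// both land at the head of the file, then interleave into the shared output context.
unsafe void WriteOutputPacket(AVPacket* outputPacket, bool isAudio)
{
    long startPts = isAudio ? _sampleStartPts : _videoStartPts;
    outputPacket->pts -= startPts;
    outputPacket->dts -= startPts;
    ffmpeg.av_interleaved_write_frame(_formatContext, outputPacket);
    ffmpeg.av_packet_unref(outputPacket);
}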

ps: this is an original article by idealy233; please send a private message before reprinting ~