
Recording Audio Data From Nest Camera


Right before the end of the year, I wrote about my updated approach to recording Nest video streams without having to worry about the Nest Aware subscription by reading from the video stream directly with the help of a .NET-based application I wrote, called FoggyCam.

In this blog post, I have an exciting update for those who rely on the tool as an experimental way to keep their Nest recordings local - it can now record audio as well. I’ll walk through the technical implementation and share some code snippets showing how the content is processed on the machine where the recording happens.

First things first: as I mentioned in my earlier blog post, instead of hackily stitching together the output of the “snapshot” API to get the video stream, I switched to using the Nest stream directly, which is channeled to both the apps and the website using protocol buffers.

There are several packet types that Google sends to the client application, captured in PacketType.cs:

namespace foggycam.Models
{
    public enum PacketType
    {
        PING = 1,
        HELLO = 100,
        PING_CAMERA = 101,
        AUDIO_PAYLOAD = 102,
        START_PLAYBACK = 103,
        STOP_PLAYBACK = 104,
        CLOCK_SYNC_ECHO = 105,
        LATENCY_MEASURE = 106,
        TALKBACK_LATENCY = 107,
        METADATA_REQUEST = 108,
        OK = 200,
        ERROR = 201,
        PLAYBACK_BEGIN = 202,
        PLAYBACK_END = 203,
        PLAYBACK_PACKET = 204,
        LONG_PLAYBACK_PACKET = 205,
        CLOCK_SYNC = 206,
        REDIRECT = 207,
        TALKBACK_BEGIN = 208,
        TALKBACK_END = 209,
        METADATA = 210,
        METADATA_ERROR = 211,
    }
}

So far, I’ve been dealing with LONG_PLAYBACK_PACKET and PLAYBACK_PACKET, which is great but only covers the video part of the stream. If you have a Nest camera in your household, you probably already know that it also captures the audio from its surroundings, so wouldn’t it be neat if I had a way to grab that as well?

The trick was in properly identifying existing packets. Audio data still arrives in standard playback packets, but on a different channel, which is determined by the channel ID:

if (packet.ChannelId == videoChannelId)
{
    Console.WriteLine("[log] Video packet received.");

    // Prepend the Annex B start code so ffmpeg can parse the raw H.264 NAL units.
    byte[] h264Header = { 0x00, 0x00, 0x00, 0x01 };
    var writingBlock = new byte[h264Header.Length + packet.Payload.Length];
    h264Header.CopyTo(writingBlock, 0);
    packet.Payload.CopyTo(writingBlock, h264Header.Length);
}
else if (packet.ChannelId == audioChannelId)
{
    Console.WriteLine("[log] Audio packet received.");
}
else
{
    Console.WriteLine("[log] Unknown channel: " + packet.ChannelId);
}

What’s the process of getting the channel IDs? Parsing the starting playback packet:

if ((CodecType)registeredStream.CodecType == CodecType.H264)
{
    videoChannelId = registeredStream.ChannelId;
}
else if ((CodecType)registeredStream.CodecType == CodecType.AAC)
{
    audioChannelId = registeredStream.ChannelId;
}

Awesome - depending on the codec (H264 for video, AAC for audio), I am now able to identify the exact channels and parse playback packets accordingly.

I just needed to actually get the data, store it in a local buffer, and then merge it with the video stream. To start, I need to make sure that the camera has the audio stream enabled; for that, I can check the camera properties when playback is started, inside the StartPlayback function:

// Assumption: "properties" stands for the camera property bag that
// comes back with the playback metadata.
if ((bool)properties["audio.enabled"])

When packets are received, they are written to a generic list of byte arrays that serves as the “dumping ground” until the content is written to a file.
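A minimal sketch of that buffering, with illustrative names (FoggyCam’s actual fields may differ):

```csharp
using System.Collections.Generic;

public static class StreamBuffers
{
    // The "dumping grounds" for raw payloads, one list per stream.
    public static readonly List<byte[]> VideoStream = new List<byte[]>();
    public static readonly List<byte[]> AudioStream = new List<byte[]>();

    // Routes a playback packet's payload into the right buffer by channel ID.
    public static void Append(int channelId, byte[] payload, int videoChannelId, int audioChannelId)
    {
        if (channelId == videoChannelId)
        {
            VideoStream.Add(payload);
        }
        else if (channelId == audioChannelId)
        {
            AudioStream.Add(payload);
        }
    }
}
```

Because byte arrays are only ever appended, draining the lists later preserves the order in which packets arrived.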

To actually write the content to a file, I first use a ProcessBuffers call, which copies the content of the existing global buffers into local instances before writing them to a file:

private static void ProcessBuffers(List<byte[]> videoStream, List<byte[]> audioStream)
{
    List<byte[]> videoBuffer = new List<byte[]>();
    List<byte[]> audioBuffer = new List<byte[]>();

    for (int i = 0; i < videoStream.Count; i++)
    {
        videoBuffer.Add(videoStream[i]);
    }

    // Ideally, this needs to match the batch of video frames, so we're snapping to the video
    // buffer length as the baseline. I am not yet certain this is a good assumption, but time will tell.
    for (int i = 0; i < videoBuffer.Count; i++)
    {
        // There is a chance there are not enough audio packets,
        // so it's worth pre-emptively catching this scenario.
        if (i < audioStream.Count)
        {
            audioBuffer.Add(audioStream[i]);
        }
    }

    var fileName = DateTime.Now.ToString("yyyy-dd-M--HH-mm-ss") + ".mp4";
    DumpToFile(videoBuffer, audioBuffer, fileName);
}

DumpToFile is then called to process the binary content - it uses the ffmpeg process to first create the video file, and then “mux” the audio stream into it:

static void DumpToFile(List<byte[]> videoBuffer, List<byte[]> audioBuffer, string fileName)
{
    // Compile the initial video file (without any audio).
    var startInfo = new ProcessStartInfo(CONFIG.ffmpeg_path.ToString());
    startInfo.RedirectStandardInput = true;
    startInfo.RedirectStandardOutput = true;
    startInfo.RedirectStandardError = true;
    startInfo.UseShellExecute = false;

    var argumentBuilder = new List<string>();
    argumentBuilder.Add("-loglevel panic");
    argumentBuilder.Add("-f h264");
    argumentBuilder.Add("-i pipe:");
    argumentBuilder.Add("-c:v libx264");
    argumentBuilder.Add("-bf 0");
    argumentBuilder.Add("-pix_fmt yuv420p");
    argumentBuilder.Add(fileName);

    startInfo.Arguments = string.Join(" ", argumentBuilder.ToArray());

    var _ffMpegProcess = new Process();
    _ffMpegProcess.EnableRaisingEvents = true;
    _ffMpegProcess.OutputDataReceived += (s, e) => { Debug.WriteLine(e.Data); };
    _ffMpegProcess.ErrorDataReceived += (s, e) => { Debug.WriteLine(e.Data); };
    _ffMpegProcess.StartInfo = startInfo;

    Console.WriteLine($"[log] Starting write to {fileName}...");

    _ffMpegProcess.Start();

    byte[] fullBuffer = videoBuffer.SelectMany(a => a).ToArray();
    Console.WriteLine("Full buffer: " + fullBuffer.Length);

    using (var memoryStream = new MemoryStream(fullBuffer))
    {
        memoryStream.CopyTo(_ffMpegProcess.StandardInput.BaseStream);
    }
    _ffMpegProcess.StandardInput.BaseStream.Close();

    // Crude synchronization: wait until no ffmpeg process is running
    // before starting the second pass.
    Process[] pname = Process.GetProcessesByName("ffmpeg");
    while (pname.Length > 0)
    {
        pname = Process.GetProcessesByName("ffmpeg");
    }

    argumentBuilder = new List<string>();
    argumentBuilder.Add($"-i {fileName}");
    // Assumption: declare the piped input as raw AAC so ffmpeg doesn't
    // have to probe it.
    argumentBuilder.Add("-f aac");
    argumentBuilder.Add("-i pipe:");
    // Assumption: copy both streams without re-encoding and write to a
    // separate muxed file.
    argumentBuilder.Add("-c copy");
    argumentBuilder.Add("muxed_" + fileName);

    startInfo.Arguments = string.Join(" ", argumentBuilder.ToArray());

    var _ffMpegAudioProcess = new Process();
    _ffMpegAudioProcess.EnableRaisingEvents = true;
    _ffMpegAudioProcess.OutputDataReceived += (s, e) => { Debug.WriteLine(e.Data); };
    _ffMpegAudioProcess.ErrorDataReceived += (s, e) => { Debug.WriteLine(e.Data); };

    _ffMpegAudioProcess.StartInfo = startInfo;

    Console.WriteLine($"[log] Starting mux audio to {fileName}...");

    try
    {
        _ffMpegAudioProcess.Start();

        Console.WriteLine("[log] Got access to the process input stream.");
        foreach (var byteSet in audioBuffer)
        {
            _ffMpegAudioProcess.StandardInput.BaseStream.Write(byteSet, 0, byteSet.Length);
        }
        _ffMpegAudioProcess.StandardInput.BaseStream.Close();
        Console.WriteLine("[log] Done writing input stream.");

        pname = Process.GetProcessesByName("ffmpeg");
        while (pname.Length > 0)
        {
            pname = Process.GetProcessesByName("ffmpeg");
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine("[error] An error occurred writing the audio file.");
        Console.WriteLine($"[error] {ex.Message}");
    }

    Console.WriteLine($"[log] Writing of {fileName} completed.");
}
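For reference, the two passes are roughly what these standalone ffmpeg invocations would do (file names here are illustrative - FoggyCam pipes the data over stdin rather than reading files):

```shell
# Pass 1: wrap the raw Annex B H.264 elementary stream in an MP4 container.
ffmpeg -loglevel panic -f h264 -i video.h264 -c:v libx264 -bf 0 -pix_fmt yuv420p video-only.mp4

# Pass 2: mux the raw AAC stream into the video file without re-encoding.
ffmpeg -i video-only.mp4 -f aac -i audio.aac -c copy final.mp4
```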

This method assumes that all packets were received in order, for both video and audio, which might not always hold - but as a quick-and-easy recording approach it works pretty well. That’s all it took to add audio support for stream recording in FoggyCam - the data was there, it just needed to be captured.

I also haven’t found a reliable way yet to write both streams at once - you can check out my question on Stack Overflow on this topic. I am very much open to suggestions on optimizing my current implementation and removing the need for intermediary files.
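One direction I have in mind (an untested sketch, not something FoggyCam does today): on Linux or macOS, a named pipe could stand in for the intermediary file, so a single ffmpeg pass reads the video over stdin while a second thread feeds the audio through the FIFO:

```shell
# Create a FIFO for the audio stream (paths are illustrative).
mkfifo /tmp/nest_audio.aac

# One pass: video arrives over stdin, audio via the FIFO; another
# process writes the raw AAC bytes into /tmp/nest_audio.aac.
ffmpeg -f h264 -i pipe: -f aac -i /tmp/nest_audio.aac -c:v libx264 -c:a copy output.mp4
```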