Freely Accessing Your Own Nest HD Camera Stream

Table of contents #

Overview
We have to go deeper
Understanding the new auth
Exploring WebSockets and exchanged data
Getting the video frames
Building the final video
Acknowledgments
Conclusion

Overview #

Way back in 2018, I coded up a little project that allowed me to record my Nest camera stream in a very hacky way. I wanted to get the raw video off of the camera without paying for a Nest Aware subscription. The Nest Aware subscription did offer quite a few interesting things, but I didn’t need 90% of them - I just wanted to get a constant recording of my house. Unfortunately, there did not seem to be any API available for this purpose, and the only alternative was making the camera public, and then using the streaming endpoint to get the feed, which is obviously a very insecure way to deal with an indoor camera. I mean, how much do you like the possibility that a stranger can at any point peek inside your living quarters?

At the time, I ended up using an undocumented API, available through the Nest website, that allowed me to get a low-resolution snapshot of the camera stream. I would grab a bunch of JPEGs and then stitch them into something that resembled a video with the help of ffmpeg. The code to do that was an ungodly mess. The thing I am most proud of in that implementation is that I did proper authentication, and even had ways to handle two-factor tokens. Over time, Google changed how they do authentication (making users rely on Google accounts, which is a sensible decision), and that broke the project, which I had named foggycam. But even without this change, the quality of the video that was produced was very poor. I shelved the app for some time.

Now, like the phoenix rises from the ashes, foggycam rises and becomes better and more useful. I finally got to spend some time investigating what I could do to make the application produce higher-quality output (also known as “video that looks like it came from a real camera”), and I think I finally found a way to do that.

We have to go deeper #

When I first built foggycam, I noticed that the Nest website used WebSockets to channel the stream to my browser. Given that at the time (we’re talking 2 years ago) debugging WebSockets was a bit painful, I wanted to see if there was a way for me to get the stream without digging into the world of TCP & UDP, so I started with HTTP APIs first. However, I quickly realized that in order to get something useful out of the project, I would need to deal with sockets eventually, because there was just no way to match the 1080p quality through an HTTP API. Other priorities got in the way, so I never got to actually dive too deep into what WebSocket sorcery Nest was doing.

But, right before rolling into 2021, I figured I would make another attempt at solving the problem, and take a look under the hood. I logged in to https://home.nest.com and tried inspecting what kinds of APIs Google used this time.

Understanding the new auth #

From what I’ve managed to inspect, it seems like the bulk of the API surface is the same, with the only major change being the Google authentication. More than that, I’ve noticed that a lot of API calls pass a token in the Authorization header.

Example of a request with a token attached to it

Good enough, but that still doesn’t answer how to actually obtain said token. Well, it turns out there is another call made up the stack to the following URL:

https://nestauthproxyservice-pa.googleapis.com/v1/issue_jwt

Yes! I am dealing with a JWT token here, that hopefully should be a breeze to get, if only I can trace its origin. The call made to the Nest API above is a POST request with the following payload format:

{
	"policy_id":"authproxy-oauth-policy",
	"google_oauth_access_token":"SOME_ACCESS_TOKEN",
	"embed_google_oauth_access_token":true,
	"expire_after":"3600s"
}

Cool, halfway there. I still need to understand where I can get Google’s OAuth token. Luckily, right before the call to /issue_jwt, there is a call to the following URL:

https://accounts.google.com/o/oauth2/iframerpc?action=issueToken&response_type=token id_token&login_hint=SOME_UNIQUE_HINT&client_id=CLIENT_ID&origin=https://home.nest.com&scope=openid profile email https://www.googleapis.com/auth/nest-account&ss_domain=https://home.nest.com

Jackpot - looking at this response, I am able to extract the information necessary:

{
	"token_type":"Bearer",
	"access_token":"MY_SECRET_TOKEN",
	"scope":"email profile https://www.googleapis.com/auth/userinfo.email https://www.googleapis.com/auth/userinfo.profile openid https://www.googleapis.com/auth/nest-account",
	"login_hint":"LOGIN_HINT",
	"expires_in":3599,
	"id_token":"A_GIANT_ID_TOKEN",
	"session_state":{"extraQueryParams":{"authuser":"0"}}
}
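
The two fields that matter in this response are access_token and expires_in. As a minimal sketch of reading them (this is not foggycam's code, which relies on Json.NET and dynamic), assuming System.Text.Json is available and using a hypothetical TokenResponseParser helper, it might look like this:

```csharp
using System;
using System.Text.Json;

// Hypothetical helper; the name and shape are illustrative, not part of foggycam.
public static class TokenResponseParser
{
    // Extracts the access token and turns "expires_in" (seconds) into an absolute expiry.
    public static (string AccessToken, DateTimeOffset ExpiresAt) Parse(string json, DateTimeOffset receivedAt)
    {
        using var document = JsonDocument.Parse(json);
        var root = document.RootElement;

        var accessToken = root.GetProperty("access_token").GetString();
        if (string.IsNullOrEmpty(accessToken))
        {
            throw new FormatException("The response did not contain an access_token.");
        }

        var expiresIn = root.GetProperty("expires_in").GetInt32();
        return (accessToken, receivedAt.AddSeconds(expiresIn));
    }
}
```

Keeping the expiry next to the token makes it easier to decide when a fresh token has to be requested, since the response above is only valid for about an hour.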

For the new release of foggycam, I decided to go with .NET, and C# specifically. Because I knew that I would rewrite the entire thing with WebSockets in mind, I thought I would experiment with the platform I am probably most comfortable in. The entire token acquisition function ended up being this:

static async Task<string> GetGoogleToken(string issueToken, string cookie)
{
    var tokenUri = new Uri(issueToken);
    // Get returns null (it does not throw) when ss_domain is missing,
    // so check the result explicitly.
    var referrerDomain = HttpUtility.ParseQueryString(tokenUri.Query).Get("ss_domain");
    if (string.IsNullOrEmpty(referrerDomain))
    {
        throw new ArgumentException("[error] Could not parse the referrer domain out of the token.");
    }

    try
    {
        var httpClient = new HttpClient();
        var request = new HttpRequestMessage
        {
            RequestUri = new Uri(issueToken),
            Method = HttpMethod.Get,
            Headers =
            {
                { "Sec-Fetch-Mode", "cors" },
                { "User-Agent", USER_AGENT },
                { "X-Requested-With", "XmlHttpRequest" },
                { "Referer", "https://accounts.google.com/o/oauth2/iframe" },
                { "cookie", cookie }
            }
        };

        var response = await httpClient.SendAsync(request);

        if (response.IsSuccessStatusCode)
        {
            dynamic rawResponse = JsonConvert.DeserializeObject(await response.Content.ReadAsStringAsync());
            var accessToken = rawResponse.access_token;

            var parameters = new Dictionary<string, string>
            {
                { "embed_google_oauth_access_token", "true" },
                { "expire_after", "3600s" },
                { "google_oauth_access_token", $"{accessToken}" },
                { "policy_id", "authproxy-oauth-policy" }
            };
            var encodedContent = new FormUrlEncodedContent(parameters);

            request = new HttpRequestMessage
            {
                RequestUri = new Uri("https://nestauthproxyservice-pa.googleapis.com/v1/issue_jwt"),
                Method = HttpMethod.Post,
                Content = encodedContent,
                Headers =
                {
                    { "Authorization", $"Bearer {accessToken}" },
                    { "User-Agent", USER_AGENT },
                    { "x-goog-api-key", API_KEY },
                    { "Referer", referrerDomain }
                }
            };

            response = await httpClient.SendAsync(request);
            if (response.IsSuccessStatusCode)
            {
                rawResponse = JsonConvert.DeserializeObject(await response.Content.ReadAsStringAsync());
                return rawResponse.jwt;
            }
            else
            {
                Console.WriteLine(response.StatusCode);
            }
        }
    }
    catch (Exception ex)
    {
        // Keep the original exception as the inner exception so the stack trace is not lost.
        throw new ApplicationException($"Could not perform Google authentication. {ex.Message}", ex);
    }

    return null;
}

I tried to mimic my browser in all these requests as much as possible to make sure that I am not tripping up some checks that validate the kind of client that is accessing the APIs. Upon execution of the call above, I am able to work with a clean token! I think I can now execute Nest API calls with the JWT token in hand.

To test this theory out, I built out a function to perform a call to the API that queries available cameras - after all, I did not want to hardcode a camera ID:

static async Task<object> GetCameras(string token)
{
    var httpClient = new HttpClient();
    var request = new HttpRequestMessage
    {
        RequestUri = new Uri($"{CAMERA_API_HOSTNAME}/api/cameras.get_owned_and_member_of_with_properties"),
        Method = HttpMethod.Get,
        Headers =
        {
            { "Cookie", $"user_token={token}" },
            { "User-Agent", USER_AGENT },
            { "Referer", NEST_API_HOSTNAME }
        }
    };

    var response = await httpClient.SendAsync(request);
    if (response.IsSuccessStatusCode)
    {
        var rawResponse = await response.Content.ReadAsStringAsync();

        return JsonConvert.DeserializeObject(rawResponse);
    }

    return null;
}

Sure enough, this returned the list of cameras I have available.

The easy stuff was out of the way - I had the auth done, and I had access to the required tokens. Now, I needed to deal with the actual WebSocket stream.

Exploring WebSockets and exchanged data #

It was relatively easy to see that WebSockets were in use, because I could just pop open the WS tab in Firefox.

But analyzing those is extremely painful - everything sent to and from the service is in binary format and analyzing binary WebSocket payloads through any browser is shockingly complicated. By that I mean that you are better off guessing what’s inside the payload yourself by rolling the dice instead of trying to read the binary output from either Firefox or Chrome. Wireshark was somewhat useful in this domain, but there was a lot of trial-and-error in actually setting up the socket content capture. The tool that I ended up using to debug this was Fiddler - the classic version was really good for tracing WebSocket payloads.

Example of WebSocket analysis with Fiddler

It actually showed the binary content sent over the wire in HEX form, just like I wanted. This is as close to perfection as I was going to get. I started tracking what packets were being sent back and forth, and I noticed something interesting - there is an intro packet that seems to contain the camera ID along with the authentication token that I obtained earlier, after which Nest sends an ACK (OK) response. Then another request is sent to kick off the broadcast, after which the Nest service streams back what I can only assume are individual video frames.

Looking at the JavaScript call tree, I noticed something interesting - every request seems to follow a pattern of creating these “encoded” payloads that are being sent to the service (and are decoded in a similar fashion):

Example of JavaScript function that creates a packet

Digging through the format a bit more, it clicked - they’re using protocol buffers. Still, that doesn’t tell me much about the exact format, but it’s a start. I began searching for other projects that might have done something similar, and that’s when I stumbled across homebridge-nest-cam, a project by Brandon McFarlin that enables one to plug their Nest camera stream into Homebridge. Looking at his code, I quickly realized that we had reached very similar conclusions, and that Brandon had put together the formal representations of the packets before they get shaped into the ProtoBuf format. This means that I can skip going through the obfuscated JavaScript and implement the .NET lookalikes for the data models directly.

To make my life even easier, I leveraged the work of Marc Gravell, software engineer at Stack Overflow, who created a .NET library for protocol buffers - protobuf-net. This library provides a very nice level of abstraction, through which I can create C# classes and decorate them with the necessary attributes to make them serializable. Like this:

using ProtoBuf;

namespace foggycam.Models
{
    [ProtoContract]
    public class PlaybackPacket
    {
        [ProtoMember(1)]
        public int session_id { get; set; }
        [ProtoMember(2)]
        public int channel_id { get; set; }
        [ProtoMember(3)]
        public long timestamp_delta { get; set; }
        [ProtoMember(4)]
        public byte[] payload { get; set; }
        [ProtoMember(5)]
        public int latency_rtp_sequence { get; set; }
        [ProtoMember(6)]
        public int latency_rtp_ssrc { get; set; }
        [ProtoMember(7)]
        public int[] directors_cut_regions { get; set; }
    }
}

Unlike Brandon, I did not need to implement a comprehensive camera API connector. I just wanted the simple video stream (don’t even care about audio at this time) that I can drop in a local MP4 file. With the models defined, it was time to try and send the data to Nest and see if the service will talk back. To do that, I needed to have a way to talk WebSocket. And again, I leveraged the work of another awesome developer - Kerry Jiang, who put together WebSocket4Net.

I wrapped the connection setup function in a way that initializes the socket and sends the first batch of “hello” data:

static void SetupConnection(string host, string cameraUuid, string deviceId, string token)
{
    var tc = new TokenContainer();
    tc.olive_token = token;

    using (var mStream = new MemoryStream())
    {
        Serializer.Serialize(mStream, tc);

        var helloRequestBuffer = new HelloContainer();
        helloRequestBuffer.protocol_version = 3;
        helloRequestBuffer.uuid = cameraUuid;
        helloRequestBuffer.device_id = deviceId;
        helloRequestBuffer.require_connected_camera = false;
        helloRequestBuffer.user_agent = USER_AGENT;
        helloRequestBuffer.client_type = 3;
        // ToArray returns only the bytes written; GetBuffer would also include unused capacity.
        helloRequestBuffer.authorize_request = mStream.ToArray();

        using (var finalMStream = new MemoryStream())
        {
            Serializer.Serialize(finalMStream, helloRequestBuffer);

            var dataBuffer = PreformatData(PacketType.HELLO, finalMStream.ToArray());
            var target = $"wss://{host}:80/nexustalk";
            Console.WriteLine($"[log] Setting up connection to {target}...");

            ws = new WebSocket(target, sslProtocols: SslProtocols.Tls12 | SslProtocols.Tls11 | SslProtocols.Tls);
            ws.EnableAutoSendPing = true;
            ws.AutoSendPingInterval = 5;
            ws.Security.AllowNameMismatchCertificate = true;
            ws.Security.AllowUnstrustedCertificate = true;
            ws.DataReceived += Ws_DataReceived;
            ws.Error += Ws_Error;
            ws.MessageReceived += Ws_MessageReceived;

            ws.Opened += (s, e) =>
            {
                ws.Send(dataBuffer, 0, dataBuffer.Length);
            };
            ws.Open();
        }
    }
}

One important call-out here - at first, I was getting a lot of certificate errors, and I thought that Fiddler, my system proxy for debugging purposes, was interfering with the TLS handshake. As it turns out, I did not have the port specified in the target variable. Without the port, the socket connection will yell and curse at you. With the port, it connects just fine. I should’ve looked at the browser WebSocket requests, because then I would’ve learned that there is actually a :80 suffixed to the destination. You live and you learn.

So here is the kicker, and this made debugging really hard. I created a custom function, PreformatData, that would encapsulate the message information in an envelope that Nest services should understand (or so I thought):

static byte[] PreformatData(PacketType packetType, byte[] buffer)
{
    // Envelope layout: [1 byte packet type][big-endian length][payload].
    // Long playback packets carry a 4-byte length; everything else uses 2 bytes.
    byte[] lengthBytes = packetType == PacketType.LONG_PLAYBACK_PACKET
        ? BitConverter.GetBytes((uint)buffer.Length)
        : BitConverter.GetBytes((ushort)buffer.Length);

    // BitConverter uses the host byte order, while the wire format is big-endian.
    if (BitConverter.IsLittleEndian)
    {
        Array.Reverse(lengthBytes);
    }

    var finalBuffer = new byte[1 + lengthBytes.Length + buffer.Length];
    finalBuffer[0] = (byte)packetType;
    lengthBytes.CopyTo(finalBuffer, 1);
    buffer.CopyTo(finalBuffer, 1 + lengthBytes.Length);

    return finalBuffer;
}

Basically, every packet that was sent needed to have a block of bytes that defines its length after the packet type (which is the first byte in the buffer). A shout-out to Brandon here, as I realized this was a piece I was missing in my socket code. As I was fiddling with these values, I could not for the life of me figure out why the Nest service was not responding, despite the fact that I had successfully connected to it. As it turns out, Nest services will not respond at all if they do not understand the message. No error or alert, the message will just disappear and you will never know.
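
To make the envelope layout concrete, here is a minimal sketch of the same framing written with BinaryPrimitives, which writes big-endian values regardless of the host byte order. The Envelope class and its useLongLength flag are illustrative names rather than foggycam's, and the type byte is treated as an opaque value:

```csharp
using System;
using System.Buffers.Binary;

// Illustrative helper, not part of foggycam.
public static class Envelope
{
    // Builds [type][big-endian length][payload]. The caller decides whether the
    // 4-byte ("long") or the 2-byte length field applies.
    public static byte[] Wrap(byte type, byte[] payload, bool useLongLength)
    {
        var lengthSize = useLongLength ? 4 : 2;
        if (!useLongLength && payload.Length > ushort.MaxValue)
        {
            throw new ArgumentException("Payload does not fit in a 2-byte length field.", nameof(payload));
        }

        var result = new byte[1 + lengthSize + payload.Length];
        result[0] = type;

        var lengthField = result.AsSpan(1, lengthSize);
        if (useLongLength)
        {
            BinaryPrimitives.WriteUInt32BigEndian(lengthField, (uint)payload.Length);
        }
        else
        {
            BinaryPrimitives.WriteUInt16BigEndian(lengthField, (ushort)payload.Length);
        }

        payload.CopyTo(result, 1 + lengthSize);
        return result;
    }
}
```

With an arbitrary type byte of 0x64 and a 300-byte payload, the short header comes out as 64-01-2C, because 300 is 0x012C and the most significant byte goes first. In little-endian order the length bytes would be 2C-01 instead.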

Fiddler came to the rescue, as I was able to compare what was sent over the wire relatively quickly, thanks once again to the ability to view the HEX-formatted payloads. In my code, two lines saved me:

string hex = BitConverter.ToString(finalBuffer);
string tdata = Encoding.ASCII.GetString(FromHex(hex));

This allowed me to format my message in a way that enabled comparison with the payloads in Fiddler. What was the issue when I first started sending payloads? I messed up the endianness when defining the message length. With a quick Array.Reverse, I was back on track and was able to compose the message correctly. Nest gave me a friendly wave back when I sent the first “hello” authorization packet. I’m in business!
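
The FromHex helper used in those two lines is not shown in this article. As a sketch of what such a helper might look like, assuming it parses the dash-separated format that BitConverter.ToString produces, and with a hypothetical FirstDifference method added to locate where two payloads diverge:

```csharp
using System;

// Hypothetical debugging helpers; the real FromHex in foggycam may differ.
public static class HexDebug
{
    // Parses the "AA-BB-CC" format produced by BitConverter.ToString back into bytes.
    public static byte[] FromHex(string hex)
    {
        if (string.IsNullOrEmpty(hex))
        {
            return Array.Empty<byte>();
        }

        var parts = hex.Split('-');
        var bytes = new byte[parts.Length];
        for (var i = 0; i < parts.Length; i++)
        {
            bytes[i] = Convert.ToByte(parts[i], 16);
        }

        return bytes;
    }

    // Returns the index of the first byte that differs, or -1 when both arrays match.
    public static int FirstDifference(byte[] expected, byte[] actual)
    {
        var shared = Math.Min(expected.Length, actual.Length);
        for (var i = 0; i < shared; i++)
        {
            if (expected[i] != actual[i])
            {
                return i;
            }
        }

        return expected.Length == actual.Length ? -1 : shared;
    }
}
```

Pasting a payload copied from Fiddler into FromHex and comparing it with the locally built buffer points straight at the offending offset. A swapped length field, for example, shows up as a difference at index 1.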

Getting the video frames #

Being authorized against the Nest socket server was one piece of the puzzle, but I still needed to address the last challenge: kicking off the stream. To do that, I needed to send a ProtoBuf-encoded payload that starts it, which can be done with the following function:

private static void StartPlayback(dynamic cameraInfo)
{
    var primaryProfile = StreamProfile.VIDEO_H264_2MBIT_L40;

    string[] capabilities = ((JArray)cameraInfo.capabilities).ToObject<string[]>();
    var matchingCapabilities = from c in capabilities where c.StartsWith("streaming.cameraprofile") select c;

    List<int> otherProfiles = new List<int>();
    foreach (var capability in matchingCapabilities)
    {
        var cleanCapability = capability.Replace("streaming.cameraprofile.", "");
        var successParsingEnum = Enum.TryParse(cleanCapability, out StreamProfile targetProfile);

        if (successParsingEnum)
        {
            otherProfiles.Add((int)targetProfile);
        }
    }

    StartPlayback sp = new StartPlayback();
    // No fixed seed here: a constant seed would produce the same session ID on every run.
    sp.session_id = new Random().Next(0, 100);
    sp.profile = (int)primaryProfile;
    sp.other_profiles = otherProfiles.ToArray<int>();

    using (MemoryStream spStream = new MemoryStream())
    {
        Serializer.Serialize(spStream, sp);
        var formattedSPOutput = PreformatData(PacketType.START_PLAYBACK, spStream.ToArray());
        ws.Send(formattedSPOutput, 0, formattedSPOutput.Length);
    }
}

Here I create a StartPlayback object that contains the session ID (which can be a random number), the primary camera profile (which determines the stream quality), and the additional profiles that exist for the camera (extracted from the camera description API call earlier). I then send it over the socket against which I am already authenticated. The moment that is done, I start receiving packets back, and those packets contain all the video frames that I need to assemble the final output. Specifically, upon deserialization, every packet has a payload byte array that I can use to create the final MP4 file.
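
One detail of the capability parsing is worth isolating: Enum.TryParse also succeeds for numeric strings such as "42", even when no enum member has that value. The sketch below shows one way to guard against that with Enum.IsDefined. DemoStreamProfile is a placeholder enum: only the VIDEO_H264_2MBIT_L40 name comes from this article, and the numeric values are made up for the example.

```csharp
using System;
using System.Collections.Generic;

// Placeholder enum: the member values are invented for this example and do not
// reflect the real Nest profile identifiers.
public enum DemoStreamProfile
{
    VIDEO_H264_2MBIT_L40 = 1,
    DEMO_OTHER_PROFILE = 2
}

// Illustrative helper, not part of foggycam.
public static class CapabilityParser
{
    private const string Prefix = "streaming.cameraprofile.";

    // Maps "streaming.cameraprofile.<NAME>" capability strings to profile IDs.
    public static List<int> ToProfileIds(IEnumerable<string> capabilities)
    {
        var ids = new List<int>();
        foreach (var capability in capabilities)
        {
            if (!capability.StartsWith(Prefix, StringComparison.Ordinal))
            {
                continue;
            }

            var name = capability.Substring(Prefix.Length);

            // TryParse accepts numeric strings too, so confirm that the parsed
            // value is a declared member before using it.
            if (Enum.TryParse(name, out DemoStreamProfile profile) &&
                Enum.IsDefined(typeof(DemoStreamProfile), profile))
            {
                ids.Add((int)profile);
            }
        }

        return ids;
    }
}
```

Unknown profile names and unrelated capabilities are skipped silently, which matches the behavior of the loop in StartPlayback.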

Extracting the right information from the received packet is done by ProcessReceivedData, that takes the byte array of the entire payload, removes the packet type and the length, and then hands off the processing to HandlePacketData. In this case, there are two packet types - the long one, that needs 4 bytes to store the length, and the short one, which only uses 2.

private static void ProcessReceivedData(byte[] buffer)
{
    var headerLength = 0;
    uint length = 0;
    var type = 0;

    type = buffer[0];

    try
    {
        Debug.WriteLine("Received packet type: " + (PacketType)type);

        if ((PacketType)type == PacketType.LONG_PLAYBACK_PACKET)
        {
            // Long packets: 1 type byte followed by a 4-byte big-endian length.
            headerLength = 5;
            var lengthBytes = new byte[4];
            Buffer.BlockCopy(buffer, 1, lengthBytes, 0, lengthBytes.Length);
            Array.Reverse(lengthBytes);
            length = BitConverter.ToUInt32(lengthBytes);
            Console.WriteLine("[log] Declared long playback packet length: " + length);
        }
        else
        {
            // All other packets: 1 type byte followed by a 2-byte big-endian length.
            headerLength = 3;
            var lengthBytes = new byte[2];
            Buffer.BlockCopy(buffer, 1, lengthBytes, 0, lengthBytes.Length);
            Array.Reverse(lengthBytes);
            length = BitConverter.ToUInt16(lengthBytes);
            Console.WriteLine("[log] Declared playback packet length: " + length);
        }

        var payloadEndPosition = length + headerLength;

        Index top = headerLength;
        Index bottom = (Index)payloadEndPosition;

        var rawPayload = buffer[top..bottom];
        HandlePacketData((PacketType)type, rawPayload);

    }
    catch (Exception ex)
    {
        Console.WriteLine("[error] Error with packet capture.");
        Console.WriteLine(ex.Message);
    }

}
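
The range slice above throws when the declared length runs past the end of the received buffer, and the function relies on the surrounding try/catch to deal with that. As a sketch of an alternative that checks the bounds up front, assuming the same envelope layout and using an illustrative EnvelopeReader class (not foggycam's):

```csharp
using System;
using System.Buffers.Binary;

// Illustrative helper, not part of foggycam.
public static class EnvelopeReader
{
    // Reads [type][big-endian length][payload] and returns false instead of
    // throwing when the buffer is shorter than the header or the declared length.
    public static bool TryRead(byte[] buffer, bool useLongLength, out byte type, out byte[] payload)
    {
        type = 0;
        payload = Array.Empty<byte>();

        var headerLength = useLongLength ? 5 : 3;
        if (buffer == null || buffer.Length < headerLength)
        {
            return false;
        }

        var lengthField = buffer.AsSpan(1, headerLength - 1);
        long declaredLength = useLongLength
            ? BinaryPrimitives.ReadUInt32BigEndian(lengthField)
            : BinaryPrimitives.ReadUInt16BigEndian(lengthField);

        if (declaredLength > buffer.Length - headerLength)
        {
            return false;
        }

        type = buffer[0];
        payload = buffer.AsSpan(headerLength, (int)declaredLength).ToArray();
        return true;
    }
}
```

A caller would pass useLongLength as true only when the first byte identifies a long playback packet, mirroring the branch in ProcessReceivedData.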

In turn, HandlePacketData is just a switch statement that determines what I need to do with each packet once it arrives:

private static void HandlePacketData(PacketType type, byte[] rawPayload)
{
    switch (type)
    {
        case PacketType.OK:
            authorized = true;
            break;
        case PacketType.PING:
            Console.WriteLine("[log] Ping.");
            break;
        case PacketType.PLAYBACK_BEGIN:
            HandlePlaybackBegin(rawPayload);
            break;
        case PacketType.PLAYBACK_PACKET:
            HandlePlayback(rawPayload);
            break;
        default:
            Console.WriteLine("[streamer] Unknown type.");
            break;
    }
}

HandlePlayback is where the rubber meets the road:

private static void HandlePlayback(byte[] rawPayload)
{
    using (MemoryStream stream = new MemoryStream(rawPayload))
    {
        var packet = Serializer.Deserialize<PlaybackPacket>(stream);

        if (packet.channel_id == videoChannelId)
        {
            byte[] h264Header = { 0x00, 0x00, 0x00, 0x01 };
            var writingBlock = new byte[h264Header.Length + packet.payload.Length];
            h264Header.CopyTo(writingBlock, 0);
            packet.payload.CopyTo(writingBlock, h264Header.Length);

            videoStream.Add(writingBlock);
        }
    }
}

This function does the last bit of processing legwork - it deserializes the packet, prepends the 4-byte H.264 start code (0x00 0x00 0x00 0x01), and adds the result to a generic list of byte arrays (yes, I know it’s not very efficient) that is later used to create the composite video.
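
If the list of byte arrays ever becomes a bottleneck, one possible alternative is to write the start code and payload straight into a single MemoryStream and guard it with a lock, since packets arrive on the socket's thread while another thread collects them. This is only a sketch: AnnexBWriter is a hypothetical class, and it keeps the same assumption as the code above, namely that every payload can simply be prefixed with a start code.

```csharp
using System;
using System.IO;

// Hypothetical buffer; not part of foggycam.
public sealed class AnnexBWriter : IDisposable
{
    private static readonly byte[] StartCode = { 0x00, 0x00, 0x00, 0x01 };
    private readonly object gate = new object();
    private readonly MemoryStream stream = new MemoryStream();

    // Appends one payload, prefixed with the 4-byte start code.
    public void Append(byte[] payload)
    {
        lock (gate)
        {
            stream.Write(StartCode, 0, StartCode.Length);
            stream.Write(payload, 0, payload.Length);
        }
    }

    // Returns everything buffered so far and resets the buffer for the next segment.
    public byte[] Drain()
    {
        lock (gate)
        {
            var data = stream.ToArray();
            stream.SetLength(0);
            return data;
        }
    }

    public void Dispose() => stream.Dispose();
}
```

Drain hands back one contiguous byte array per segment, which can be written to the encoder in a single call.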

Building the final video #

Last thing - I need to take all those captured frames and compile them into an MP4 file, like I mentioned earlier. To do that, I run a loop in my console application (again, not very efficient):

while (true)
{
    StartPlayback(camera.items[0]);
    await Task.Delay(35000);

    List<byte[]> copyList = new List<byte[]>();
    videoStream.ForEach(x => copyList.Add(x));
    videoStream.Clear();

    DumpToFile(copyList, DateTime.Now.ToString("yyyy-dd-M--HH-mm-ss") + ".mp4");
}

Here, DumpToFile uses ffmpeg to post-process the raw data from the standard input, and outputs it to a file that is timestamped right in the name for easy consumption:

static void DumpToFile(List<byte[]> buffer, string filename)
{
    var startInfo = new ProcessStartInfo(@"D:\binaries\ready\ffmpeg.exe");
    startInfo.RedirectStandardInput = true;
    startInfo.RedirectStandardOutput = true;
    startInfo.RedirectStandardError = true;
    startInfo.UseShellExecute = false;

    var argumentBuilder = new List<string>();
    argumentBuilder.Add("-loglevel panic");
    argumentBuilder.Add("-f h264");
    argumentBuilder.Add("-i pipe:");
    argumentBuilder.Add("-c:v libx264");
    argumentBuilder.Add("-bf 0");
    argumentBuilder.Add("-pix_fmt yuv420p");
    argumentBuilder.Add("-an");
    argumentBuilder.Add(filename);

    startInfo.Arguments = String.Join(" ", argumentBuilder.ToArray());

    var _ffMpegProcess = new Process();
    _ffMpegProcess.EnableRaisingEvents = true;
    _ffMpegProcess.OutputDataReceived += (s, e) => { Debug.WriteLine(e.Data); };
    _ffMpegProcess.ErrorDataReceived += (s, e) => { Debug.WriteLine(e.Data); };

    _ffMpegProcess.StartInfo = startInfo;

    Console.WriteLine($"[log] Starting write to {filename}...");

    _ffMpegProcess.Start();
    _ffMpegProcess.BeginOutputReadLine();
    _ffMpegProcess.BeginErrorReadLine();

    for (int i = 0; i < buffer.Count; i++)
    {
        _ffMpegProcess.StandardInput.BaseStream.Write(buffer[i], 0, buffer[i].Length);
    }

    _ffMpegProcess.StandardInput.BaseStream.Close();

    // Wait for ffmpeg to finish encoding before reporting completion.
    _ffMpegProcess.WaitForExit();

    Console.WriteLine($"[log] Writing of {filename} completed.");
}
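
The function above joins the ffmpeg arguments into one string, which works as long as the file name contains no spaces. A minimal sketch of an alternative is to fill ProcessStartInfo.ArgumentList (available in .NET Core 2.1 and later), which takes care of quoting each argument. FfmpegLauncher and its parameters are illustrative names, and the ffmpeg location is passed in rather than hardcoded:

```csharp
using System.Diagnostics;

// Illustrative helper, not part of foggycam.
public static class FfmpegLauncher
{
    // Builds the start info with the same ffmpeg flags used above, one argument per entry.
    public static ProcessStartInfo BuildStartInfo(string ffmpegPath, string outputFile)
    {
        var startInfo = new ProcessStartInfo(ffmpegPath)
        {
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false
        };

        var arguments = new[]
        {
            "-loglevel", "panic",
            "-f", "h264",
            "-i", "pipe:",
            "-c:v", "libx264",
            "-bf", "0",
            "-pix_fmt", "yuv420p",
            "-an",
            outputFile
        };

        foreach (var argument in arguments)
        {
            startInfo.ArgumentList.Add(argument);
        }

        return startInfo;
    }
}
```

The returned object can be assigned to Process.StartInfo exactly like the one built in DumpToFile.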

Gluing all this together, I ran the app in its final form:

Running the console application capturing the Nest stream

Lo and behold, there was a new MP4 video waiting for me in the folder.

I’m now able to sleep well right before 2021, knowing that I can store my own video freely on my own machine, at least until Google makes a change to their APIs that renders all the work above obsolete and useless.

Acknowledgments #

I’d like to express my sincerest thanks to the following folks:

Brandon McFarlin for providing some answers to less obvious questions around the ProtoBuf format that Nest was using through his extremely well-written TypeScript code.
Kerry Jiang for putting together a library to deal with WebSockets in C# that has pretty much all the tweaks one would need to get WSS requests working.
Marc Gravell for creating a protocol buffer library for .NET that removed a lot of the friction from the process.
Contributors to the foggycam project on GitHub - your feedback and issues over the years helped push me in the right direction.
Nest/Dropcam engineers for putting together a nice API to stream videos.

Conclusion #

This was a fun project to build - I’ve learned a lot about WebSockets, have a better grasp of the ProtoBuf format, and I finally pushed myself to get out of the comfort of the HTTP API bubble. I want to end this article by saying that this project is just that - an exploration space. You should not use it for any critical workloads, such as home security or proactive alerting, because it’s not stable, and might have unintended behaviors. Use at your own risk!