Table of contents
- We have to go deeper
- Understanding the new auth
- Exploring WebSockets and exchanged data
- Getting the video frames
- Building the final video
Way back in 2018, I coded up a little project that let me record my Nest camera stream in a very hacky way. I wanted to get the raw video off the camera without paying for a Nest Aware subscription. Nest Aware did offer quite a few interesting features, but I didn’t need 90% of them - I just wanted a constant recording of my house. Unfortunately, there did not seem to be any API available for this purpose, and the only alternative was making the camera public and then using the streaming endpoint to get the feed - obviously a very insecure way to deal with an indoor camera. I mean, how comfortable are you with the possibility that a stranger can peek inside your living quarters at any point?
At the time, I ended up using an undocumented API exposed through the Nest website that allowed me to get a low-resolution snapshot of the camera stream. I would grab a bunch of JPEGs and then stitch them into something that resembled a video with the help of ffmpeg. The code to do that was an ungodly mess. The thing I am most proud of in that implementation is that I did proper authentication, and even had ways to handle two-factor tokens. Over time, Google shifted how they handle authentication (making users rely on Google accounts, which is a sensible decision), and in doing so they broke foggycam. But even without this change, the quality of the video it produced was very poor. I shelved the app for some time.
Now, like a phoenix rising from the ashes, foggycam rises and becomes better and more useful. I finally got to spend some time investigating what I could do to make the application produce higher-quality output (also known as “video that looks like it came from a real camera”), and I think I finally found a way to do that.
We have to go deeper
When I first built foggycam, I noticed that the Nest website used WebSockets to channel the stream to my browser. Given that at the time (we’re talking 2 years ago) debugging WebSockets was a bit painful, I wanted to see if there was a way for me to get the stream without digging into the world of TCP & UDP, so I started with HTTP APIs first. However, I quickly realized that in order to get something useful out of the project, I would need to deal with sockets eventually, because there was just no way to match the 1080p quality through an HTTP API. Other priorities got in the way, so I never got to dive too deep into what WebSocket sorcery Nest was doing.
But right before rolling into 2021, I figured I would make another attempt at solving the problem and take a look under the hood. I logged in to https://home.nest.com and tried inspecting what kinds of APIs Google used this time.
Understanding the new auth
From what I’ve managed to inspect, it seems that the bulk of the API surface is the same, with the only major change being the Google authentication. More than that, I’ve noticed that in a lot of API calls there is a token being passed along with the request.
Good enough, but that still doesn’t answer how to actually obtain said token. Well, it turns out there is another call made up the stack to the following URL:
Yes! I am dealing with a JWT here, which hopefully should be a breeze to get, if only I can trace its origin. The call made to the Nest API above is a POST request with the following payload format:
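For illustration, community projects that traced this same flow have observed a payload shaped roughly like this - treat the exact field names as assumptions rather than a documented contract:

```json
{
  "embed_google_oauth_access_token": true,
  "expire_after": "3600s",
  "google_oauth_access_token": "<Google OAuth access token>",
  "policy_id": "authproxy-oauth-policy"
}
```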
Cool, halfway there. I now still need to understand where I can get Google’s OAuth token. Luckily, right before the call to /issue-jwt, there is a call to the following URL:
Jackpot - looking at this response, I am able to extract the information necessary:
For the new release of foggycam, I decided to go with .NET, and C# specifically. Because I knew that I would rewrite the entire thing with WebSockets in mind, I thought I would experiment with the platform I am probably most comfortable in. The entire token acquisition function ended up being this:
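A sketch of that flow follows - the endpoint URL and the `jwt` response property are assumptions based on what community reverse-engineering efforts have observed, not official documentation:

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class NestAuth
{
    // Assumption: the issue-jwt endpoint observed in the browser trace.
    private const string IssueJwtUrl =
        "https://nestauthproxyservice-pa.googleapis.com/v1/issue_jwt";

    public static async Task<string> GetJwtAsync(HttpClient client, string googleOAuthToken)
    {
        var payload = JsonSerializer.Serialize(new
        {
            embed_google_oauth_access_token = true,
            expire_after = "3600s",
            google_oauth_access_token = googleOAuthToken,
            policy_id = "authproxy-oauth-policy"
        });

        using var request = new HttpRequestMessage(HttpMethod.Post, IssueJwtUrl)
        {
            Content = new StringContent(payload, Encoding.UTF8, "application/json")
        };
        // Mimic the browser as closely as possible.
        request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0");
        request.Headers.TryAddWithoutValidation("Referer", "https://home.nest.com/");

        var response = await client.SendAsync(request);
        response.EnsureSuccessStatusCode();

        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        // Assumption: the response body carries the token in a "jwt" field.
        return doc.RootElement.GetProperty("jwt").GetString();
    }
}
```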
I tried to mimic my browser in all these requests as much as possible, to make sure I was not tripping some checks that validate the kind of client accessing the APIs. Upon execution of the call above, I am able to work with a clean token! I can now execute Nest API calls with the JWT in hand.
To test this theory out, I built out a function to perform a call to the API that queries available cameras - after all, I did not want to hardcode a camera ID:
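The camera query can be sketched as follows - the endpoint name is what foggycam historically relied on, and the Basic authorization scheme for the JWT is an assumption:

```csharp
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class NestApi
{
    // Assumption: the undocumented camera web API endpoint foggycam used.
    private const string CamerasUrl =
        "https://webapi.camera.home.nest.com/api/cameras.get_owned_and_member_of_with_properties";

    public static async Task<string> GetCamerasAsync(HttpClient client, string jwt)
    {
        using var request = new HttpRequestMessage(HttpMethod.Get, CamerasUrl);
        // Assumption: the JWT rides in the Authorization header with a Basic scheme.
        request.Headers.Authorization = new AuthenticationHeaderValue("Basic", jwt);

        var response = await client.SendAsync(request);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync(); // JSON list of cameras
    }
}
```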
Sure enough, this returned the list of cameras I have available:
The easy stuff was out of the way - I had the auth done, and I had access to the required tokens. Now, I needed to deal with the actual WebSocket stream.
Exploring WebSockets and exchanged data
It was relatively easy to see that WebSockets were in use, because I could just pop open the WS tab in Firefox:
But analyzing those is extremely painful - everything sent to and from the service is in binary format, and analyzing binary WebSocket payloads in any browser is shockingly complicated. By that I mean you are better off rolling dice and guessing what’s inside the payload than trying to read the binary output from either Firefox or Chrome. Wireshark was somewhat useful in this domain, but there was a lot of trial and error in setting up the socket content capture. The tool I ended up using to debug this was Fiddler - the classic version was really good for tracing WebSocket payloads.
It actually showed the binary content sent over the wire in HEX form, just like I wanted. This was as close to perfection as I was going to get. I started tracking what packets were being sent back and forth, and I noticed something interesting - there is an intro packet that seems to contain the camera ID, along with the authentication token that I obtained earlier, after which Nest sends an ACK (OK) response. After that, another request is sent to kick off the broadcast, after which the Nest service streams back what I can only assume were individual video frames.
Digging through the format a bit more, it clicked - they’re using protocol buffers. That still didn’t tell me much about the exact format, but it was a start. I began searching for other projects that had maybe done something similar, and that’s when I stumbled across Brandon McFarlin’s TypeScript implementation of a Nest camera connector.
To make my life even easier, I leveraged the work of Mark Gravell, a software engineer at Stack Overflow, who created a .NET library for protocol buffers - protobuf-net. This library provides a very nice level of abstraction, through which I can create C# classes and decorate them with the necessary attributes to make them serializable. Like this:
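A model along these lines - the protobuf-net attributes are the library's real API, but the field numbers shown here are assumptions that would need to match the traced wire format:

```csharp
using ProtoBuf;

// protobuf-net model sketch for the stream kick-off message.
// Field numbers are assumptions derived from tracing the protocol.
[ProtoContract]
public class StartPlayback
{
    [ProtoMember(1)]
    public int SessionId { get; set; }

    [ProtoMember(2)]
    public int Profile { get; set; }

    [ProtoMember(6)]
    public int[] OtherProfiles { get; set; }
}
```

Serializing an instance is then just a matter of calling `Serializer.Serialize(stream, model)`.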
Unlike Brandon, I did not need to implement a comprehensive camera API connector. I just wanted the simple video stream (I don’t even care about audio at this time) that I can drop into a local MP4 file. With the models defined, it was time to try and send the data to Nest and see if the service would talk back. To do that, I needed a way to talk WebSocket. And again, I leveraged the work of another awesome developer - Kerry Jiang, who put together WebSocket4Net.
I wrapped the connection setup function in a way that initializes the socket and sends the first batch of “hello” data:
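The setup looked roughly like the following - a hedged sketch on top of WebSocket4Net, where the host and the /nexustalk path are placeholder assumptions taken from the browser trace:

```csharp
using System;
using WebSocket4Net;

public class NexusClient
{
    private readonly WebSocket _socket;

    public NexusClient(string host, byte[] helloPacket)
    {
        // The explicit :80 port suffix matters - without it the connection fails
        // (see the call-out below). The /nexustalk path is an assumption.
        _socket = new WebSocket($"wss://{host}:80/nexustalk");

        _socket.Opened += (sender, e) =>
        {
            // First batch of "hello" data: packet type, length prefix, and the
            // ProtoBuf-encoded camera ID + JWT.
            _socket.Send(helloPacket, 0, helloPacket.Length);
        };

        _socket.DataReceived += (sender, e) =>
        {
            // e.Data holds the raw binary payload streamed back by the service.
            Console.WriteLine($"Received {e.Data.Length} bytes");
        };

        _socket.Open();
    }
}
```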
One important call-out here - at first, I was getting a lot of certificate errors, and I thought that Fiddler, my system proxy for debugging purposes, was interfering with the TLS handshake. As it turns out, I did not have the port specified in the target variable. Without the port, the socket connection will yell and curse at you; with the port, it connects just fine. I should’ve looked at the browser WebSocket requests, because then I would’ve learned that there is actually a :80 suffixed to the destination. You live and you learn.
So here is the kicker, and this made debugging really hard. I created a custom function, PreformatData, that would encapsulate the message information in an envelope that Nest services should understand (or so I thought):
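A minimal sketch of that framing helper, assuming the short packet variant with a 2-byte big-endian length prefix (the function name matches the article; the exact envelope details are inferred from the description):

```csharp
using System;

public static class Framing
{
    // Envelope layout: [packet type (1 byte)][body length (2 bytes, big-endian)][body].
    public static byte[] PreformatData(byte packetType, byte[] body)
    {
        byte[] length = BitConverter.GetBytes((ushort)body.Length);
        if (BitConverter.IsLittleEndian)
        {
            Array.Reverse(length); // the wire wants big-endian (network byte order)
        }

        var buffer = new byte[1 + length.Length + body.Length];
        buffer[0] = packetType;
        Buffer.BlockCopy(length, 0, buffer, 1, length.Length);
        Buffer.BlockCopy(body, 0, buffer, 1 + length.Length, body.Length);
        return buffer;
    }
}
```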
Basically, every packet that was sent needed to have a block of bytes defining its length right after the packet type (which is the first byte in the buffer). A shout-out to Brandon here, as I realized this was a piece I was missing in my socket code. As I was fiddling with these values, I could not for the life of me figure out why the Nest service was not responding, despite the fact that I had successfully connected to it. As it turns out, Nest services will not respond at all if they do not understand the message. No error or alert - the message just disappears, and you will never know why.
Fiddler came to the rescue, as I was able to compare what was sent over the wire relatively quickly, thanks once again to the ability to view the HEX-formatted payloads. In my code, two lines saved me:
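Those two lines were along these lines - dumping the outgoing buffer as hex so it can be compared byte-for-byte with Fiddler's view (the buffer contents here are just sample bytes):

```csharp
using System;

byte[] buffer = { 0x01, 0x00, 0x0A };
// Format the outgoing buffer as space-separated hex pairs.
string hex = BitConverter.ToString(buffer).Replace("-", " ");
Console.WriteLine(hex); // 01 00 0A
```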
This allowed me to format my message in a way that enabled comparison with the payloads in Fiddler. What was the issue when I first started sending payloads? I messed up the endianness when defining the message length. With a quick Array.Reverse, I was back on track and able to compose the message correctly. Nest gave me a friendly wave back when I sent the first “hello” authorization packet. I’m in business!
Getting the video frames
Being authorized against the Nest socket server was one piece of the puzzle, but I now needed to address the last challenge - kicking off the stream. To do that, I needed to send a ProtoBuf-encoded payload that starts it, which could be done with the following function:
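A sketch of that kick-off, reusing the StartPlayback model and the PreformatData framing helper described in the text; the packet type value here is an assumption:

```csharp
using System;
using System.IO;
using ProtoBuf;

// Hedged sketch: serialize StartPlayback and ship it through the framed envelope.
private void StartStream(int mainProfile, int[] otherProfiles)
{
    var start = new StartPlayback
    {
        SessionId = new Random().Next(0, 100), // can be any random number
        Profile = mainProfile,                 // primary camera profile (quality)
        OtherProfiles = otherProfiles          // from the camera description call
    };

    using var stream = new MemoryStream();
    Serializer.Serialize(stream, start);

    // Assumption: 103 is the StartPlayback packet type id.
    byte[] framed = PreformatData(103, stream.ToArray());
    _socket.Send(framed, 0, framed.Length);
}
```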
What I am doing here is creating a StartPlayback object that contains the session ID (which can be a random number), the primary camera profile (which determines what stream quality I am processing), and the additional profiles that exist for the camera (extracted from the camera description API call earlier), and then sending it to the socket against which I am already authenticated. The moment that is done, I start receiving packets back. In turn, those packets contain all the video frames I need to start assembling the final output. Specifically, upon deserialization, every packet has a payload byte array that I can use to create the final MP4 file.
Extracting the right information from a received packet is done by ProcessReceivedData, which takes the byte array of the entire payload, removes the packet type and the length, and then hands off the processing to HandlePacketData. There are two packet types here - the long one, which needs 4 bytes to store the length, and the short one, which only uses 2.
HandlePacketData is just a collection of switch statements that determine what I need to do with the packet once it arrives:
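A sketch of that dispatch; the numeric packet type values are assumptions and would need to match the traced protocol:

```csharp
using System;

// Route each incoming packet based on its type byte.
private void HandlePacketData(byte packetType, byte[] body)
{
    switch (packetType)
    {
        case 200: // OK - the service acknowledged the last message
            Console.WriteLine("Acknowledged.");
            break;
        case 201: // error - the service did not like something we sent
            Console.WriteLine("Service reported an error.");
            break;
        case 203: // playback packet - carries an actual video frame
            HandlePlayback(body);
            break;
        default:
            // Clock sync, redirects, and friends are ignored in this sketch.
            break;
    }
}
```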
HandlePlayback is where the rubber meets the road:
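A sketch of what that looks like - PlaybackPacket is the assumed name of the ProtoBuf model for a frame packet, and the 00 00 00 01 prefix is the standard Annex B H.264 start code that lets ffmpeg find NAL unit boundaries later:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

// Annex B H.264 start code prepended to every frame.
private static readonly byte[] H264Start = { 0x00, 0x00, 0x00, 0x01 };
private readonly List<byte[]> _frames = new List<byte[]>();

private void HandlePlayback(byte[] body)
{
    using var stream = new MemoryStream(body);
    var packet = Serializer.Deserialize<PlaybackPacket>(stream);

    var frame = new byte[H264Start.Length + packet.Payload.Length];
    Buffer.BlockCopy(H264Start, 0, frame, 0, H264Start.Length);
    Buffer.BlockCopy(packet.Payload, 0, frame, H264Start.Length, packet.Payload.Length);
    _frames.Add(frame); // a plain list of byte arrays - not efficient, but simple
}
```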
This function does the last bit of processing legwork - it deserializes the packet, appends a H.264 header, and adds it to a generic list of byte arrays (yes, I know it’s not very efficient) that is later used to create the composite video.
Building the final video
Last thing - I need to take all those captured frames and compile them into an MP4 file, as I mentioned earlier. To do that, I run a loop in my console application (again, not very efficient) that relies on ffmpeg to post-process the raw data from standard input and output it to a file that is timestamped right in the name for easy consumption:
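The hand-off to ffmpeg can be sketched like this - the collected Annex B frames are written to stdin and remuxed without re-encoding; the exact flags the project used may differ:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// frames: the List<byte[]> of start-code-prefixed H.264 frames collected earlier.
static void WriteVideo(List<byte[]> frames)
{
    var ffmpeg = new Process
    {
        StartInfo = new ProcessStartInfo
        {
            FileName = "ffmpeg",
            // Read raw H.264 from stdin, copy the stream into a timestamped MP4.
            Arguments = $"-f h264 -i pipe:0 -c:v copy capture-{DateTime.Now:yyyyMMdd-HHmmss}.mp4",
            RedirectStandardInput = true,
            UseShellExecute = false
        }
    };
    ffmpeg.Start();

    foreach (var frame in frames)
    {
        ffmpeg.StandardInput.BaseStream.Write(frame, 0, frame.Length);
    }
    ffmpeg.StandardInput.BaseStream.Flush();
    ffmpeg.StandardInput.Close(); // EOF lets ffmpeg finalize the MP4
    ffmpeg.WaitForExit();
}
```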
Gluing all this together, I ran the app in its final form:
Lo and behold, there was a new MP4 video waiting for me in the folder:
I’m now able to sleep well going into 2021, knowing that I can store my own video freely on my own machine. That is, until Google makes a change to their APIs, rendering all the work above obsolete and useless.
I’d like to express my sincerest thanks to the following folks:
- Brandon McFarlin for providing some answers to less obvious questions around the ProtoBuf format that Nest was using through his extremely well-written TypeScript code.
- Kerry Jiang for putting together a library to deal with WebSockets in C# that has pretty much all the tweaks one would need to get WSS requests working.
- Mark Gravell for creating a protocol buffer library for .NET that removed a lot of the friction from the process.
- Contributors to the foggycam project on GitHub - your feedback and issues over the years helped push me in the right direction.
- Nest/Dropcam engineers for putting together a nice API to stream videos.
This was a fun project to build - I’ve learned a lot about WebSockets, gained a better grasp of the ProtoBuf format, and finally pushed myself out of the comfort of the HTTP API bubble. I want to end this article by saying that this project is just that - an exploration space. You should not use it for any critical workloads, such as home security or proactive alerting, because it’s not stable and might have unintended behaviors. Use at your own risk!
You can check out the project on GitHub.