Building a Telemetry Stack for The Modern Power Company
How Base Power modernized our telemetry stack to lower costs and increase observability.
Background
Base Power provides Distributed Energy Resources (DERs) on the Texas grid by managing a large fleet of home backup batteries. These resources support the grid by providing power during times of high demand.
Grid-scale batteries are seeing wider adoption due to falling battery costs and increased penetration of non-dispatchable generation such as wind and solar. Power supply and demand have to be matched at all times. For example, solar stops producing when the sun goes down which coincides with the time people arrive at home, turn on the AC, cook dinner, and plug in their EV. This has created the infamous duck curve, where natural gas peaker plants have to surge to match the increasing demand and decreasing supply. Batteries allow us to time-shift energy by charging when power is abundant and discharging when power is scarce.
Centralized grid-scale batteries must pass through a multi-year interconnection queue. Base, however, uses distributed grid-scale batteries to quickly deploy storage assets on the grid by locating batteries throughout the distribution infrastructure.
Having real-time visibility into the state of the system is a key challenge when managing a fleet of distributed batteries. To operate the fleet effectively we need to know how power flows between the home, the grid, the batteries, and any onsite solar panels. The largest utilities in Texas participate in Smart Meter Texas, which only provides 15-minute granularity, whereas Base has two-second visibility into power flows.
Defining a Modern Telemetry Stack
Base Power interviews start with a working session. Interviewees are asked to present on how they’d design a system that’s relevant to our current challenges. This is a useful signal because it lets us see how they’d apply their knowledge to aspects of our business. My working session was to design a telemetry stack for a fleet of distributed batteries.
The main challenge is getting the data off the device attached to the member’s home and into the cloud. If we don’t have access to the member’s Wi-Fi we have to rely on often unreliable and expensive 4G backhaul. We don’t want to lose data if the network is down for an extended time or if the device loses power.
My proposal:
Store-and-forward the data locally on device, send over encrypted proto/gRPC to a service in the cloud, and then from there write to a timeseries database. On the server you can optionally enrich the data and dual-write or publish to additional subscribers that need up-to-date telemetry.
The Existing State
When I joined I learned about the state of Base’s existing telemetry system, which was written quickly in the early days of the company as a proof of concept. Our edge software is hosted on a Raspberry Pi that reads metrics from the battery over Modbus. Previously it would package those metrics into a JSON object and publish them over MQTT. An AWS IoT Core rule would then write them to AWS Timestream.
There were a number of problems with the existing stack. JSON is easy to use but has a very verbose wire format that costs a lot to send over 4G. The metrics had no interface definition or compile-time checks, so they could be broken by fat-fingering a key name in JSON. We were using Timestream's single-measure record format, which limited us to 100 metrics per JSON message. And finally, we had unpredictable and high costs from Timestream, with no cost observability to tell us what was driving them.
The telemetry stack wasn't ideal, but it worked well enough that rewriting it wasn't our top concern; we had too many other pressing priorities. However, by keeping the end goal in mind and using it to inform how we built other components of the system, we were able to do a piecemeal migration to a new telemetry stack over a 6-7 month period.
Secure Device to Cloud Communication
One of my first projects at Base was to design a way of configuring our gateways. The gateways are Raspberry Pi devices connected to the battery that run golang software in containers. They are the interface between the cloud and the battery—forwarding telemetry from the battery to the cloud and sending commands to charge and discharge the battery. We needed a declarative way to define the configuration of the system: whether it’s a subpanel or whole home backup, if the member has solar, etc.
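For illustration, a declarative gateway config in this style might look something like the struct below. The field names are invented for this post, not our actual schema:

```go
// GatewayConfig is an illustrative shape for the declarative configuration the
// gateway API could serve; the real field names and structure differ.
type GatewayConfig struct {
	GatewayID   string `json:"gateway_id"`
	BackupScope string `json:"backup_scope"` // e.g. "subpanel" or "whole_home"
	HasSolar    bool   `json:"has_solar"`    // member has onsite solar panels
	Batteries   int    `json:"batteries"`    // 1 for single, 2 for dual ground mount
}
```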
I set up a new gateway API ECS task that exposes a configuration endpoint for gateways to contact. I configured mTLS so the gateway and the cloud service authenticate each other, reusing the same certs we use for connecting to MQTT. This laid the foundation for a secure connection from the device to the cloud.
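Here's a rough sketch of what the server side of an mTLS setup like this can look like in Go. The certificate paths, port, and commented-out service registration are placeholders, not our production values:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// newServerTLS builds a TLS config that requires gateways to present a
// certificate signed by our device CA (file paths are hypothetical).
func newServerTLS() (*tls.Config, error) {
	cert, err := tls.LoadX509KeyPair("server.crt", "server.key")
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile("device-ca.pem") // same CA the MQTT certs chain to
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,
		ClientAuth:   tls.RequireAndVerifyClientCert, // reject gateways without a valid cert
	}, nil
}

func main() {
	tlsCfg, err := newServerTLS()
	if err != nil {
		log.Fatal(err)
	}
	lis, err := net.Listen("tcp", ":8443")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer(grpc.Creds(credentials.NewTLS(tlsCfg)))
	// RegisterGatewayAPIServer(srv, ...) would be generated from the proto.
	log.Fatal(srv.Serve(lis))
}
```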
SQLite on the Edge
Devices receive power commands from the cloud to charge, discharge, or follow home load. These were stored in memory on the gateway and could be lost during container restarts (such as during a software update). We wanted the commands to persist until they expired.
I added SQLite to our gateways and created a volume that attached to the containers for storing the database files. We write commands to the database upon receipt and the gateway checks for the latest command in the database when it first starts. This gave us structured, durable on-device storage.
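A simplified version of a command store like this, assuming a hypothetical `commands` table and the `mattn/go-sqlite3` driver (the actual schema and driver choice differ), looks roughly like:

```go
package gateway

import (
	"database/sql"
	"time"

	_ "github.com/mattn/go-sqlite3" // one common SQLite driver for Go
)

// PowerCommand is an illustrative command shape, not our real type.
type PowerCommand struct {
	ID        int64
	Kind      string // e.g. "charge", "discharge", "follow_load"
	Watts     float64
	ExpiresAt time.Time
}

// openCommandStore opens (or creates) the command database on the attached volume.
func openCommandStore(path string) (*sql.DB, error) {
	db, err := sql.Open("sqlite3", path)
	if err != nil {
		return nil, err
	}
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS commands (
		id INTEGER PRIMARY KEY AUTOINCREMENT,
		kind TEXT NOT NULL,
		watts REAL NOT NULL,
		expires_at TIMESTAMP NOT NULL
	)`)
	return db, err
}

// saveCommand persists a command as soon as it is received from the cloud.
func saveCommand(db *sql.DB, c PowerCommand) error {
	_, err := db.Exec(`INSERT INTO commands (kind, watts, expires_at) VALUES (?, ?, ?)`,
		c.Kind, c.Watts, c.ExpiresAt)
	return err
}

// latestCommand runs on startup to recover the most recent unexpired command.
func latestCommand(db *sql.DB) (*PowerCommand, error) {
	row := db.QueryRow(`SELECT id, kind, watts, expires_at FROM commands
		WHERE expires_at > ? ORDER BY id DESC LIMIT 1`, time.Now())
	var c PowerCommand
	if err := row.Scan(&c.ID, &c.Kind, &c.Watts, &c.ExpiresAt); err != nil {
		if err == sql.ErrNoRows {
			return nil, nil // no pending command; fall back to default behavior
		}
		return nil, err
	}
	return &c, nil
}
```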
Proto/gRPC Telemetry Publishing
In the fall we implemented support for dual ground mount batteries. This required us to talk to and receive telemetry from two batteries at once. For dual-battery systems we'd need to add a new dimension `role` to the Timestream metrics to differentiate between the individual batteries. This was further complicated by the fact that we were already at the limit of the number of metrics we could publish in a single JSON object.
This called for a rewrite of our telemetry publishing and ingestion. I defined a proto file for representing all of our device metrics. We already had the gateway API and so I wrote a new gRPC service endpoint for reporting telemetry to replace the IoT Core rule we were using. Its request object wraps the serialized telemetry proto and includes an enum indicating what type of telemetry is being reported. The server uses proto reflection to iterate over all proto fields and prepare them for ingestion into Timestream.
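The reflection step is the interesting part. A stripped-down sketch of how a handler can walk a telemetry proto's populated fields, assuming scalar fields only (our real code also handles nested messages and the enum-typed request wrapper), might look like this:

```go
package server

import (
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protoreflect"
)

// flattenTelemetry walks every populated field on a telemetry proto and emits
// a name/value pair per field, ready to be prepared as a Timestream measure.
func flattenTelemetry(msg proto.Message) map[string]interface{} {
	out := make(map[string]interface{})
	msg.ProtoReflect().Range(func(fd protoreflect.FieldDescriptor, v protoreflect.Value) bool {
		// This sketch skips nested messages, lists, and maps; real code would recurse.
		if fd.Kind() == protoreflect.MessageKind || fd.IsList() || fd.IsMap() {
			return true
		}
		out[string(fd.Name())] = v.Interface()
		return true // keep iterating
	})
	return out
}
```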
We had configured MQTT store-and-forward and needed to maintain that behavior so we wouldn't lose telemetry during network interruptions. Fortunately we already had durable storage on-device in the form of SQLite. We wrote a little SQLite wrapper that first stores the telemetry to the database, then publishes it to the cloud, and finally deletes the database row if the publish succeeded. If it didn't, a background process periodically looks for unpublished telemetry and attempts to send it to the cloud.
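In sketch form, with a hypothetical `telemetry_outbox` table and a `send` function standing in for the gRPC call, the wrapper works roughly like this:

```go
package gateway

import (
	"context"
	"database/sql"
	"time"
)

// Publisher wraps telemetry publishing with a SQLite-backed outbox.
type Publisher struct {
	db   *sql.DB
	send func(ctx context.Context, payload []byte) error // stands in for the gRPC call
}

// Publish stores the payload durably, attempts to send it, and only deletes
// the row once the cloud has acknowledged it.
func (p *Publisher) Publish(ctx context.Context, payload []byte) error {
	res, err := p.db.ExecContext(ctx, `INSERT INTO telemetry_outbox (payload) VALUES (?)`, payload)
	if err != nil {
		return err
	}
	id, _ := res.LastInsertId()
	if err := p.send(ctx, payload); err != nil {
		return nil // leave the row in place; the retry loop will pick it up
	}
	_, err = p.db.ExecContext(ctx, `DELETE FROM telemetry_outbox WHERE id = ?`, id)
	return err
}

// RetryLoop periodically drains anything that failed to publish, e.g. after a
// 4G outage or a gateway reboot.
func (p *Publisher) RetryLoop(ctx context.Context, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			p.drain(ctx)
		}
	}
}

func (p *Publisher) drain(ctx context.Context) {
	rows, err := p.db.QueryContext(ctx,
		`SELECT id, payload FROM telemetry_outbox ORDER BY id LIMIT 100`)
	if err != nil {
		return
	}
	type pending struct {
		id      int64
		payload []byte
	}
	var batch []pending
	for rows.Next() {
		var pnd pending
		if rows.Scan(&pnd.id, &pnd.payload) == nil {
			batch = append(batch, pnd)
		}
	}
	rows.Close()
	for _, pnd := range batch {
		if p.send(ctx, pnd.payload) == nil {
			p.db.ExecContext(ctx, `DELETE FROM telemetry_outbox WHERE id = ?`, pnd.id)
		}
	}
}
```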
This allowed us to add additional device metrics that had been blocked by the 100-measure limit and quickly launch dual-battery systems. We were now in full control of our telemetry stack.
Datastore Migration
Finally it was time to choose a new datastore for our timeseries data. Our AWS Timestream bill had grown quickly, and the service offered very little observability, so we had no idea what was driving our costs. It was also lacking features you'd expect from a modern database. Considering our use cases, I realized we would be better served by a data warehouse. We looked at BigQuery, Snowflake, ClickHouse, and others, but settled on BigQuery because I had the most experience with it, which would allow for a quick migration.
We extended our Terraform to work with GCP and turned up a basic integration using Workload Identity Federation to authenticate our AWS resources with Google Cloud. We built our own schema syncer to evolve our table schemas in safe ways (no deleting columns, no changing column types) and integrated it into our CI/CD pipeline. With that all set up, we were able to start dual-writing to Timestream and BigQuery.
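To illustrate the syncer's core rule, here is a simplified standalone sketch written against the BigQuery Go client. The real syncer lives in our CI/CD pipeline and handles more cases; the table handling below is just an approximation of the additive-only policy:

```go
package schemasync

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
)

// syncSchema applies only additive changes: existing columns must keep their
// names and types; new columns are appended. Anything else is rejected.
func syncSchema(ctx context.Context, tbl *bigquery.Table, want bigquery.Schema) error {
	md, err := tbl.Metadata(ctx)
	if err != nil {
		return err
	}
	current := map[string]*bigquery.FieldSchema{}
	for _, f := range md.Schema {
		current[f.Name] = f
	}
	// Every existing column must still be present with the same type.
	for _, f := range md.Schema {
		w := findField(want, f.Name)
		if w == nil {
			return fmt.Errorf("column %q removed from desired schema; deletions are not allowed", f.Name)
		}
		if w.Type != f.Type {
			return fmt.Errorf("column %q changes type %s -> %s; type changes are not allowed", f.Name, f.Type, w.Type)
		}
	}
	// Append any brand-new columns to the existing schema.
	merged := md.Schema
	for _, w := range want {
		if _, ok := current[w.Name]; !ok {
			merged = append(merged, w)
		}
	}
	if len(merged) == len(md.Schema) {
		return nil // nothing to do
	}
	_, err = tbl.Update(ctx, bigquery.TableMetadataToUpdate{Schema: merged}, md.ETag)
	return err
}

func findField(s bigquery.Schema, name string) *bigquery.FieldSchema {
	for _, f := range s {
		if f.Name == name {
			return f
		}
	}
	return nil
}
```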
We partitioned our telemetry table by time and clustered on the gateway ID to minimize query costs. A key goal in the migration was better cost observability, so I created a Grafana dashboard that queries the INFORMATION_SCHEMA tables to calculate query costs on a per-dashboard basis. I modified our query wrapper to read the gRPC service and method from the golang context and attach them as labels on the BigQuery request so that I could also aggregate costs per service.
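The query wrapper change is small. A sketch of the idea, using `grpc.Method` to recover the caller and BigQuery job labels to carry it (the method name in the comment is made up for illustration), looks like this:

```go
package bq

import (
	"context"
	"strings"

	"cloud.google.com/go/bigquery"
	"google.golang.org/grpc"
)

// runQuery tags every BigQuery job with the gRPC service and method that
// issued it, so costs can later be grouped by caller via INFORMATION_SCHEMA.
func runQuery(ctx context.Context, client *bigquery.Client, sql string) (*bigquery.RowIterator, error) {
	q := client.Query(sql)
	if full, ok := grpc.Method(ctx); ok {
		// full looks like "/base.fleet.GatewayAPI/ReportTelemetry" (hypothetical name).
		parts := strings.Split(strings.TrimPrefix(full, "/"), "/")
		if len(parts) == 2 {
			q.Labels = map[string]string{
				"grpc_service": labelValue(parts[0]),
				"grpc_method":  labelValue(parts[1]),
			}
		}
	}
	return q.Read(ctx)
}

// labelValue squeezes a string into BigQuery's label constraints
// (lowercase letters, digits, dashes, and underscores).
func labelValue(s string) string {
	s = strings.ToLower(s)
	return strings.Map(func(r rune) rune {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-' || r == '_' {
			return r
		}
		return '_'
	}, s)
}
```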
Summary
Over the course of 6-7 months we did a piecemeal rewrite of our entire telemetry stack. The foundations were laid as necessary as part of other development work and delivered incremental value along the way. We were successful in lowering our costs, increasing reliability, and improving observability for both query performance and cost.
We now have a telemetry system that continues to deliver two-second power flow data and will scale as we exponentially grow our fleet of distributed grid-scale batteries.
About the Author
Andrew Hitchcock works on the fleet team which is responsible for managing our fleet of batteries and the software that runs on-device at members’ homes. Before Base he launched Amazon EMR, worked as an SRE at Google, and built BigQuery’s realtime storage optimization system. In his free time he larps as a rancher and enjoys spending time with his two longhorns.
If you want to work on projects like this from start to finish, join our team by clicking the button below.