“Our dependency tree keeps getting bigger, and each dependency is emitting more logs. The extent to which those logs don’t follow a common schema radically impacts their usability.”

Distinguished engineer John Hart of the Event DB team at DataSet joins Lee Atchison to talk about observability trends, machine learning, dogfooding DataSet, and more.

Listen to the full podcast episode here on Software Engineering Daily. Read on for a Q&A summary and highlights of the conversation:

Q: What does DataSet do?

JH: DataSet is a unified source for server observability: traces, logs, and metrics. There are many solutions that address each of these in isolation, but DataSet brings them all together. The ability to click from a metric-based “latency is high” alert to the tracing spans that show those operations in context, all the way to the individual line-level application logs, without having to leave the tool, is super important. We sometimes refer to this as “MTTWH” - mean time to “what the heck?” (although sometimes we use a different final character...) Having to switch tools for different levels of detail is needless friction.

You can go the other direction as well - from looking at an application log that contains a numeric value, it’s just one click to chart that over time and another click to create an alert or dashboard based on it.
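In DataSet that pivot is a couple of clicks in the UI. As a rough, tool-agnostic sketch of the underlying idea - turning a numeric field buried in log lines into a time series you can chart or alert on - here is a minimal Python example. The log format and the latency_ms field are invented for this sketch, not DataSet's actual parser.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical application log lines: an ISO timestamp plus a numeric
# latency_ms field (the field name and format are invented for this sketch).
logs = [
    "2024-05-01T12:00:03Z request served latency_ms=120",
    "2024-05-01T12:00:41Z request served latency_ms=95",
    "2024-05-01T12:01:07Z request served latency_ms=480",
]

pattern = re.compile(r"^(?P<ts>\S+) .*latency_ms=(?P<val>\d+)")

# Bucket the numeric field into one-minute averages - the kind of series
# you would then chart over time or attach an alert threshold to.
buckets = defaultdict(list)
for line in logs:
    match = pattern.match(line)
    if not match:
        continue
    ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%SZ")
    buckets[ts.replace(second=0)].append(int(match["val"]))

for minute, values in sorted(buckets.items()):
    print(minute.isoformat(), sum(values) / len(values))
```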

Q: Sounds like a lot of data.

JH: For sure, and that’s not trending down anytime soon. Kubernetes control-plane data is verbose just by itself, and that’s before you get to the actual workload logs that you care about. DataSet’s architecture is fairly unusual as far as I know - we avoid global indexes that would write-amplify our data and we separate compute from storage. Competing solutions that are based in the Solr/Lucene/Elastic document-indexing world depend on locally-attached storage, and therefore must scale their compute linearly with data. In other words, moving from 1 month to 1 year of storage would mean lighting up 12x the amount of compute (each node with its own locally-attached storage).
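To make that scaling argument concrete, here is a back-of-the-envelope sketch with entirely made-up numbers (daily volume, TB per node). It is only meant to show why compute tied to locally-attached storage grows with retention, while decoupled compute tracks daily volume; it is not a real cost model for any product.

```python
# Back-of-the-envelope sketch with made-up numbers (daily volume, TB per
# node) of the scaling argument above - not a real cost model.
daily_ingest_tb = 5            # hypothetical daily log volume
retention_before_days = 30     # 1 month of retention
retention_after_days = 365     # 1 year of retention

def coupled_compute_nodes(retention_days, tb_per_node=10):
    # Index-on-local-disk model: every retained TB needs a node that
    # both stores it and serves queries over it.
    return daily_ingest_tb * retention_days / tb_per_node

before = coupled_compute_nodes(retention_before_days)
after = coupled_compute_nodes(retention_after_days)
print(f"coupled model: {before:.0f} -> {after:.0f} nodes "
      f"({after / before:.1f}x the compute for 12x the retention)")

# Decoupled model: query/ingest compute is sized for daily volume, while
# older data sits in cheap object storage until it is actually queried.
print(f"decoupled model: compute stays sized for {daily_ingest_tb} TB/day either way")
```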

DataSet separates compute from storage, so most of our cost is determined by daily volume rather than total data stored. This enables some cool features like pay-per-query for historical data, which lets customers leave their data in our system at very low cost for as long as they’d like. Because our at-rest format is columnar, we get great compression ratios, and it can actually be cheaper to leave data in DataSet than to keep it directly in your own storage system in standard row-major format. Plus you get the ergonomics of the entire tool, without having to manage hot/cold/glacier storage migrations.
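As a toy illustration of why a columnar at-rest layout compresses well - this is not DataSet's actual on-disk format - the snippet below compresses the same synthetic records both row-major and column-major. On data like this the columnar layout typically comes out markedly smaller, because field names are stored once and similar values sit next to each other.

```python
import json
import random
import zlib

random.seed(0)

# Toy illustration (not DataSet's actual on-disk format): the same synthetic
# records laid out row-major (one JSON object per record) versus column-major
# (all values of a field stored together), then compressed with zlib.
records = [
    {
        "status": random.choice([200, 200, 200, 404, 500]),
        "path": random.choice(["/api/items", "/api/users", "/health"]),
        "latency_ms": random.randint(5, 500),
    }
    for _ in range(10_000)
]

row_major = "\n".join(json.dumps(r) for r in records).encode()
column_major = json.dumps(
    {key: [r[key] for r in records] for key in records[0]}
).encode()

print("row-major compressed bytes:   ", len(zlib.compress(row_major)))
print("column-major compressed bytes:", len(zlib.compress(column_major)))
```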

Q: How has the log analytics space changed most dramatically?

JH: With ever-increasing volume comes a need for a standardized view of your data, so I’d give a shout-out to OpenTelemetry (and its predecessors) as maybe the biggest revolution in the observability space over the past decade. Everyone’s dependency trees keep getting bigger, and each dependency is emitting more logs. The extent to which those logs don’t follow a common schema radically impacts their usability. The standardized approach of OpenTelemetry really helps engineers operate third-party systems reliably without having to become experts in those systems’ logfile formats.
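For readers who haven't used it, here is a minimal OpenTelemetry sketch (Python SDK; it assumes the opentelemetry-api and opentelemetry-sdk packages are installed) of what that common schema looks like in practice: every service records the same well-known attribute names from the semantic conventions instead of its own ad-hoc field names. The service name and attribute values are invented for illustration, and the example says nothing about how any particular vendor ingests the data.

```python
# Minimal OpenTelemetry sketch (requires the opentelemetry-api and
# opentelemetry-sdk packages): record a span whose attributes use the shared
# semantic-convention names rather than ad-hoc, per-service field names.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

with tracer.start_as_current_span("GET /api/items") as span:
    # Well-known attribute keys from the OpenTelemetry semantic conventions.
    span.set_attribute("http.request.method", "GET")
    span.set_attribute("http.response.status_code", 200)
    span.set_attribute("server.address", "api.example.com")
```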

Q: Can you use machine learning on logs to find larger or more time-consuming patterns?

JH: For sure, and we’re putting a lot of effort into this.  DataSet recently added anomaly detection, so the system can detect spikes/gaps without needing manual thresholds.  I think we’ve all been through the cycle of creating an alert and then tuning it over days/weeks to eliminate false positives while still flagging actual problems … it’s annoying, time-consuming, and can be difficult to get right for data with built-in seasonality, diurnal/nocturnal patterns, etc.  This is the type of thing that ML excels at, so it’s nice to offload that problem to the computer.
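DataSet's own detection logic isn't published here, but as a toy sketch of why a learned or statistical baseline beats a fixed threshold on seasonal data, the snippet below scores each point against past observations from the same hour of day. The traffic shape, noise level, and injected spike are all made up.

```python
import math
import random

random.seed(1)

# Toy sketch (not DataSet's algorithm): hourly request counts with a daily
# cycle, plus one injected spike near the end of the series.
HOURS = 24 * 14  # two weeks of hourly data
series = [
    1000 + 800 * math.sin(2 * math.pi * (h % 24) / 24) + random.gauss(0, 40)
    for h in range(HOURS)
]
series[-5] += 600  # the anomaly we want to catch

def seasonal_zscore(series, hour, period=24):
    # Compare this point against past observations from the same hour of day,
    # so the daily cycle itself never looks anomalous.
    history = [series[i] for i in range(hour % period, hour, period)]
    mean = sum(history) / len(history)
    std = math.sqrt(sum((x - mean) ** 2 for x in history) / len(history)) or 1.0
    return (series[hour] - mean) / std

# A single fixed threshold would either fire on every afternoon peak or sleep
# through a night-time spike; the seasonal baseline flags only the outlier.
for h in range(HOURS - 24, HOURS):
    z = seasonal_zscore(series, h)
    if abs(z) > 4:
        print(f"hour {h}: {series[h]:.0f} is {z:.1f} sigma from its seasonal baseline")
```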

Q: How is your team dogfooding the product at DataSet?

JH: That might be my favorite part of this job - we use DataSet constantly in the development and operation of DataSet. Any new feature we are developing is a feature we ourselves benefit from. That’s about as tight a feedback loop as you can get between your code and your tools, unless you’re coding a text editor.

DataSet is hosted and multitenant, typically with just one cluster per geographic region.  In our largest cluster, this means a single query will fan out to tens of thousands of CPU cores, all acting in concert to search TBs or PBs of data as quickly as possible.  To run clusters of this size we absolutely depend on DataSet to monitor them, for steady-state operation (alerts and dashboards) as well as ad-hoc queries, debugging, reasoning about the system … we couldn’t do it without DataSet.
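The details of DataSet's engine aren't public, but the general shape of a query fan-out like that is the classic scatter-gather pattern. The sketch below is a drastically simplified, single-process illustration with made-up shard counts and data, not the real implementation: push the same predicate to every shard, evaluate in parallel, then merge the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Drastically simplified scatter-gather sketch: the same predicate is pushed
# to every shard, evaluated in parallel, and the partial results are merged.
# Shard counts and contents are made up for illustration.
SHARDS = [
    [f"shard{s} request {i} status={500 if i % 50 == 0 else 200}"
     for i in range(1_000)]
    for s in range(64)
]

def search_shard(shard, needle):
    # Each worker scans only its own slice of the data.
    return [line for line in shard if needle in line]

def scatter_gather(needle):
    with ThreadPoolExecutor(max_workers=16) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, needle), SHARDS))
    # Gather step: merge per-shard matches into one result set.
    return [line for partial in partials for line in partial]

matches = scatter_gather("status=500")
print(f"{len(matches)} matching lines across {len(SHARDS)} shards")
```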

Fun Fact: John was one of DataSet’s first ten customers (back when it was called Scalyr).  He came across this blog post written by Scalyr’s founder, Steve Newman, and after trying Scalyr found it much preferable to Splunk. John liked the product so much he became an early employee of DataSet and now runs the database team, which powers DataSet as well as SentinelOne’s security products.

Get Started with DataSet for Free

DataSet is a modern log analytics platform that helps DevOps, IT engineering, and security teams get answers from their data across all time periods, both live streaming and historical. It’s powered by a unique architecture that uses a massively parallel query engine to provide actionable insights from the data available.

Get started with our 30-day free trial here.