AtlasDB is a massively scalable datastore and transactional layer that
can be placed on top of any key-value store to give it ACID properties.
This is the first of several blog posts that will introduce AtlasDB and
describe how we built it at Palantir.
Building AtlasDB: the Inspiration
In 2010, Daniel Peng and Frank Dabek of Google published a paper entitled
“Large-scale Incremental Processing Using Distributed Transactions and
Notifications”. The
paper describes a system in use at Google named Percolator, which sits
on top of BigTable,
Google’s distributed key-value store.
Google needed transactions with
ACID properties. They also needed
a highly fault-tolerant system to deal with the potential failure of key
parts of their system under load at massive scales. This drove them to
push the accounting data for the transactions into BigTable as well, as
BigTable already handled both replication of its data and fault
tolerance.
Unfortunately, due to the number of writes involved in the transaction
accounting, they saw an approximately 75% performance hit when using
transactions. Percolator was built to enable incremental updates to the
search indexes for the Google search engine, which was previously a
periodic batch process. In this case, the extra reliability afforded by
using BigTable to track the transactions was the important factor; even
though there was a significant performance hit (over using raw
BigTable), the overall performance was high enough to meet its design
criteria.
Meanwhile, at Palantir, we were hitting a similar obstacle. The
interactive analytic core of Palantir Gotham, which was originally built
with a traditional RDBMS as its backing store, was hitting the limits of
economical and easy scaling. We needed to move to a distributed backing
store for scalability, but we needed transactions to enable our
higher-level Revisioning Database to work correctly.
Percolator presented interesting possibilities, but with a 75%
performance hit, the latency would be too long for our users. We shoot
for a maximum ten-second wait time when doing interactive
analysis—anything longer is an unacceptable interruption to our users’
investigative flow. Studying the Percolator protocol, our engineers saw
some places where design constraints could be relaxed to lower the
latency of each operation.
And so, the idea for AtlasDB was born. Now it was just a matter of
building it.
Designing AtlasDB
Understanding Data and Development at Palantir
We take a human-centered design approach to building software. Instead
of asking what technological problem we want to attack in isolation, we
ask, “What features and infrastructure would a human analyst need to do
his or her work?” To answer this question, we use a holistic
understanding of how low-level data integration, scalable data servers,
API layers, and an entire suite of user interface tools, when properly
integrated, create an efficient, frictionless user experience for
non-technical subject matter experts working with large-scale, disparate
data to solve real-world problems. It’s an over-arching user experience
(UX) play that decomposes into a lot of hard technical problems—similar
to building something as complex and seamless as an iPhone.
When components already exist that serve our needs, we are happy to use
them—our products use several high-quality open source datastores,
map-reduce frameworks, and search engines. But we build new
things whenever we identify a capability gap. Some examples of this in
the past:
- The Revisioning Database, the heart of Palantir Gotham, which enables
  Nexus Peering
- Nexus Peering, a technology that allows a single Palantir Gotham
  instance to be distributed, or for multiple instances to securely and
  safely exchange data
- The Maps application, which allows the viewing of geospatial imagery
  as well as the visualization and analysis of objects with associated
  coordinates
- Horizon, an in-memory database that drives querying over billions of
  objects in interactive time, used to back the Object Explorer
  application.
Scaling, Round 1: Federation and Palantir Phoenix
In 2005, when we first started building Palantir Gotham, there wasn’t
really a viable alternative to the RDBMS. The Revisioning Database, the
Palantir Gotham persistent datastore, was originally an implementation
of a special schema inside a SQL database. The SQL RDBMS performed well
for our users up to about a terabyte of data. But as the size and
scope of Palantir Gotham-based analytic workflows pushed the database to
its limits, there were only two available options if we stuck with an
RDBMS:
- Get a larger, more powerful computer. This works, but the price of
computer hardware and advanced database software needed to support
that scale grows super-linearly (sometimes exponentially) with the
size of the data, making this approach really expensive, really
fast.
- Move to multiple computers and a sharded database architecture.
  While this can work well for certain database schemas, ours is not
  well-suited to this approach. Sharding can also add a lot of
  complexity to the application using it, leading to a more bug-prone
  and fragile code base.
We didn’t like either of these options, so we began considering
non-RDBMS-based solutions. We started with a federated approach that let
us address much larger scales of source data without scaling the core.
We developed Palantir Phoenix, a petabyte-scale datastore that can run
map-reduce and other batch-oriented processes to filter and classify
data that needs to be reviewed by human analysts. By federating search
and storage of massive-scale data out to Palantir Phoenix, and importing
relevant results into Palantir Gotham on the fly, we could guarantee
analysts would still have all the data they need at their fingertips
without storing everything in the Palantir Gotham Revisioning Database.
For example, a cyber security workflow may include network flow data,
proxy logs, malware scanning results, inventory data, and
authentication and VPN logs. The vast majority of the data in these
data sets is not of interest to cyber security analysts—it
overwhelmingly represents legitimate traffic. But when something bad
happens, such as malware being detected on a user’s laptop, analysts
can pull the relevant and related narrow slices of data from Phoenix
into Palantir Gotham to determine the extent and severity of the
intrusion.
Using our Dynamic Ontology, data is mapped into the Palantir object
model and managed in three separate subsystems:
- a search server for fast full-text and wildcard searching;
- a Horizon server for top-down filtering and comparisons of large
sets of objects;
- the Revisioning Database for tracking changes to the data and
  allowing analysts to work independently while also sharing analytic
  insights with each other. (This is also where the metadata that
  enables Palantir’s privacy and security controls is stored.)
While the size of the data that ends up in Palantir Gotham can be much
smaller than the total data size of an instance, it can still get pretty
big. Moreover, it doesn’t help that all of the housekeeping Palantir
Gotham does around the source data (e.g., the revisions and
security information) requires us to store 2-5x the size of the
initial imported source data.
Scaling, Round 2: NoSQL K/V stores
It soon became clear that we were going to need to replace the datastore
for the Revisioning Database. An obvious place to look for economical
scalability was the class of datastores dubbed ‘NoSQL’.
NoSQL datastores use collections of commodity-class machines working in
concert, enabling engineers to build a distributed system capable of
scaling up to large data volumes and high performance with a smooth
price curve—add more machines, get more capacity. When we first built
the Revisioning Database in 2005, NoSQL systems offered intriguing
potential but were still plagued by performance and scale problems
and, most importantly, by data loss and corruption. In the intervening
years, these early problems have largely been engineered away.
Today, these systems underlie much of the modern web and are developed
and used by companies like Google, Facebook, and Amazon.
Many of these use a key-value (K/V) model, wherein a short, unique
identifier (the key) is used to access an individual storage cell (the
value). The storage cell may hold a simple value or a larger, complex
data structure.
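In code, the model is about as small as a storage interface can be. The interface below is a minimal sketch for illustration only, not any particular product's API; real stores typically use byte arrays and richer cell addressing, but strings keep the example readable:

```java
// Minimal sketch of the key-value model: a unique key addresses one storage cell.
// (Hypothetical interface for illustration; not AtlasDB's or any vendor's API.)
public interface KeyValueStore {
    String get(String key);             // read the cell addressed by key, or null if absent
    void put(String key, String value); // write (or overwrite) the cell addressed by key
    void delete(String key);            // remove the cell
}
```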
While NoSQL stores have great scaling properties, they don’t make
strong guarantees about the consistency of the system. Since each node
of the system runs independently and any given value could live on any
node, it’s impossible to know whether a read spanning more than one
node is consistent, i.e., whether the values came from the same write.
For many uses (e.g., updating fields on
a social network profile page), this property (called eventual
consistency) is not
a problem.
Unfortunately, for a system like Palantir Gotham, we’re not just storing
individual values but sets of related values that need to be read and
written consistently, like many index entries along with a primary
value. Without consistent reads, any operation that uses values from
multiple keys can never be guaranteed to be correct.
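To make the problem concrete, here is a sketch (reusing the hypothetical `KeyValueStore` interface above) of a primary record and an index entry written as two independent puts; with only eventual consistency, a reader that follows the index can observe one write without the other:

```java
// Hypothetical example: a primary record plus a name-index entry, written as two
// independent puts to two different keys (and possibly two different nodes).
final class IndexExample {
    void writePersonWithIndex(KeyValueStore kvs) {
        kvs.put("person/42", "{\"name\": \"Smith\"}");  // primary value
        kvs.put("index/name/smith/42", "person/42");    // index entry pointing at it
    }

    // A concurrent reader may see the index entry before the primary value is
    // visible (or see a stale version of one of them), so this two-key read is not
    // guaranteed to reflect a single consistent state.
    String readPersonByName(KeyValueStore kvs) {
        String primaryKey = kvs.get("index/name/smith/42");
        return (primaryKey != null) ? kvs.get(primaryKey) : null;
    }
}
```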
Fortunately, implementing transactions can solve this problem by
providing four guarantees, referred to as ACID:
- Atomicity - every distinct operation of a transaction succeeds
or the state is rolled back as if the transaction never happened;
there is no way to partially complete the update of multiple fields
- Consistency - data is in a consistent state at the beginning and
at the end of a transaction
- Isolation - the work taking place inside a transaction is
invisible to any other operation taking place, so two transactions
can be run concurrently without interfering
- Durability - once a transaction commits successfully, the data
must have been written to non-volatile storage in such a way that it
won’t be lost in the event of a crash or power failure
Aside from the formal guarantees provided by transactions, there is a
very practical consideration: without transactions, programmers have to
reason very carefully about consistency and write a lot of code to try
to manage it. As a result, development proceeds slowly and the code is
much more fragile to future changes. Pushing the consistency (and by
extension, correctness) logic down into a transaction layer is usually
the right separation of concerns.
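As a rough illustration of what that layer looks like from the application programmer's point of view, here is a hedged sketch of the earlier index-plus-primary write wrapped in a transaction. The `Transaction` and `TransactionManager` interfaces and their method names are hypothetical, not AtlasDB's actual API:

```java
import java.util.function.Function;

// Hypothetical transactional layer over the key-value store: either both writes
// become visible together, or neither does, and readers never observe a
// half-updated index. Names are illustrative only.
interface Transaction {
    String get(String key);
    void put(String key, String value);
}

interface TransactionManager {
    // Runs the task, committing on success and retrying or rolling back on conflict.
    <T> T runWithRetry(Function<Transaction, T> task);
}

final class TransactionalIndexExample {
    // The two writes from the earlier example now succeed or fail as a unit.
    void writePersonWithIndex(TransactionManager txm) {
        txm.runWithRetry(tx -> {
            tx.put("person/42", "{\"name\": \"Smith\"}");
            tx.put("index/name/smith/42", "person/42");
            return null;
        });
    }
}
```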
Setting the Stage for AtlasDB: A Transactional API for Key-Value Stores
![AtlasDB Architecture]()
The design of AtlasDB departs from Percolator in a few key respects. By
taking key locking out of the main datastore and moving it into a
dedicated lock server, we reduced the write overhead and increased
performance. Further gains came from allowing the transaction
accounting table to live in a datastore separate from the main scalable
datastore. This decoupling lets the transaction data live in a system
that trades some scalability for higher write performance. Since the
transaction accounting data is quite compact, this is a huge win for
the performance of the overall system. (We’ll cover the protocol and
architecture in depth in a later post.)
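In a hedged sketch (the class and interface names are illustrative, not AtlasDB's actual API), the decoupling looks something like the following: the main scalable store holds the table data, a smaller but write-fast store holds the compact transaction accounting table, and locking lives in its own service rather than as extra writes in the datastore.

```java
// Hypothetical wiring of the decoupled pieces; the commit protocol itself is out of
// scope here (and covered in the follow-up post).
interface LockService {
    // Acquire a lock on a key; closing the handle releases it.
    AutoCloseable lock(String key) throws InterruptedException;
}

final class TransactionLayer {
    private final KeyValueStore dataStore;        // main scalable store for table data
    private final KeyValueStore transactionStore; // small, write-fast store for commit records
    private final LockService lockService;        // dedicated lock server, outside the datastore

    TransactionLayer(KeyValueStore dataStore,
                     KeyValueStore transactionStore,
                     LockService lockService) {
        this.dataStore = dataStore;
        this.transactionStore = transactionStore;
        this.lockService = lockService;
    }
}
```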
The NoSQL-transaction revolution still required a few more developments
to make the burgeoning AtlasDB as engineer- and user-friendly as
possible. Along with the core changes to the transaction protocol and
system architecture, our team set about designing a system that could be
used with almost any key-value store. Rather than being tied to a
particular key-value store, we decided to build an API layer that
exposed transaction primitives. The API layer (effectively just a code
library), along with a few lightweight network services, created a
system that could be applied to any key-value store. Writing the
driver for a new key-value store thus became a one-day task
comprising, at most, a few hundred lines of code.
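To give a feel for how small such a driver is, here is a sketch of the degenerate in-memory case against the hypothetical `KeyValueStore` interface from earlier; a driver backed by a distributed store like Cassandra or HBase would have the same shape, just delegating to that store's client library:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch of a key-value "driver": implement the interface the transactional
// layer codes against, mapping each call onto the underlying store's native
// operations. This in-memory version is useful for tests and single-laptop use.
public final class InMemoryKeyValueStore implements KeyValueStore {
    private final ConcurrentHashMap<String, String> cells = new ConcurrentHashMap<>();

    @Override
    public String get(String key) {
        return cells.get(key);
    }

    @Override
    public void put(String key, String value) {
        cells.put(key, value);
    }

    @Override
    public void delete(String key) {
        cells.remove(key);
    }
}
```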
Decoupling AtlasDB from any particular key-value store is a good idea
for a few reasons:
- Deployment flexibility - An application built on top of AtlasDB
would always see the same API, allowing the key-value store to be
switched out for different levels of scale. Palantir Gotham needs to
run at scales as small as a single laptop, both as a development
environment for creating customizations and enhancements and in
order to support disconnected, self-contained operation for
environments that don’t have any networking infrastructure. It also
needs to operate at petabyte scale for large production systems.
Key-value stores that run at massive scale are usually somewhat
complex to set up and administer; the easy-to-set-up key-value stores
don’t scale.
- Pace of change in NoSQL - Though much more advanced than it was
a decade ago, the NoSQL world has not yet reached full maturity. The
capabilities, scale, and performance of different offerings are still
rapidly evolving. Keeping AtlasDB free from committing to a single
key-value store is essential, given the uncertainty around what the
best options will be even a year or two from now.
- Consistent API - The bulk of the team’s work in completing
AtlasDB was not building the transactional layer itself, but in
porting our extensive, existing data fusion platform over from SQL
to a key-value model of data storage. By abstracting the API into
the AtlasDB transactional layer, we avoid having to port the entire
product to yet another API in the future—switching datastores is as
easy as writing a new K/V driver.
AtlasDB today
AtlasDB is now the default datastore inside of new production instances
of Palantir Gotham and is already being re-purposed for other use cases
that need both the scalability and consistency that it offers. Stay
tuned for part two of the AtlasDB series, where we’ll do a deep dive
into the system architecture and transaction protocol that make it go.