5 Common Types of Data Stores
Choosing where to store your data is an essential part of any technical design. Does your data conform to a schema, or is it structurally flexible? Does it have to stick around forever, or is it temporary?
In this article, we will define five common types of data stores and their attributes. We hope this overview of the various data storage options helps you make the best possible technical design choices.
Relational Database
Databases are the original data store. We began to need to store data when we stopped treating machines as glorified calculators and started using them to meet business needs. And so in 1963 we (and by we, I mean Charles Bachman) developed the first database management system. By the mid to late 1970s, these database management systems had evolved into the Relational Database Management Systems (RDBMSs) we know and love today.
A relational database, or RDB, is a database that is organized into tables using the relational model of data. Each table has a schema that defines the columns for that table. The rows of the table, which each represent an actual record of information, must conform to the schema by having a value (or a NULL value) for each column.
Each row in the table has its own unique key, called the primary key. This is typically an integer column called "ID." A row in a different table can reference the ID of a row in this table, thus creating a relationship between the two tables. When a column in one table refers to the primary key of a row in another table, we call that column a foreign key. Using this idea of primary keys and foreign keys, we can describe incredibly complex data relationships with incredibly simple concepts.
The industry standard language for communicating with relational databases is SQL, which stands for Structured Query Language.
We use MySQL as our RDBMS at Shopify. MySQL is robust, reliable, and durable.
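As a minimal sketch of primary and foreign keys, the following uses Python's built-in sqlite3 module with an in-memory database. SQLite stands in for MySQL here, and the shops and products tables, their columns, and the helper function names are hypothetical, chosen only for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related (hypothetical) tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE shops (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        """CREATE TABLE products (
               id INTEGER PRIMARY KEY,
               shop_id INTEGER NOT NULL REFERENCES shops(id),  -- foreign key
               title TEXT NOT NULL,
               description TEXT                                -- may be NULL
           )"""
    )
    conn.execute("INSERT INTO shops (id, name) VALUES (1, 'Cat Emporium')")
    conn.executemany(
        "INSERT INTO products (shop_id, title, description) VALUES (?, ?, ?)",
        [(1, "Scratching post", None), (1, "Catnip mouse", "A toy for cats")],
    )
    return conn


def products_for_shop(conn, shop_name):
    """Follow the foreign key from products back to shops with a JOIN."""
    rows = conn.execute(
        """SELECT products.title FROM products
           JOIN shops ON products.shop_id = shops.id
           WHERE shops.name = ?
           ORDER BY products.id""",
        (shop_name,),
    )
    return [title for (title,) in rows]


conn = build_demo_db()
print(products_for_shop(conn, "Cat Emporium"))  # ['Scratching post', 'Catnip mouse']
```

With foreign keys enforced, inserting a product whose shop_id does not match any row in shops raises sqlite3.IntegrityError, which is the kind of schema-level guarantee a relational database gives you.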
When to Use a Relational Database
Use a database to store your business-critical information. Databases are the most durable and reliable type of data store. Anything you need to store reliably should go into a database.
Relational databases are typically the most mature databases: they have been able to withstand the test of time and remain an industry standard tool for reliable storage of key data.
It is possible that your data does not conform nicely to a relational schema, or that your schema changes so often that the rigid structure of a relational database slows down your development. In that case, you might consider using a non-relational database instead.
Non-Relational (NoSQL) Database
Over the years, computer scientists have done such a good job of developing databases to be accessible and secure that we have started to want to use them for non-relational data as well: information that doesn't strictly conform to a schema, or that has such a variable structure that attempting to represent it in relational form would be a huge pain.
These non-relational databases are also called "NoSQL" databases. They have roughly the same characteristics as SQL databases (durable, resilient, persistent, replicated, distributed, and performant), with the big difference that they do not enforce schemas (or enforce only very loose ones).
Document Store
A document store is basically a fancy key-value store where the key is often omitted and never used (one is assigned under the hood, but we usually don't care about it). The values are blobs of semi-structured data such as JSON or XML, and we treat the data store as if it were just one big array of these blobs. The document store's query language then lets you filter or sort based on the content inside those document blobs.
MongoDB is a popular document store you may have heard of.
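To make the "big array of blobs" idea concrete, here is a toy in-memory sketch. It is not how MongoDB or any real document store is implemented; the TinyDocumentStore class, its methods, and the sample documents are all hypothetical.

```python
import json


class TinyDocumentStore:
    """Toy document store: JSON blobs with auto-assigned keys (illustrative only)."""

    def __init__(self):
        self._docs = {}    # key -> JSON string (the "blob")
        self._next_id = 1  # keys are assigned under the hood

    def insert(self, document):
        key = self._next_id
        self._next_id += 1
        self._docs[key] = json.dumps(document)
        return key

    def find(self, predicate, sort_key=None):
        """Filter on the content inside each blob, optionally sorting the matches."""
        docs = [json.loads(blob) for blob in self._docs.values()]
        matches = [doc for doc in docs if predicate(doc)]
        if sort_key is not None:
            matches.sort(key=sort_key)
        return matches


store = TinyDocumentStore()
# No schema: each document can have a different shape.
store.insert({"type": "product", "title": "Dog leash", "price": 12.0})
store.insert({"type": "product", "title": "Catnip mouse", "price": 4.5, "tags": ["cat", "toy"]})
store.insert({"type": "review", "stars": 5, "body": "My cat loves it"})

cheap_products = store.find(
    lambda doc: doc.get("type") == "product" and doc.get("price", 0) < 20,
    sort_key=lambda doc: doc["price"],
)
print([doc["title"] for doc in cheap_products])  # ['Catnip mouse', 'Dog leash']
```

A real document store would index fields rather than parse every blob on each query; the sketch only shows the data model.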
Wide Column Store
A wide column store is somewhere in between a document store and a relational database. It still uses tables, rows, and columns like a relational DB, but the names and formats of the columns can vary between rows in the same table. This strategy combines the strict table structure of a relational database with the flexible content of a document store.
Cassandra and Bigtable are popular wide column stores you may have heard of.
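The following toy sketch shows the data model only: rows live in one table and are addressed by a row key, but each row carries its own set of columns. The TinyWideColumnTable class and the sample event rows are hypothetical, and the sketch ignores everything that makes Cassandra or Bigtable scale (column families, partitioning, on-disk layout).

```python
from collections import defaultdict


class TinyWideColumnTable:
    """Toy wide column table: row key -> {column name: value} (illustrative only)."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Any row may define any column; there is no table-wide column list.
        self._rows[row_key][column] = value

    def get_row(self, row_key):
        return dict(self._rows.get(row_key, {}))

    def columns(self, row_key):
        return sorted(self._rows.get(row_key, {}))


events = TinyWideColumnTable()
events.put("event#1", "shop_id", 42)
events.put("event#1", "type", "page_view")
events.put("event#2", "shop_id", 42)
events.put("event#2", "type", "checkout")
events.put("event#2", "cart_total", 19.99)  # this column exists only on this row

print(events.columns("event#1"))  # ['shop_id', 'type']
print(events.columns("event#2"))  # ['cart_total', 'shop_id', 'type']
```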
At Shopify, we use Bigtable as a sink for some streaming events. Other NoSQL data stores aren't widely used here. We find that most of our data can be modelled relationally, so we usually stick to SQL databases.
When to use a NoSQL Database
Non-relational databases are best suited to handling large volumes of data and/or unstructured data. They are extremely popular in the big data world because writes are fast. NoSQL databases do not enforce complex cross-table schemas, so writes are unlikely to be a bottleneck in a NoSQL environment.
Non-relational databases give developers a great deal of versatility, so they are often popular with early-stage startups or greenfield projects where the exact specifications are still not clear.
Key-Value Store
Another way to store non-relational data is in a key-value store.
A key-value store is basically a production-scale hashmap: a map from keys to values. There are no fancy schemas or data relationships. There are no tables or other logical groupings of data. It is just keys and values.
We use two key-value stores at Shopify: Redis and Memcached.
Both Redis and Memcached are in-memory key-value stores, so their performance is top-notch. Since they are in memory, they (necessarily) support configurable eviction policies. We will eventually run out of memory for storing keys and values, so we'll have to delete some.
The most popular strategies are Least Recently Used (LRU) and Least Frequently Used (LFU). These eviction policies make key-value stores an easy and natural way to implement a cache.
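As a rough illustration of LRU eviction, here is a minimal sketch built on Python's collections.OrderedDict. The LRUCache class is hypothetical and limits the cache by number of keys; real key-value stores limit by memory and may only approximate LRU, so treat this as a picture of the idea rather than of any particular implementation.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache that evicts the least recently used key when it is full."""

    def __init__(self, max_keys):
        if max_keys < 1:
            raise ValueError("max_keys must be at least 1")
        self.max_keys = max_keys
        self._data = OrderedDict()  # oldest access first, newest access last

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # reading a key counts as using it
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_keys:
            self._data.popitem(last=False)  # drop the least recently used key


cache = LRUCache(max_keys=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" is now more recently used than "b"
cache.set("c", 3)  # cache is full, so "b" is evicted
print(cache.get("a"), cache.get("b"), cache.get("c"))  # 1 None 3
```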
One major difference between Redis and Memcached is that Redis supports some data structures as values. You can declare that a value in Redis is a list, set, queue, hash map, or even a HyperLogLog, and then perform operations on those structures. With Memcached, everything is just a blob and if you want to perform any operations on those blobs, you have to do it yourself and then write it back to the key again.
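The difference can be sketched with two toy classes. Neither is a real client: BlobStore imitates a store that only understands opaque bytes, StructuredStore imitates a store that understands list values, and all names here are hypothetical.

```python
import json


class BlobStore:
    """Values are opaque bytes; the store cannot look inside them."""

    def __init__(self):
        self._data = {}

    def set(self, key, blob):
        self._data[key] = bytes(blob)

    def get(self, key):
        return self._data.get(key)


def append_via_blob(store, key, item):
    """Read the blob, decode it, modify it, and write the whole thing back."""
    raw = store.get(key)
    items = json.loads(raw) if raw is not None else []
    items.append(item)
    store.set(key, json.dumps(items).encode("utf-8"))


class StructuredStore:
    """The store knows the value is a list, so it can append on our behalf."""

    def __init__(self):
        self._lists = {}

    def push(self, key, item):
        self._lists.setdefault(key, []).append(item)
        return len(self._lists[key])

    def items(self, key):
        return list(self._lists.get(key, []))


blobs = BlobStore()
append_via_blob(blobs, "recent_orders", 1001)
append_via_blob(blobs, "recent_orders", 1002)

structured = StructuredStore()
structured.push("recent_orders", 1001)
structured.push("recent_orders", 1002)

print(json.loads(blobs.get("recent_orders")))  # [1001, 1002]
print(structured.items("recent_orders"))       # [1001, 1002]
```

Both approaches end up with the same data, but the blob version needs a full read-modify-write cycle in the client for every change.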
Redis can also be configured to persist to disk, which Memcached cannot. Redis is therefore a better choice for storing persistent data, while Memcached is suitable only for caches.
When to use a Key-Value Store
Key-value stores are good for simple applications that need to temporarily store simple objects. The obvious example is a cache. A less obvious example is using Redis lists to queue units of work with simple input parameters.
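The work-queue pattern can be sketched without a running server by letting a collections.deque stand in for a single list-valued key. The enqueue and dequeue helpers and the job names below are hypothetical; with a real Redis list, the same shape would be a push on one end of the list and a pop from the other.

```python
import json
from collections import deque

# Stand-in for one list-valued key in a key-value store.
work_queue = deque()


def enqueue(queue, job_name, **params):
    """Serialize a unit of work with simple parameters and add it to the tail."""
    queue.append(json.dumps({"job": job_name, "params": params}))


def dequeue(queue):
    """Take the oldest unit of work from the head, or None if the queue is empty."""
    if not queue:
        return None
    return json.loads(queue.popleft())


enqueue(work_queue, "send_receipt", order_id=1001)
enqueue(work_queue, "resize_image", image_id=7, width=640)

first = dequeue(work_queue)
print(first["job"], first["params"])  # send_receipt {'order_id': 1001}
```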
Full-Text Search Engine
Search engines are a particular type of data store designed for a specific use case: searching text-based documents.
Technically, search engines are NoSQL data stores. You ship semi-structured document blobs into them, but rather than storing the blobs as-is and using XML or JSON parsers to extract information, the search engine slices and dices the document contents into a new format that is optimized for searching based on substrings of long text fields.
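One common search-optimized format is an inverted index, which maps each token to the set of documents containing it. The sketch below is a toy version under simplifying assumptions: it splits text into lowercase words and matches whole words only, whereas real engines add stemming, relevance scoring, and more. The TinySearchIndex class and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinySearchIndex:
    """Toy inverted index: token -> set of document ids (illustrative only)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        for token in tokenize(text):
            self._postings[token].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every token in the query."""
        tokens = tokenize(query)
        if not tokens:
            return []
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return sorted(result)


index = TinySearchIndex()
index.add(1, "Catnip mouse toy for your cat")
index.add(2, "Leather dog leash")
index.add(3, "Scratching post: every cat needs one")

print(index.search("cat"))      # [1, 3]
print(index.search("cat toy"))  # [1]
```

Because the slicing happens once at indexing time, a query becomes a few dictionary lookups and set intersections instead of a scan over every document's text.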
Search engines are persistent, but they are not designed to be particularly durable. Never use a search engine as your primary data store!
For our full-text search we use Elasticsearch at Shopify. Elasticsearch is replicated and distributed out of the box, which makes it easy to scale.
The most important feature of any search engine, though, is that it performs exceptionally well on text searches.
To learn more about how full-text search engines achieve this speedy performance, you can check out Toria's lightning talk from StarCon 2019.
When to use a Full-Text Search Engine
If you've found yourself writing SQL queries with a lot of wildcard matches (e.g. "SELECT * FROM products WHERE description LIKE '%cat%'" to find cat-related products) and you're thinking about brushing up your natural-language processing skills to improve the results... you might need a search engine!
Search engines are also quite effective at searching and filtering by exact text matches or numerical values, but databases are fantastic at that too. A full-text search engine's real value-add is when you need to look for specific words or substrings within longer text fields.
Message Queue
The last type of data store you may want to use is a message queue. You might be surprised to see message queues on this list, since they are usually considered more of a data transfer tool than a data storage tool, but message queues store your data with as much reliability and persistence as some of the other tools we've already discussed!
For all of our streaming needs we use Kafka at Shopify. Payloads called "messages" are inserted by "producers" into Kafka "topics." On the other end, Kafka "consumers" can read messages from a topic in the same order in which they were inserted.
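The producer, topic, and consumer roles can be sketched with a single in-memory log. This is a toy model, not the Kafka client API: the Topic and Consumer classes are hypothetical, and the sketch ignores partitions, brokers, and durability entirely.

```python
class Topic:
    """Toy topic: an append-only list of messages, addressed by offset."""

    def __init__(self, name):
        self.name = name
        self._log = []

    def produce(self, message):
        """Append a message and return the offset it was stored at."""
        self._log.append(message)
        return len(self._log) - 1

    def read_from(self, offset, max_messages=10):
        return self._log[offset:offset + max_messages]


class Consumer:
    """Toy consumer: remembers its own position in the topic."""

    def __init__(self, topic):
        self._topic = topic
        self._offset = 0

    def poll(self, max_messages=10):
        """Return the next batch in insertion order and advance the offset."""
        batch = self._topic.read_from(self._offset, max_messages)
        self._offset += len(batch)
        return batch


orders = Topic("orders")
for message in ["order 1001 created", "order 1001 paid", "order 1002 created"]:
    orders.produce(message)

billing = Consumer(orders)
analytics = Consumer(orders)

print(billing.poll(max_messages=2))  # ['order 1001 created', 'order 1001 paid']
print(billing.poll())                # ['order 1002 created']
print(len(analytics.poll()))         # 3
```

Reading does not remove anything from the log, so each consumer tracks its own offset and can work through the same messages at its own pace.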
Kafka is typically treated as a message queue, and it rightly belongs in our message queue section, but technically it is not a queue. It is a distributed log, which means we can do things like set a data retention time of "forever" and compact our messages by key (retaining only the latest value for each key). Do that, and we basically have a key-value document store!
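Compaction by key can be sketched as a pure function over a list of (key, value) records. This is only a model of the idea, with hypothetical function names and sample data; real log compaction runs in the background on the broker and follows its own configuration.

```python
def compact(log):
    """Keep only the latest record for each key, preserving log order."""
    latest_index = {}
    for index, (key, _value) in enumerate(log):
        latest_index[key] = index  # later records overwrite earlier ones
    return [
        record for index, record in enumerate(log)
        if latest_index[record[0]] == index
    ]


def as_key_value_view(log):
    """Replaying a log from the start yields a key-value view of the data."""
    return {key: value for key, value in log}


log = [
    ("shop:1", "open"),
    ("shop:2", "open"),
    ("shop:1", "closed"),
    ("shop:3", "open"),
]

compacted = compact(log)
print(compacted)
# [('shop:2', 'open'), ('shop:1', 'closed'), ('shop:3', 'open')]
print(as_key_value_view(compacted) == as_key_value_view(log))  # True
```

The compacted log is shorter but replays to the same key-value view as the full log, which is why a compacted topic with unlimited retention can behave like a key-value store.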
Although there are some valid use cases for such a design, a message queue is probably not the best tool for the job if what you need is a key-value document store. If you need to ship some data between services in a way that is quick, reliable, and distributed, you should use a message queue.
When to use a Message Queue
When you need to store, queue, or transfer data temporarily, use a message queue.
If the data is very simple and you're just storing it for later use within the same service, you might consider using a key-value store like Redis. If that same simple data is very important, you might consider Kafka instead, since Kafka is more robust and persistent than Redis. You might also consider Kafka for a very large amount of simple data, since its distributed partitions make Kafka easier to scale.
Kafka is often used for transmitting data between services. The producer-consumer model has a big advantage over other solutions: because Kafka itself acts as the message broker, you can simply ship your data into Kafka and the receiving service can then poll for updates. If you tried to use something simpler, like Redis, you would have to implement some kind of notification or polling mechanism yourself, whereas Kafka has this built in.
These five are not the be-all and end-all of data stores, but we believe they are the most common and useful. Knowing these five types of data stores will put you well on your way to making great design decisions!