So far, it\’s RoR

Ruby on Rails, PostgreSQL, SVN, etc….

Database scaling terminology

Posted by Chirag Patel on August 16, 2007

Some quick explanations on designing scalable databases (thanks to Jon Sime of the PostgreSQL mailing list and Media Matters for America)

Horizontal vs Vertical scaling

Horizontal scaling is merely a general concept which, in its simplest form, means “adding more nodes” as opposed to vertical scaling which (very roughly) means making an individual node as powerful as possible. The two are not mutually exclusive (you could add more nodes, and make each one as powerful as possible — if you’ve got the money), and the concept is not limited to databases.

Horizontal scaling: 3 types of “clustering”

  1. Partitioning (most likely pretty relevant when talking about terabytes of data) is a feature which stores the actual data of a logical table into any number of physical tables (within the same database), each of which store an explicitly defined subset of the data.
    • Note: Partitioning is referred to as clustering in O’Relly’s Building Scalable Web Sites. The book also refers to clustering and federation as types of partitioning. In other words, the terms partitioning and clustering are interchanged.
  2. Federation typically refers to a situation where you have multiple datastores (which can be different databases on one machine, entirely different servers, various mixtures of storage formats, etc.) which each store and are responsible for subsets of your data, with a middleware layer on top of them that mediates access and updates to the underlying and disparate repositories. A certain software package also uses the term to refer to individual non-local tables stored on a single remote server which are presented as if they were local tables.
  3. Replication generally indicates a setup where data is managed across a set of two or more nodes, each node either storing copies of all the data locally, or at least able to access the non-local portions transparently from another source when necessary. Each node in the cluster should be able to present a consistent version of the entire dataset to connected clients (i.e. if you connect to two different nodes and issue the same query at the same time to each, you would get the same data back from both). The issue of propagating all changes to the data from one node to all others in a fail-safe and consistent way makes replication generally very complicated. Specifically in the context of an index, replication can also mean ordering the data in a table based on the values of an indexed column.
    • Note: Replication was originally referred to as clustering by Jon Sime

If anyone has an disagreement (or agreement) of these term’s please leave a comment. I would love to hear your feedback!


One Response to “Database scaling terminology”

  1. Chirag Patel said

    Check out these flickr links

    “Lessons learned” section here is especially good:

    Web video

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: