Tuning Meteor Mongo Livedata for Scalability

New performance tuning parameters have shipped in 1.3

Originally published by Nick Martin on the Meteor Blog:

One of the most frequent questions I am asked about Meteor is “how does it scale?” The answer is always the same: “it depends on your data access patterns.” In this article, I will explain the various modes Meteor has for running realtime mongo queries, how to pick which is right for your data access pattern, and the new parameters available in Meteor 1.3 to tune the system for maximum performance.

TL;DR — I just want to know the new options

Use the `disableOplog: true` option to `collection.find()` to opt out of oplog tailing on a per-query basis. Use the `pollingInterval` and `pollingThrottle` options to the same function to adjust how often the poll-and-diff driver queries the database.
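
For example, a server-side query using all three options might look something like this sketch (the collection name and values are hypothetical; the option names follow this article's wording, so double-check the `collection.find` docs for your Meteor version for the exact spelling):

```js
// A minimal sketch of the new options on a server-side find().
// `Rooms` and all values are hypothetical.
Rooms.find({ public: true }, {
  disableOplog: true,     // force poll-and-diff for this query only
  pollingInterval: 30000, // poll for external writes every 30 seconds
  pollingThrottle: 100,   // re-run at most once per 100ms for local writes
});
```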

How does Meteor turn mongo into a realtime database?

Before getting to how to tune the data system, it is important to understand how Meteor queries MongoDB and provides a realtime publish/subscribe API to clients on top of a database that doesn’t have any built-in realtime features.

There are two main modes of operation: poll-and-diff and oplog-tailing. Poll and diff is the simplest, can be used on any mongo database, and works well for single-server operation. Oplog tailing is more complicated, and provides real-time updates across multiple servers.

Poll and diff

The poll-and-diff driver works by repeatedly running your query (polling) and computing the difference between new and old results (diffing). The server will re-run the query every time another client on the same server does a write that could affect the results. It will also re-run periodically to pick up changes from other servers or external processes modifying the database. Thus poll-and-diff can deliver realtime results for clients connected to the same server, but it introduces noticeable lag for external writes (the default is 10 seconds; see below for more on this parameter). This may or may not be detrimental to the UX, depending on the application (e.g., bad for chat, fine for todos).
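
Conceptually, each observed query goes through a cycle like the following sketch (this is a simplification for illustration, not Meteor's actual internals; `diffResults` is a hypothetical helper standing in for Meteor's diffing logic):

```js
// Simplified illustration of one poll-and-diff cycle.
function pollAndDiffOnce(cursor, previousDocs, callbacks) {
  const currentDocs = cursor.fetch();                      // "poll": re-run the query
  const changes = diffResults(previousDocs, currentDocs);  // "diff": compute added/changed/removed
  changes.forEach((change) => callbacks[change.kind](change.doc));
  return currentDocs;
}
// This runs whenever another client of the same server performs a write that
// could affect the query, and at least once per polling interval to pick up
// external writes.
```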

This approach is simple and delivers easy-to-understand scaling characteristics. However, it does not scale well with lots of users and lots of data. Because each change causes all results to be refetched, CPU time and network bandwidth scale O(N^2) with users. Meteor automatically de-duplicates identical queries, though, so if each user runs the same query, the results can be shared.

Oplog tailing

Oplog tailing — first introduced in Meteor 0.7 — works by reading the mongo database’s replication log that it uses to synchronize secondary databases (the ‘oplog’). This allows Meteor to deliver realtime updates across multiple hosts and scale horizontally.

The oplog is global per database cluster and requires admin permissions to access, so many users never set it up and rely on poll-and-diff for realtime updates. To enable oplog tailing, pass the `MONGO_OPLOG_URL` environment variable to the meteor process. When this environment variable is passed, Meteor defaults all queries to use oplog tailing. Before Meteor 1.3, this was all or nothing — new in 1.3 is the `disableOplog` option to `collection.find()` that allows tuning this on a per-query basis.

Since each Meteor process must read the whole oplog, it is desirable to have fewer, larger servers instead of many smaller servers when using oplog tailing. This minimizes the amount of duplicated work and saves CPU and network bandwidth on the MongoDB side.

Additionally, the oplog driver sometimes needs to re-fetch items from the database. These are documents that are not currently in the results, but could be as a result of an incoming modification. When it does so, it fetches the document by `_id`, which is guaranteed to be indexed, but this can still result in large loads on the database in some cases. Complex queries, especially those with limit or sort parameters, may be more efficient with poll-and-diff. See below on how to disable oplog tailing on a per-query basis.

Tuning for your data access patterns

As you can see, each method has its tradeoffs. In Meteor 1.3, you can switch between them on a per-query basis. So how do you decide which one to use for a particular query?

The first major choice is whether you want to use oplog tailing for any of your queries. If you are on a shared Mongo database and do not have admin access, oplog tailing is not an option. If you have a database with a very high write rate, enabling oplog tailing might not be a good idea — whether or not you query a particular collection, Meteor must read the whole oplog if you have even one oplog-driven query.

As a general guideline, it’s usually best to start with oplog tailing if at all possible, and selectively disable it on problematic queries. To enable oplog tailing in production, pass the `MONGO_OPLOG_URL` environment variable to Meteor at process start time (See Setting Up MongoDB Oplog Tailing for more details on setting up your oplog).

If you can’t turn on oplog tailing for whatever reason, you can skip the next section and go straight to tuning poll-and-diff.

Tuning oplog tailing

There isn’t actually much tuning to be done about how the oplog driver operates. Oplog tailing works in most use cases, but there are a few different ways it can go wrong when scaling. One of the most common tasks for making a Meteor app scale is to disable oplog tailing on particularly troublesome queries.

To disable oplog tailing on a per-query basis, add the option `disableOplog: true` to a `collection.find()` call in a publish function on the server (or any other server side observe call).
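
For example (the publication and collection names here are hypothetical):

```js
// Hypothetical publication that opts this one query out of oplog tailing.
Meteor.publish('newsFeed', function () {
  return Posts.find(
    { visibility: 'public' },
    { disableOplog: true, sort: { createdAt: -1 }, limit: 30 }
  );
});
```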

There are a few different ways oplog tailing can end up as a scaling bottleneck:

  1. Large bulk updates
  2. High write rate on individual documents (“hot spots”)
  3. Queries that cause the oplog driver to fetch many documents by `_id`

These will manifest as either high CPU load in the meteor application server, or high load and lots of bandwidth on the MongoDB server.

Large bulk updates — when something in your system does an update, insert, or remove that affects a lot of documents at once — can be problematic for oplog tailing. Each document that is modified is a separate entry in the oplog that must be read and processed. Typically the user doesn’t care about the order of the updates, or about seeing the intermediate states. With poll-and-diff, the user will see batched updates. With oplog tailing, the user will see each modification as a separate update, which can be very expensive in terms of bandwidth and CPU. If you do large bulk updates and see increased latency for your users correlated with these updates, try disabling oplog tailing on all queries to the collections which receive bulk updates.
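
For instance, a single multi-document write like the hypothetical one below produces one oplog entry per matched document, and every application server tailing the oplog must read and process each of them:

```js
// Hypothetical bulk update: one call, potentially thousands of oplog entries.
Tasks.update(
  { projectId: 'legacy-import' },
  { $set: { archived: true } },
  { multi: true }
);
```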

Even without large batch updates it is possible to overwhelm your application servers with many writes to documents read by many users. In general, if you have queries whose documents receive high-volume updates and the end user doesn’t care about receiving individual updates (as opposed to batch updates), consider disabling oplog and using poll-and-diff for those queries.

Overwhelmed application servers can be detected by looking at CPU usage. Because Node applications are single-threaded, a single application server hitting 100% usage for several seconds can result in user-visible latency. Since all application servers are reading the same oplog, a spike in oplog writes will affect all servers at approximately the same time, which can exacerbate user-perceived latency. And even if your average CPU usage is low, spiky CPU usage can still cause latency.

Occasionally we see applications having problems with user-reported latency, even though the average CPU load on the application servers does not show overloading. Because many Meteor workloads are bursty, it’s possible that an application server process is hitting 100% CPU usage and causing user latency, but only for brief intervals. Here’s a Linux shell snippet I’ve used to look at process CPU usage every second — if you see multiple seconds in a row of 100% usage, it’s a good indication your users are seeing increased latency for some requests.

top -b -n 600 -d 1 -p <PID of node process> | grep -v ,

It is also possible for oplog tailing to be a scaling bottleneck without maxing out application server CPU. Because each oplog entry does not always contain all the information needed to determine whether a particular query needs to be updated, Meteor will sometimes fetch individual documents from Mongo by `_id`. This lookup is guaranteed to be indexed and is usually not the main source of load on Mongo. However, with some query patterns it is possible for thousands of simultaneous fetches to be issued, adding notable latency to some requests. To illustrate why this is required, consider the query `Animals.find({type: "dog", color: "blue"})`. When an oplog entry arrives that sets some animal's `color` to `"blue"`, Meteor does not know whether the animal in question is a dog, so it must fetch the document from the database to check its `type`.
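
Sketched out, the situation looks roughly like this (the oplog entry below is abbreviated to the relevant fields):

```js
// The query being observed: only blue dogs.
Animals.find({ type: 'dog', color: 'blue' });

// An update oplog entry only records which fields changed, roughly:
//   { op: 'u', ns: 'mydb.animals', o2: { _id: someId }, o: { $set: { color: 'blue' } } }
// It says nothing about `type`, so Meteor must fetch the document by _id
// to decide whether this animal now belongs in the result set.
```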

Note that this is dependent on both the write pattern coming from the oplog and which queries the Meteor app is observing at any given time. For example, queries with sort and limit (such as “high score” queries) can be problematic, since they change what information the oplog tailer needs in order to determine whether the result set has changed. Querying by `_id` is best; the oplog tailer can be very efficient when all or most queries are by `_id`.

Typically when this issue arises, it manifests as high latency and deteriorating performance without hitting 100% CPU usage in the Meteor application server. Check your MongoDB server metrics and look for Mongo saturation (excessive CPU, network usage, or query rate). If your MongoDB provider does not expose an interface for this, you can also try using the Mongo query profiler to get a sense of what is running.

There is one user-tunable parameter for oplog tailing — the `METEOR_OPLOG_TOO_FAR_BEHIND` environment variable. This is a safety belt to help control cases where large bulk updates or app server saturation cause Meteor to fall behind in oplog processing. If the queue of unprocessed oplog messages gets too big, the server will “give up” and initiate a poll-and-diff cycle in an attempt to catch up with the oplog. The default value is 2000. If you have bursty writes, or regularly saturate your CPU, you may want to try adjusting this up or down. That said, I have not actually seen this make a huge difference in production apps — usually the solution to oplog scaling issues is to change how the oplog is used by altering queries or disabling oplog tailing for certain queries.

Tuning poll-and-diff

Unlike oplog tailing, poll-and-diff has a couple of easy knobs you can turn to tune performance. These are two time constants whose appropriate values depend entirely on the application: `pollingInterval` and `pollingThrottle`.

Polling Interval

The `pollingInterval` parameter to a `find()` query controls how often each Meteor server will poll the database for changes from other servers or external processes. The default is 10 seconds. The smaller this number, the more work each Meteor server will do and the more load there will be on the database. The tradeoff is that the higher you set the number, the longer it can take for changes made by one client to be reflected in the interface of another client. If your application does not care about cross-client changes being close to real-time, you can set this very high (minutes or more).
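
As a sketch, a publication whose data only needs to reflect external writes within a minute or so might look like this (names and values are hypothetical; the option naming follows this article's description):

```js
// Hypothetical: cross-server freshness of about a minute is acceptable here.
Meteor.publish('jobListings', function () {
  return Jobs.find({ open: true }, {
    disableOplog: true,     // this query uses poll-and-diff
    pollingInterval: 60000, // poll for external changes once a minute
  });
});
```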

In general, you probably want to set this to the highest value your application can tolerate for cross-client latency. Note that this does not include changes made by other clients connected to the same server process — if you use a single server and do not have external mongo writers you can set `pollingInterval` very high to save load.

If you want near real-time behavior, and cannot use oplog tailing for whatever reason, you can set `pollingInterval` quite low, around a second, to provide low latency to your users. Be aware this may result in a lot of load on your database. For example, you probably want to make sure all your queries are indexed.

Polling Throttle

The `pollingThrottle` parameter to a `find()` query controls the minimum time Meteor will wait between runs of the query. The default is 50 milliseconds. To provide near real-time performance when users are connected to the same server process, Meteor will re-run queries when it sees another client do a write that might affect the result of the query. To prevent this from quickly overloading a server with N^2 behavior in many circumstances, each query is throttled and will wait at least `pollingThrottle` milliseconds after a run before re-running — essentially batching updates.

As with `pollingInterval`, a lower number means more CPU and database load. A higher number leads to increased method call latency. Because Meteor ensures all data update messages have been sent to the client before indicating method completion, increasing this number can increase how long it takes the server to respond to clients with method data completion messages.

In general, you should leave this parameter alone unless you have a particular hot query whose re-run frequency you want to reduce. If you have a query that is either complex and expensive in the database or returns a lot of data, you may want to turn this value up to hundreds of milliseconds. If you go over a couple hundred milliseconds, be watchful for parts of the application built without optimistic UI feeling “sluggish”; method return values may be delayed by the batching that raising this parameter causes.
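
As a sketch, throttling an expensive query a little more aggressively might look like this (again, hypothetical names, and option naming per this article):

```js
// Hypothetical: an expensive query whose local re-runs we batch to ~200ms.
Meteor.publish('activityFeed', function () {
  return Activities.find({ audience: 'all' }, {
    disableOplog: true,
    pollingThrottle: 200, // wait at least 200ms between re-runs of this query
  });
});
```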

Avoiding Polling

As an optimization, Meteor can avoid re-running queries when it can conclusively say the contents of a write do not affect the results of a read. If all your reads and writes to a collection include an `_id` field, poll-and-diff can be a very efficient and scalable strategy! This is because Meteor has an optimization specific to the `_id` field, since it can assume the field is immutable. Doing all your reads and writes by `_id` is a big restriction — it gives up many of the advantages of using Mongo — but it can be done in some applications, or at least on some collections within an application.
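
A sketch of that access pattern (collection and fields are hypothetical):

```js
// Every read and write names the document explicitly, so a write to one _id
// can never change the results of a query on a different _id, and Meteor can
// skip re-running unrelated queries entirely.
const gameId = 'game-42';
Games.find({ _id: gameId });                            // read by _id
Games.update({ _id: gameId }, { $inc: { score: 1 } });  // write by _id
```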

Including `_id` in queries can also help improve scaling when using oplog tailing. In general, running queries on `_id` will be faster and have more predictable performance. If it is possible to rework your schema so that `_id`s are used more extensively, especially on hot collections, you may be able to get orders-of-magnitude improvements in performance.

Summary

Starting in Meteor 1.3, there are easily configurable options to fine-tune Meteor’s mongo-livedata parameters on a per-query basis. Tuning these parameters for your data access patterns can be an important step in scaling a Meteor app.

The main choice when optimizing is whether or not to use oplog tailing. Use the new `disableOplog: true` option to `collection.find()` to opt out of oplog tailing on a per-query basis.

For queries not using oplog tailing, there are two new time constants that control how often Meteor polls the database for changes — `pollingInterval` and `pollingThrottle`. See the `collection.find` docs for details.

Meteor 1.3 includes a whole range of benefits — these new query tuning options are just one of the many reasons to upgrade your app today. See the Meteor 1.3 Migration Guide for more information about what’s new in 1.3.

I hope this has been helpful for you — post in the comments if you have more scaling tips or experiences to share!
