The paper describes Google's tools for handling the challenging task of running many experiments simultaneously and includes tidbits on how they launch new features. Some excerpts:
We want to be able to experiment with as many ideas as possible .... It should be easy and quick to set up an experiment ... Metrics should be available quickly so that experiments can be evaluated quickly. Simple iterations should be quick ... The system should ... support ... gradually ramping up a change to all traffic in a controlled way.
One thing I like about the system they describe is that the process of launching is the same as the process for experimentation. That's a great way to set things up, treating everything to be launched as an experiment. It creates a culture where every change to be launched needs to be tested online, and where experiments are treated less as tests to be taken down when done than as candidates to be sent out live as soon as they prove themselves.
[Our] solution is a multi-factorial system where ... a request would be in N simultaneous experiments ... [and] each experiment would modify a different parameter. Our main idea is to partition parameters into N subsets. Each subset is associated with a layer of experiments. Each request would be in at most N experiments simultaneously (one experiment per layer). Each experiment can only modify parameters associated with its layer (i.e., in that subset), and the same parameter cannot be associated with multiple layers ... [We] partition the parameters ... [by] different binaries ... [and] within a binary either by examination (i.e., understanding which parameters cannot be varied independently of one another) or by examining past experiments (i.e., empirically seeing which parameters were modified together in previous experiments).
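To make the layering idea in the excerpt above concrete, here is a minimal sketch in Python. It is not the paper's implementation: the layer names, parameter names, traffic fractions, and the choice to hash the request id together with the layer name are all assumptions made for illustration. It shows the two rules the excerpt states (a parameter belongs to exactly one layer, and an experiment may only touch parameters in its own layer) and a request landing in at most one experiment per layer.

```python
import hashlib

BUCKETS = 1000  # granularity of traffic allocation (an arbitrary choice)

# Hypothetical default parameter values.
DEFAULTS = {"link_color": "black", "font_size": 12, "num_results": 10}

# Hypothetical layers: each owns a disjoint subset of the parameters.
# Each experiment is (name, fraction of traffic, parameter overrides).
LAYERS = {
    "ui": {
        "params": {"link_color", "font_size"},
        "experiments": [
            ("blue_links", 0.10, {"link_color": "blue"}),
            ("big_font", 0.10, {"font_size": 14}),
        ],
    },
    "ranking": {
        "params": {"num_results"},
        "experiments": [
            ("more_results", 0.20, {"num_results": 20}),
        ],
    },
}


def validate_layers(layers):
    """Reject configurations that break the layering rules."""
    owner = {}
    for layer_name, layer in layers.items():
        for param in layer["params"]:
            if param in owner:
                raise ValueError(
                    f"{param} is in both {owner[param]} and {layer_name}")
            owner[param] = layer_name
        total = 0.0
        for exp_name, fraction, overrides in layer["experiments"]:
            stray = set(overrides) - layer["params"]
            if stray:
                raise ValueError(
                    f"{exp_name} modifies parameters outside {layer_name}: "
                    f"{sorted(stray)}")
            total += fraction
        if total > 1.0 + 1e-9:
            raise ValueError(f"{layer_name} allocates more than all traffic")


def bucket_for(request_id, layer_name):
    """Hash the request into a bucket; mixing in the layer name keeps
    assignments in different layers independent (an assumption here)."""
    key = f"{layer_name}:{request_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % BUCKETS


def assign(request_id, layers):
    """Return {layer: (experiment, overrides)}, at most one entry per layer."""
    assigned = {}
    for layer_name, layer in layers.items():
        bucket = bucket_for(request_id, layer_name)
        start = 0
        for exp_name, fraction, overrides in layer["experiments"]:
            end = start + round(fraction * BUCKETS)
            if start <= bucket < end:
                assigned[layer_name] = (exp_name, overrides)
                break
            start = end
    return assigned


def resolve_params(request_id, layers, defaults):
    """Apply the overrides of every experiment the request is in."""
    params = dict(defaults)
    for _exp_name, overrides in assign(request_id, layers).values():
        params.update(overrides)
    return params
```

Calling `validate_layers(LAYERS)` once when the configuration is loaded, then `resolve_params(request_id, LAYERS, DEFAULTS)` per request, gives each request its parameter values. Because layers own disjoint parameters, the overrides from different layers can never conflict.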
Given this infrastructure, the process of evaluating and launching a typical feature might be something like: Implement the new feature in the appropriate binary (including code review, binary push, setting the default values, etc) ... Create a canary experiment (pushed via a data push) to ensure that the feature is working properly ... Create an experiment or set of experiments (pushed via a data push) to evaluate the feature ... Evaluate the metrics from the experiment. Depending on the results, additional iteration may be required, either by modifying or creating new experiments, or even potentially by adding new code to change the feature more fundamentally ... If the feature is deemed launchable, go through the launch process: create a new launch layer and launch layer experiment, gradually ramp up the launch layer experiment, and then finally delete the launch layer and change the default values of the relevant parameters to the values set in the launch layer experiment.
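The last step of that process, ramping up a launch and then folding its values into the defaults, might be sketched as follows. This is a simplified illustration, not the paper's mechanism: the function names, the ramp schedule, and the parameter names are hypothetical, and it ignores how a launch layer interacts with ordinary experiment layers. The point it shows is that with threshold-based bucketing, raising the fraction only adds requests to the launch, so nobody flips back to the old behavior mid-ramp.

```python
import hashlib

BUCKETS = 1000  # granularity of traffic allocation (an arbitrary choice)

# Hypothetical ramp schedule: fraction of traffic at each step.
RAMP = [0.01, 0.05, 0.25, 0.50, 1.00]


def launch_bucket(request_id, launch_name):
    """Stable bucket for a request within a given launch."""
    key = f"{launch_name}:{request_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % BUCKETS


def in_launch(request_id, launch_name, fraction):
    """True if the request falls under the current ramp threshold."""
    return launch_bucket(request_id, launch_name) < round(fraction * BUCKETS)


def params_for(request_id, defaults, launch_name, launch_overrides, fraction):
    """Defaults, with the launch values applied to the ramped-up traffic."""
    params = dict(defaults)
    if in_launch(request_id, launch_name, fraction):
        params.update(launch_overrides)
    return params


def finish_launch(defaults, launch_overrides):
    """After the ramp reaches all traffic, the launch values become the
    new defaults and the launch itself can be deleted."""
    new_defaults = dict(defaults)
    new_defaults.update(launch_overrides)
    return new_defaults
```

Stepping through `RAMP` and calling `params_for` with each fraction in turn exposes a growing, nested set of requests to the new values. Once the fraction is 1.0, `finish_launch(defaults, launch_overrides)` yields the defaults to ship.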
We use real-time monitoring to capture basic metrics (e.g., CTR) as quickly as possible in order to determine if there is something unexpected happening. Experimenters can set the expected range of values for the monitored metrics (there are default ranges as well), and if the metrics are outside the expected range, then an automated alert is fired. Experimenters can then adjust the expected ranges, turn off their experiment, or adjust the parameter values for their experiment. While real-time monitoring does not replace careful testing and reviewing, it does allow experimenters to be aggressive about testing potential changes, since mistakes and unexpected impacts are caught quickly.
Another thing I like is the real-time availability of metrics and the ability to change experiment configurations very quickly. Not only does that allow experiments to be shut down quickly if they are having a surprisingly bad impact, which lowers the cost of errors, but it also speeds up learning from the data and iterating on the experiment.
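The monitoring described above, expected ranges per metric with defaults and an alert when a value falls outside them, could look roughly like this. It is only a sketch: the class name, the default CTR range, and the alert format are made up for illustration, and a real system would feed this from streaming data rather than from a dictionary.

```python
from dataclasses import dataclass, field

# Hypothetical default expected ranges, used when the experimenter
# has not set a range of their own.
DEFAULT_RANGES = {"ctr": (0.01, 0.10)}


def ctr(clicks, impressions):
    """Click-through rate; 0.0 when there are no impressions yet."""
    return clicks / impressions if impressions else 0.0


@dataclass
class ExperimentMonitor:
    name: str
    # Per-experiment ranges that override the defaults.
    ranges: dict = field(default_factory=dict)

    def expected_range(self, metric):
        """Experimenter's range if set, else the default, else None."""
        return self.ranges.get(metric, DEFAULT_RANGES.get(metric))

    def check(self, observed):
        """Return one alert message per metric outside its expected range."""
        alerts = []
        for metric, value in observed.items():
            expected = self.expected_range(metric)
            if expected is None:
                continue  # metric is not monitored
            low, high = expected
            if not low <= value <= high:
                alerts.append(
                    f"{self.name}: {metric}={value:.4f} "
                    f"outside [{low}, {high}]")
        return alerts
```

An experimenter who gets an alert from `check` has the same options the paper lists: widen the range by setting `monitor.ranges["ctr"]`, turn the experiment off, or change its parameter values.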
Finally, the use of standardized metrics across experiments and an "experiment council" of experts who can be consulted to help interpret the experimental results is insightful. Results of experiments are often subject to some interpretation, unfortunately enough that overly eager folks at a company can torture the data until it says what they want (even when they are trying to be honest), so an effort to help people keep decisions objective is a good idea.
One minor but surprising tidbit in the paper is that binary launches are infrequent ("weekly"); only configuration files can be pushed frequently. I would have thought they would push binaries daily. In fact, reading between the lines a bit, it sounds like developers might have to do a bit of extra work to deal with infrequent binary pushes, trying to anticipate what they will want during the experiments and writing extra code that can be enabled or disabled later by configuration file, which might interfere with their ability to rapidly learn and iterate based on the experimental results. It also may cause the configuration files to become very complex and bug-prone, which is alluded to in the section of the paper talking about the need for data file checks. In general, very frequent pushes are desirable, even for binaries.
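As an illustration of the kind of data file check that could guard against such configuration bugs, here is a small sketch. The paper does not spell out what its checks do, so everything here is an assumption: the registry of known parameters, their types, and the error messages are hypothetical. The idea is simply to reject a pushed configuration that names a parameter the binary does not know or gives it a value of the wrong type.

```python
# Hypothetical registry of the parameters the current binary understands,
# with the type each one expects.
KNOWN_PARAMS = {
    "enable_new_snippet": bool,
    "max_results": int,
    "link_color": str,
}


def check_config(config):
    """Return a list of problems found in a pushed configuration."""
    errors = []
    for key, value in config.items():
        expected = KNOWN_PARAMS.get(key)
        if expected is None:
            errors.append(f"unknown parameter: {key}")
        # Exact type match, so that True is not accepted where an int
        # is expected (bool is a subclass of int in Python).
        elif type(value) is not expected:
            errors.append(
                f"{key}: expected {expected.__name__}, "
                f"got {type(value).__name__}")
    return errors
```

A push would go ahead only when `check_config(config)` returns an empty list.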