We’ve rolled out a new version of the Self-Serve Telemetry Data Analysis Dashboard with an improved interface and some new features.
Now you can schedule analysis jobs to run on a daily, weekly, or monthly basis and publish results to a public-facing Amazon S3 bucket.
Here’s how I expect that the analysis-scheduling capability will normally be used:
slowsql. You can add your username to create a unique job name if you like (e.g. mreid-slowsql).
daily frequency.
monthly frequency.
A concrete example of an analysis job that runs using the same framework is the SlowSQL export. The package.sh script creates the code tarball for this job, and the run.sh script actually runs the analysis on a daily basis.
To schedule the SlowSQL job using the above form, I would first run package.sh to create the code tarball, then fill in the form as follows:
slowsql
slowsql-0.3.tar.gz
./run.sh
output – this directory is created in run.sh, and data is moved here after the job finishes
Daily
175 – typical runs during development took around 2 hours, so we wait just under 3 hours
The daily data files are then published in S3 and can be used from the Telemetry SlowSQL Dashboard.
The job runner doesn’t care if your code uses the python MapReduce framework or your own hand-tuned assembly code. It is just a generalized way to launch a machine to do some processing on a scheduled basis.
So you’re free to implement your analysis using whatever tools best suit the job at hand.
The sky is the limit, as long as the code can be packaged up as a tarball and executed on an Ubuntu box.
The pseudo-code for the general job logic is
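Roughly, the job logic can be sketched in Python as follows. This is an illustration only, not the actual job runner: the name run_job and its parameters are made up here to mirror the form fields, and the step that uploads the output files to S3 is left out.

```python
import os
import subprocess
import tarfile


def run_job(tarball, command, output_dir, timeout_minutes, work_dir):
    """Unpack the code tarball, run the command, and list the output files.

    Hypothetical helper: the parameters mirror the form fields, nothing more.
    """
    # Unpack the job's code into a scratch directory.
    with tarfile.open(tarball) as tar:
        tar.extractall(work_dir)

    # Run the job from that directory. subprocess raises TimeoutExpired
    # if the command is still running when the timeout is reached.
    subprocess.run(command, cwd=work_dir, check=True,
                   timeout=timeout_minutes * 60)

    # Everything under the output directory is what would be published.
    results = []
    out_root = os.path.join(work_dir, output_dir)
    for folder, _dirs, files in os.walk(out_root):
        for name in files:
            results.append(os.path.join(folder, name))
    return sorted(results)
```

For the SlowSQL example, the call would look something like run_job("slowsql-0.3.tar.gz", ["./run.sh"], "output", 175, some_scratch_dir), with the returned files then copied to S3.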
One final reminder: Keep in mind that the results of the analysis will be made publicly available on Amazon S3, so be absolutely sure that the output from your job does not include any sensitive data.
Aggregates of the raw data are fine, but it is very important not to output the raw data.
Here’s a graph of the submission rate from the nightly channel (where most of the action has taken place) over the past 60 days:
The x-axis is time, and the y-axis is the number of submissions per hour.
Points A (December 17) and B (January 8) are false alarms: in both cases the stats logging itself was interrupted, so they don’t represent data outages.
Point C (January 16) is where it starts to get interesting. In that case, Firefox nightly builds stopped submitting Telemetry data due to a change in event handling when the Telemetry client-side code was moved from a .js file to a .jsm module. The resolution is described in Bug 962153. This outage resulted in missing data for nightly builds from January 16th through to January 22nd.
As shown on the above graph, the submission rate dropped noticeably, but not anywhere close to zero. This is because not everyone on the nightly channel updates to the latest nightly as soon as it’s available, so an interesting side-effect of this bug is that we can see a glimpse of the rate at which nightly users update to new builds. In short, it looks like a large portion of users update right away, with a long tail of users waiting several days to update. The effect is apparent again as the submission rate recovers starting on January 22nd.
The second problem with submissions came at point D (February 1) as a result of changing the client Telemetry code to use OS.File for saving submissions to disk on shutdown. This resulted in a more gradual decrease in submissions, since the “saved-session” submissions were missing, but “idle-daily” submissions were still being sent. This outage resulted in partial data loss for nightly builds from February 1st through to February 7th.
Both of these problems have been limited to the nightly channel, so the actual volume of submissions that were lost is relatively low. In fact, if you compare the above graph to the graph for all channels:
The anomalies on January 16th and February 1st are not even noticeable within the usual weekly pattern of overall Telemetry submissions (we normally observe a sort of “double peak” each day, with weekends showing about 15-20% fewer submissions). This makes sense given that the release channel comprises the vast majority of submissions.
The above graphs are all screenshots of our Heka aggregator instance. You can look at a live view of the submission stats and explore for yourself. Follow the links at the top of the iframes to dig further into what’s available.
There was a third data outage recently, but it cropped up in the server’s data validation / conversion processing so it doesn’t appear as an anomaly on the submission graphs. On February 4th, the revision field for many submissions started being rejected as invalid. The server code was expecting a revision URL of the form http://hg.mozilla.org/..., but was getting URLs starting with https://... and thus rejecting them. Since revision is required in order to determine the correct version of Histograms.json to use for validating the rest of the payload, these submissions were simply discarded. The change from http to https came from a mostly-unrelated change to the Firefox build infrastructure. This outage affected both nightly and aurora, and caused the loss of submissions from builds between February 4th and when the server-side fix landed on February 12th.
So with all these outages happening so close together, what are we doing to fix it?
Going forward, we would like to be able to completely avoid problems like the overly-eager rejection of payloads during validation, and in cases where we can’t avoid problems, we want to detect and fix them as early as possible.
In the specific case of the revision field, we are adding a test that will fail if the URL prefix changes.
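As a rough idea of what such a test could look like, here is a small Python sketch. It is not the server’s actual validation code: the function names are invented for illustration, and the sample URLs passed in would need to come from real builds.

```python
# Prefixes the server is willing to accept. The outage happened because
# only the plain "http://" form was accepted.
ACCEPTED_PREFIXES = ("http://hg.mozilla.org/", "https://hg.mozilla.org/")


def is_accepted_revision(url):
    """Return True if the revision URL starts with a known prefix."""
    return url.startswith(ACCEPTED_PREFIXES)


def rejected_revisions(sample_urls):
    """Return the sample URLs that the server would reject.

    A test would feed in revision URLs taken from recent builds and fail
    if this list is not empty.
    """
    return [url for url in sample_urls if not is_accepted_revision(url)]
```

The point of checking samples from current builds, instead of a hard-coded string, is that the test then breaks when the build infrastructure changes what it emits.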
In the case where the submission rate drops, we are adding automatic email notifications which will allow us to act quickly to fix the problem. The basic functionality is already in place thanks to Mike Trinkala, though the anomaly detection algorithm needs some tweaking to trigger on things like a gradual decline in submissions over time.
Similarly, if the “invalid submission” rate goes up in the server’s validation process, we want to add automatic alerts there as well.
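To make the idea concrete, here is one simple way such a check could work, sketched in Python. This is not the algorithm used in Heka; the function name, the 24-hour window, the one-week period and the 0.7 threshold are all assumptions chosen for the example. It compares the most recent hours against the same hours one week earlier, which keeps the usual daily and weekly pattern from triggering false alarms.

```python
def rate_dropped(hourly_counts, window=24, period=168, threshold=0.7):
    """Return True if recent submissions fell well below last week's level.

    hourly_counts: submissions per hour, oldest first.
    window:        number of most recent hours to examine.
    period:        how far back the reference window sits (168 = one week).
    threshold:     alert when recent < threshold * reference.
    """
    if len(hourly_counts) < period + window:
        return False  # not enough history to compare against

    recent = hourly_counts[-window:]
    reference = hourly_counts[-(period + window):-period]
    return sum(recent) < threshold * sum(reference)
```

A check like this catches a sharp drop within a day. A slow decline spread over several weeks would slip past it, since each week is only compared with the one before.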
With these improvements in place, we should see far fewer data outages in Q2 and beyond.
While poking around at various graphs to document this post, I noticed that Bug 967203 was still affecting aurora, so the fix has been uplifted.
The Telemetry Server has been deployed on AWS for just over a month now, so it’s time for an update.
The server code repository has been moved into the Mozilla GitHub group, and the mreid-moz repo forwards there, so the change should be seamless.
The Telemetry dashboards have also moved! They are now located at telemetry.mozilla.org, nice and easy to remember.
Moving on to more interesting news, anyone with an @mozilla.com email can now run their own Telemetry analysis jobs in the cloud. The procedure is still very much in alpha/beta state, but if you’ve got a question that can be answered using Telemetry data, you’re in luck.
Jonas has built a mechanism for provisioning an Ubuntu server as an Amazon EC2 instance. These machines (c3.4xlarge in EC2 terms) have read-only permission and a fast connection to Telemetry data stored in S3. Each machine will be available for 24 hours, and will cost about $40 USD to run. If you don’t need it for the full day, you can kill it early by following the instructions in the webapp below.
Here’s how it works:
@mozilla.com email address as mentioned above).
Server Name field should be a short descriptive name; something like ‘mreid chromehangs analysis’ is good. Upload your SSH public key (this allows you to log in to the server once it’s started up).
Submit.
OK, that’s all well and good, but what is an analysis job?
The easiest kind of job (and probably the most familiar to anyone who has worked with Telemetry data in the past) is a MapReduce job.
This requires a bit of setup on the machine you provisioned above:
mapreduce/examples/*.json. Here is a reasonably selective one.
A few notes for successful jobs:
nightly data.
<work-dir>/cache and add the “--local-only” parameter to skip downloading files from S3 every time.
One final note: the Telemetry MapReduce framework is a simple way to download the set of records you are interested in, and do something for each record in that data set.
If you don’t want to do your analysis with this framework, you can just use it to download the data (or even skip it altogether and download data using the AWS command-line tools directly). Once the data has been downloaded to the machine, you’re free to analyze it using whatever other language / tools you’re comfortable with.
Happy Data Crunching!
Since my last update, there have been a few last-minute code changes to get things in ship-shape for deployment. The bulk of those changes were to the scripts used to provision machines on Amazon’s EC2 infrastructure, but there was one more structural change of note.
The logic for processing incoming submissions (that’s the “validation, conversion, and compression” part) was previously controlled by a master process which would launch a worker node to do the actual processing. Without an easy way for masters to co-ordinate, it was difficult to launch extra workers in cases where the rate of processing was not keeping up, since each master expected its worker to process all available data.
The solution was to switch to using a queue to keep track of data available to be processed, and having the worker nodes claim data from the queue. This results in a nicely decoupled architecture, where starting up more workers (or killing off idle ones) is clean and easy.
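The pattern is easy to show in miniature. The Python sketch below uses threads and an in-process queue purely as a stand-in for worker nodes and a shared queue service; the function names are invented for this example and none of it is taken from the telemetry-server code.

```python
import queue
import threading


def worker(name, pending, processed, lock):
    """Claim items from the shared queue until none are left."""
    while True:
        try:
            item = pending.get_nowait()
        except queue.Empty:
            return  # nothing left to claim, so this worker can exit
        # A real worker would validate, convert and compress here.
        with lock:
            processed.append((name, item))
        pending.task_done()


def process_all(items, worker_count):
    """Process every item using the given number of workers."""
    pending = queue.Queue()
    for item in items:
        pending.put(item)

    processed = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker,
                         args=("worker-%d" % i, pending, processed, lock))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed
```

Because no worker is assigned a fixed share of the data, changing worker_count is the only thing needed to scale up or down, and each item is still handled exactly once.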
Anyways, getting back to the main point… The cutover is complete, and the Telemetry submissions are now going to “The Cloud!”
It turns out that the node.js version of the web server is efficient enough to allow us to handle the entire volume of traffic using only a pair of “t1.micro” nodes in EC2 (behind a load balancer). Pretty slick.
So far, running on AWS has been pretty nice. The Elastic Load Balancers make it nearly trivial to add or remove nodes from the pool, and include useful (if limited) monitoring. On the HTTP-serving nodes themselves, we have some more detailed and application-specific monitoring using Heka. The boto library makes it very easy to provision new nodes using Python.
Now that the Telemetry Server is out in the wild, the next step is to get the new Dashboard playing nicely with the new data source. Jonas Finnemann Jensen is working on that.
There’s still more work to do once the dashboard integration is complete, including finishing up the C++ port of the “process incoming” code (which will hopefully provide a large speedup compared to the current python implementation), migrating the provisioning over to Amazon Cloud Formation, creating a frontend for managing/running Telemetry MapReduce jobs, and exporting the historic data from the previous Hadoop backend into the new S3 storage.
There’s now a tracking bug that should list everything that needs to be done before things go live: Bug 911300.
The code has changed a fair bit since my last update. The most noticeable modification is that the HTTP server is now using node.js instead of a python+flask server.
Here is the basic architecture diagram for the system:
Data flows into an incoming bucket, which can then be processed separately by one or more processing nodes, each of which publishes finalized data to a published bucket.
The MapReduce code then reads from the published location for data analysis.
Other changes include updating the “process incoming data” code to take advantage of multiple processors, though it appears that python performance is still a major bottleneck. Fortunately Mike Trinkala is working on porting the conversion code to C++.
Keep an eye on the bug above for up-to-the-minute progress information, or as always, feel free to drop by #telemetry on irc.mozilla.org.
I started on the Metrics team on August 8th, 2011, and more recently moved over to the Performance team. Lots of fun stuff to learn and do, and I’m happier than ever that I joined Mozilla.
Two orders of business in this post. First, another status update on the Telemetry Reboot project.
Things are basically feature-complete on the server now! Very exciting. The high-altitude view looks like this:
Data comes in as HTTPS submissions, and is saved to disk in its raw form. These files are converted to the new Histogram storage format, compressed, then sent to Amazon S3 for permanent storage.
You can run MapReduce jobs on the S3 data (though at the moment there’s a lot more time spent on data transfer than I would like). If you run a MapReduce job on the server node, you can also include up-to-the-minute data that has not yet been exported to S3.
Now that the prototype (which reminds me a bit of this description of the Hideous Creature; warning: strong language) is working, the next step is to benchmark things to make sure we can handle release-level volumes.
The second order of business is an interesting thing I learned about Python File I/O that I thought I’d share.
Normal file writes are not atomic, so if you have a situation where multiple concurrent processes are appending lines to the same file, it’s possible (and in fact very likely) to end up with garbled lines. The Telemetry storage format requires that each record be on its own line, so this caused real problems during testing.
The solution: use io.open(...).
For example, doing this will produce non-atomic writes:
So some_content could actually be written out in multiple disk operations, with other writes by other processes in between.
Instead, to achieve atomic writes, you would use something like this:
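A minimal sketch of the idea follows. The helper name append_record and the choice of unbuffered binary mode are assumptions for illustration, not a copy of the Telemetry code. The key point is that each record reaches the operating system as a single write on a file opened for appending. Note that in Python 3 the built-in open is the same function as io.open, so the difference described here applies to Python 2. Whether concurrent appends can never interleave also depends on the operating system and file system.

```python
import io


def append_record(path, record):
    """Append one record as one line, issuing a single write() call."""
    line = record.rstrip("\n") + "\n"
    # buffering=0 (allowed in binary mode only) returns a raw FileIO
    # object, so the write below goes to the OS as one call instead of
    # being split up by a user-space buffer.
    with io.open(path, "ab", buffering=0) as fout:
        fout.write(line.encode("utf-8"))
```

Each process calls append_record once per record, so a line is never assembled from several separate writes.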
Maybe this is common knowledge for Python folks, but I hope it helps.
First and foremost, the telemetry-server code is now on GitHub.
As of Friday, there is a basic prototype server up and running on an Amazon EC2 instance.
The prototype is able to accept submissions via HTTP, using URLs with or without dimension components. Submissions are then converted to the new storage format and saved to disk in the new storage layout.
As part of the conversion process, the histograms in the payload are validated against the correct revision of Histograms.json. This file is automatically fetched from the Mozilla mercurial server, then cached locally. The RevisionCache class encapsulates this logic.
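The fetch-then-cache idea looks roughly like the Python sketch below. This is not the real RevisionCache: the class name SimpleRevisionCache, its method names and the pluggable fetch function are invented here so the example can run without a network connection.

```python
import json
import os


class SimpleRevisionCache(object):
    """Illustrative cache of Histograms.json contents, keyed by revision."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch   # callable: revision -> JSON text (hypothetical)
        self.loaded = {}     # revisions already parsed in this process

    def _path(self, revision):
        return os.path.join(self.cache_dir, revision + ".json")

    def get_histograms(self, revision):
        # 1. Already in memory?
        if revision in self.loaded:
            return self.loaded[revision]

        # 2. Already on disk from an earlier run?
        path = self._path(revision)
        if os.path.exists(path):
            with open(path) as cached:
                text = cached.read()
        else:
            # 3. Fetch once from the remote server and keep a local copy.
            text = self.fetch(revision)
            os.makedirs(self.cache_dir, exist_ok=True)
            with open(path, "w") as cached:
                cached.write(text)

        self.loaded[revision] = json.loads(text)
        return self.loaded[revision]
```

In a real deployment the fetch callable would download the file for the given revision from the mercurial server. Keeping that step pluggable also makes the cache easy to test.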
The prototype server in its current form is too monolithic, and needs to be split up such that receiving submissions via HTTP is independent from the remainder of the data pipeline.
The first part, receiving and logging HTTP submissions, is a well-understood problem and there are many existing ways to solve it. Some options are:
The second part is already in place and working, with code that should be quite easy to pull out and run separately. One additional piece of functionality that would be nice is to calculate realtime stats before/while data is being persisted.
Other than separating receiving and processing functionality, the next major task will be building a MapReduce framework to run on the Telemetry data. Initially, it will not be a full-blown cluster implementation, since reinventing that particular wheel is a huge task, but rather it will run on a single machine using multiple processes to parallelize map and reduce functionality.
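To give a feel for the shape of such a framework, here is a tiny Python sketch. It is a generic illustration, not the planned Telemetry API: run_mapreduce, the mapper and reducer signatures, and the "channel" field in the example records are all assumptions.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer, parallel_map=map):
    """Run a map phase and a reduce phase over the given records.

    mapper(record)       -> list of (key, value) pairs
    reducer(key, values) -> final result for that key
    parallel_map         -> anything with the same call shape as map()
    """
    # Map phase: each record is handled independently, so this is the
    # part that can be spread across several processes.
    grouped = defaultdict(list)
    for pairs in parallel_map(mapper, records):
        for key, value in pairs:
            grouped[key].append(value)

    # Reduce phase: combine all values collected for each key.
    return {key: reducer(key, values) for key, values in grouped.items()}


def count_by_channel(record):
    """Example mapper: emit one count per record, keyed by channel."""
    return [(record["channel"], 1)]


def total(key, values):
    """Example reducer: add up the counts for one key."""
    return sum(values)
```

Called as run_mapreduce(records, count_by_channel, total), everything runs in one process. Passing the map method of a multiprocessing.Pool as parallel_map spreads the map phase over several processes, as long as the mapper is a top-level function that can be pickled.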
One major advantage of the new storage layout is that MapReduce jobs will be able to filter the desired input data on a number of dimensions basically for free. Jobs that only need to look at a small subset of the data should be very efficient.
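Here is a small Python sketch of why that filtering is cheap. The directory layout used (channel, then version, then day) is a made-up example, not the actual Telemetry storage layout, and select_files is a hypothetical helper. The point is only that files can be chosen by name, without opening any of them.

```python
# Hypothetical layout: <channel>/<version>/<day>/<file>
DIMENSIONS = ("channel", "version", "day")


def select_files(paths, wanted):
    """Keep the paths whose dimension values are all allowed.

    wanted maps a dimension name to a set of allowed values.
    Dimensions that are not mentioned match anything.
    """
    selected = []
    for path in paths:
        # Pair each leading path component with its dimension name.
        values = dict(zip(DIMENSIONS, path.split("/")))
        if all(values.get(dim) in allowed for dim, allowed in wanted.items()):
            selected.append(path)
    return selected
```

A job interested only in nightly data for a single day would pass something like {"channel": {"nightly"}, "day": {"20140201"}} and never touch the files for other channels or dates.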
The first use case of the MapReduce framework is to produce the static data files for the new telemetry frontend.
The general pipeline for Telemetry data looks like this:
My initial focus will be on steps 5 and 6, specifically on improving the persistent storage format to be more space-efficient and to eliminate the need for re-processing to slice the data (by factors like Channel, Version, Day, etc).
I plan to post a link to the code repository here (as soon as there is something useful to share).