In a previous post, I described how to run an ad-hoc analysis on the Telemetry data. There is often a need to re-run an analysis on an ongoing basis, whether it be for powering dashboards or just to see how the data changes over time.
We’ve rolled out a new version of the Self-Serve Telemetry Data Analysis Dashboard with an improved interface and some new features.
Now you can schedule analysis jobs to run on a daily, weekly, or monthly basis and publish results to a public-facing Amazon S3 bucket.
Here’s how I expect that the analysis-scheduling capability will normally be used:
- Log in to telemetry-dash.mozilla.org
- Launch an analysis worker in the cloud
- Develop and test your analysis code
- Create a wrapper script to:
  - Do any required setup (install Python packages, etc.)
  - Run the analysis
  - Put output files in a given directory relative to the script itself
- Download and save your code from the worker instance
- Create a tarball containing all the required code
- Test your wrapper script:
  - Launch a fresh analysis worker
  - Run your wrapper script
  - Verify the output looks good
- Head back to the analysis dashboard and schedule your analysis to run as needed
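As a sketch, a minimal wrapper script along these lines might look like the following. The file names and the analysis step are placeholders of my own, not taken from any actual job:

```shell
#!/bin/bash
# Hypothetical wrapper script for a scheduled analysis job.
# The analysis step below is a stand-in; a real job would install
# its dependencies and invoke its own analysis code instead.
set -e

# Resolve the directory containing this script so that output
# lands in a directory relative to the script itself.
BASE="$(cd "$(dirname "$0")" && pwd)"

# 1. Setup: install any extra packages the bare worker lacks, e.g.
#    sudo apt-get install -y python-numpy

# 2. Run the analysis (placeholder that writes an aggregate CSV).
echo "channel,count" > results.csv
echo "release,12345" >> results.csv

# 3. Move output files into the directory the runner will publish.
mkdir -p "$BASE/output"
mv results.csv "$BASE/output/"
```

The important part is step 3: the runner only publishes what ends up in the designated output directory.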
Dissecting the “Schedule a Job” form
- Job Name is a unique identifier for your job. It should be a short, descriptive string. Think “what would I call this job in a code repo or hostname?” For example, the job that runs the data export for the SlowSQL dashboard is called slowsql. You can add your username to create a unique job name if you like.
- Code Tarball is the archive containing all the files needed to run your analysis. The machine on which the job runs is a bare-bones Ubuntu machine with a few common dependencies installed (git, xz-utils, etc), and it is up to your code to install anything extra that might be needed.
- Execution Commandline is what will be invoked on the launched server. It is the entry point to your job. You can specify an arbitrary Bash command.
- Output Directory is the location of results to be published to S3. Again, these results will be publicly visible.
- Schedule Frequency is how often the job will run.
- Day of Week for jobs running with weekly frequency.
- Day of Month for jobs running with monthly frequency.
- Time of Day (UTC) is when the job will run. Due to the distributed nature of the Telemetry data processing pipeline, there may be some delay before the previous day’s data is fully available. So if your job is processing data from “yesterday”, I recommend using the default value of Noon UTC.
- Job Timeout is the failsafe for jobs that don’t terminate on their own. If the job does not complete within the specified number of minutes, it will be forcibly terminated. This avoids having stalled jobs run forever (racking up our AWS bill the whole time).
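For the Code Tarball field, any archive that unpacks to your wrapper script will do. A hypothetical packaging script might look like this; the directory and file names are illustrative, not from an actual job:

```shell
#!/bin/bash
# Hypothetical packaging script: bundle everything the job needs
# into one gzipped tarball for the "Code Tarball" upload field.
set -e

mkdir -p mycode
cat > mycode/run.sh <<'EOF'
#!/bin/bash
echo "analysis goes here"
EOF
chmod +x mycode/run.sh

# Everything the Execution Commandline needs must be inside the
# archive, since the worker starts from a bare-bones Ubuntu image.
tar czf mycode.tar.gz -C mycode .
```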
A concrete example of an analysis job that runs using the same framework is the SlowSQL export. The package.sh script creates the code tarball for this job, and the run.sh script actually runs the analysis on a daily basis.
In order to schedule the SlowSQL job using the above form, first I would run package.sh to create the code tarball, then I would fill the form as follows:
- Job Name: slowsql
- Code Tarball: select the tarball created by package.sh
- Execution Commandline: ./run.sh
- Output Directory: output – this directory is created in run.sh and data is moved here after the job finishes
- Schedule Frequency: Daily
- Day of Week: leave as default
- Day of Month: leave as default
- Time of Day (UTC): leave as default
- Job Timeout: 175 – typical runs during development took around 2 hours, so we allow just under 3 hours
The daily data files are then published in S3 and can be used from the Telemetry SlowSQL Dashboard.
Beyond Typical Scheduling
The job runner doesn’t care whether your code uses the Python MapReduce framework or your own hand-tuned assembly code. It is just a generalized way to launch a machine to do some processing on a scheduled basis.
So you’re free to implement your analysis using whatever tools best suit the job at hand.
The sky is the limit, as long as the code can be packaged up as a tarball and executed on an Ubuntu box.
The pseudo-code for the general job logic is:

```
launch a bare-bones Ubuntu worker
fetch and unpack the code tarball
run the Execution Commandline (terminating it if it exceeds the Job Timeout)
publish the contents of the Output Directory to S3
shut down the worker
```
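The timeout step, for example, could be enforced with coreutils’ timeout(1). This is just a sketch of the idea, not the scheduler’s actual implementation:

```shell
#!/bin/bash
# Sketch: enforce a job timeout around an arbitrary command using
# coreutils' timeout(1). The real runner's mechanism may differ.
TIMEOUT_MINUTES=175

# timeout exits with status 124 if the command had to be killed.
if timeout "${TIMEOUT_MINUTES}m" bash -c 'echo "analysis running"'; then
    echo "job completed"
else
    echo "job forcibly terminated after ${TIMEOUT_MINUTES} minutes"
fi
```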
One final reminder: Keep in mind that the results of the analysis will be made publicly available on Amazon S3, so be absolutely sure that the output from your job does not include any sensitive data.
Aggregates of the raw data are fine, but it is very important not to output the raw data.
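As an illustration (with made-up data), a safe job emits only per-category counts, never the raw records themselves:

```shell
#!/bin/bash
# Illustrative only: raw.csv stands in for raw Telemetry submissions,
# one record per line. Only the aggregate counts get published.
set -e

printf "release\nbeta\nrelease\n" > raw.csv

# Count submissions per channel; this aggregate is safe to publish,
# while raw.csv itself must never reach the output directory.
sort raw.csv | uniq -c | awk '{print $2 "," $1}' > aggregates.csv
```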