Over the coming weeks I’ll be embarking on a new project to revamp the acquisition, storage, and processing of Mozilla Telemetry data.
The general pipeline for Telemetry data looks like this:
1. A Firefox user enables Telemetry in their browser
2. The browser generates performance data as it is used
3. Once a day, the browser submits the performance data to Mozilla via HTTPS
4. Data is accepted by the server and saved to a queue
5. The queue is polled for changes and payloads are saved to persistent storage
6. Analysis jobs are run against the persisted data, including daily aggregations
7. Results of analysis are visualized in Telemetry Dashboards
My initial focus will be on steps 5 and 6, specifically on improving the persistent storage format so that it is more space-efficient and so that the data can be sliced (by factors like Channel, Version, and Day) without re-processing.
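To make the slicing idea concrete, here is a minimal sketch of one way it could work: each submission is filed under a partition key built from its dimensions, so a later query can pick the relevant partitions by key alone instead of re-reading every payload. This is an illustration only, not the actual storage format. The field names (`channel`, `version`, `day`), the `/`-separated key layout, and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical dimension names; the real payload schema may differ.
DIMENSIONS = ("channel", "version", "day")


def partition_key(submission):
    """Build a path-like partition key from a submission's dimensions.

    Missing dimensions fall back to "UNKNOWN". Values are assumed not to
    contain "/" themselves.
    """
    return "/".join(str(submission.get(dim, "UNKNOWN")) for dim in DIMENSIONS)


def group_by_partition(submissions):
    """File every submission under its partition key (done once, at write time)."""
    partitions = defaultdict(list)
    for submission in submissions:
        partitions[partition_key(submission)].append(submission)
    return dict(partitions)


def select_partitions(partitions, **wanted):
    """Return the partitions whose key matches the requested dimension values.

    Only the keys are inspected; the stored payloads are never parsed.
    """
    selected = {}
    for key, records in partitions.items():
        values = dict(zip(DIMENSIONS, key.split("/")))
        if all(values.get(dim) == str(value) for dim, value in wanted.items()):
            selected[key] = records
    return selected


if __name__ == "__main__":
    # Made-up sample submissions, for illustration only.
    sample = [
        {"channel": "nightly", "version": "27", "day": "20131001"},
        {"channel": "release", "version": "24", "day": "20131001"},
        {"channel": "nightly", "version": "27", "day": "20131002"},
    ]
    stored = group_by_partition(sample)
    for key in sorted(select_partitions(stored, channel="nightly")):
        print(key)
```

With a layout along these lines, a request such as "all nightly submissions" becomes a matter of listing matching keys, which is the kind of slicing that would otherwise require a pass over the full data set.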
I plan to post a link to the code repository here as soon as there is something useful to share.