IETF 109 Technical Retrospective
- Jay Daley, IETF Executive Director
17 Dec 2020
This technical retrospective examines the technical services provided during IETF 109 to help understand the impact of service improvements made between IETF 108 and IETF 109 and what went well and what did not go so well during IETF 109. It concludes with the current plan for improvement to be implemented prior to IETF 110.
For IETF meetings we provide a set of services, each using different technologies and with a different team responsible for service performance:
|Service|Technology|Responsible team|
|---|---|---|
|Network (Hackathon only)|Own kit|NOC|
|Server platform|Multiple platforms|NOC|
|Remote participation service|Meetecho|NOC (Meetecho)|
|Agenda and materials|Datatracker|Tools Team|
|Authentication service|Datatracker + OIDC|Tools Team|
|Group chat|XMPP + trials|Secretariat|
|Social interaction service|Gather|Secretariat|
|Support|RT / Trac|All|
Service improvements made prior to IETF 109
During each IETF meeting we collate feedback from participants and after each meeting we carry out a survey, the results of which are used to pare down the feedback into the priorities that need to be addressed before the next meeting. After IETF 108 and prior to IETF 109, the following changes were made:
1. Redesigned Meetecho user interaction model
In response to detailed feedback, Meetecho redesigned the user interaction model, switching from a speaking queue in which chairs granted permissions to a more fluid model in which chairs controlled the meeting orally while allowing individuals to interrupt as they saw fit.
RESULT: The feedback received during IETF 109 is that these changes addressed all of the major concerns with only minor irritations left.
2. Datatracker improvements to address login issues
During IETF 108 many people had problems with their initial session authentication. In most cases the participant had multiple Datatracker accounts and had registered for the meeting with one but tried to log in with another, which then failed. The Tools Team added code to Datatracker to catch additional cases where registration and person records were not being matched automatically.
RESULT: The number of issues with initial session authentication at IETF 109 was down by over 80% compared with IETF 108.
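The kind of match-up described above can be illustrated with a minimal sketch. This is not the Datatracker code: the `Person` record, the `build_email_index` and `match_registration` helpers, and the case-insensitive comparison of addresses are all assumptions made for illustration. The idea shown is simply that a registration is linked to a person if its address matches any address known for that person, not only the one used to log in.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """Hypothetical person record holding every address known for one individual."""
    person_id: int
    emails: set = field(default_factory=set)


def normalise(email):
    """Trim whitespace and lowercase the address (an assumed matching rule)."""
    return email.strip().lower()


def build_email_index(people):
    """Map each normalised address to the id of the person who owns it."""
    index = {}
    for person in people:
        for email in person.emails:
            index[normalise(email)] = person.person_id
    return index


def match_registration(registration_email, index):
    """Return the matching person id, or None if the address is unknown."""
    return index.get(normalise(registration_email))
```

With such an index, a participant who registered with a secondary address would still resolve to the same person record; a result of `None` would flag the registration for manual follow-up.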
3. Redesigned Gather map and redirection from Meetecho
In an effort to improve take up of Gather and the value that participants get from it, the Gather map was redesigned to provide specific areas for post-session discussions and hackathon support. In addition, Meetecho was changed so that when a session ends the participant is presented with a choice on whether to return to the agenda or the Gather session, dropping directly into the correct post-session area.
RESULT: Feedback during the meeting indicated that the new map was well received but further improvements are still possible. The redirection from Meetecho was not widely used, as most people close the browser tab or click the exit button before the session officially ends.
4. Meetecho client issue detection and recovery
There was a recommendation from IETF 108 participants that Meetecho could do more to help participants detect and recover from client failures. To address this Meetecho developed a new client monitoring service.
RESULT: The new monitoring service proved to be problematic and was deactivated early in IETF 109 (see Incident #1 below for more details).
Service performance during IETF 109
Based on the same combination of feedback during the meeting, data gathered during the meeting and responses to the post-meeting survey, we have been able to identify what went well and what needs improvement.
What went well
- Redesigned Meetecho user interaction model, as noted above.
- Datatracker improvements to address login issues, as noted above.
- The major service interruptions that were under our direct control were responded to and addressed quickly.
- Information sharing between the NOC (including Meetecho), Tools Team, Secretariat and LLC was excellent.
What did not go so well
- Service reliability was well below expectations, which in the end comes down to a need to significantly improve testing of production services.
- Responsibility for investigating and resolving certain issues was unclear, which meant that some issues were slow to be properly investigated and communication with participants was patchy.
- Automated monitoring services/alarms were insufficient, meaning that technical teams had to rely on manual tests and failure reports.
- The post-meeting survey indicated that satisfaction with our support services was below expectations.
Improvements planned prior to IETF 110
1. Appointment of an integration manager
The NOC Lead will be appointed as the integration manager responsible for ensuring that, prior to the meeting:
- The services are tested as an integrated suite (details below)
- Everyone involved in the delivery of services is kept in the loop at all stages
2. Production service testing
The integration manager will be responsible for ensuring that the integrated suite of services will be tested to meet the following requirements prior to IETF 110:
- Full production stack and full feature set tested
- Tested to the scale/volume/duration expected at IETF 110
- Tested for the range of clients expected at IETF 110
- Tested for failover in previous and possible failure scenarios
3. Appointment of an incident controller
The NOC Lead will be appointed as the incident controller during the IETF 110 meeting, responsible for a formal management process for all reported issues. This process includes ensuring the following happens:
- All reported issues are quickly and appropriately investigated.
- Participants, both those directly affected and as a whole, are communicated with quickly and usefully.
- Services are restored as soon as possible and as correctly as possible.
- A detailed log is kept.
4. Additional external monitoring from NOC
The NOC will implement additional external monitoring (i.e. on the separate NOC monitoring system, not on the production services themselves) of at least the following:
- Jabber (XMPP) service
- OIDC (authentication) service
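As an illustration of what an external probe for these two services might look like, here is a minimal sketch using only the Python standard library. It is not the NOC's actual tooling. It assumes the Jabber service is reachable on the standard XMPP client port (5222) and that the OIDC provider publishes a discovery document at the standard `/.well-known/openid-configuration` path; the function names and the `CheckResult` type are hypothetical.

```python
import json
import socket
import urllib.request
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of one external probe."""
    name: str
    ok: bool
    detail: str


def check_tcp(name, host, port, timeout=5.0):
    """Report whether a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return CheckResult(name, True, "connected")
    except OSError as exc:
        return CheckResult(name, False, str(exc))


def check_oidc_discovery(name, issuer_url, timeout=5.0):
    """Fetch the OIDC discovery document and confirm it lists an authorization endpoint."""
    url = issuer_url.rstrip("/") + "/.well-known/openid-configuration"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            doc = json.load(resp)
    except (OSError, ValueError) as exc:
        # URLError and HTTPError are OSError subclasses; invalid JSON raises ValueError.
        return CheckResult(name, False, str(exc))
    if not isinstance(doc, dict) or "authorization_endpoint" not in doc:
        return CheckResult(name, False, "missing authorization_endpoint")
    return CheckResult(name, True, "discovery document ok")
```

A scheduler on the separate monitoring host could call, for example, `check_tcp("jabber", "xmpp.example.org", 5222)` and `check_oidc_discovery("oidc", "https://auth.example.org")` (both hostnames are placeholders) and raise an alarm whenever `ok` is false. A TCP connect only shows that the port is open; a fuller probe would also complete an XMPP stream negotiation or a test login.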
5. New support system and process
Currently we have two support systems with multiple entry points into them and several different email addresses in use. This will be drastically simplified to one system with one address. In addition, a new support system will be implemented that provides a modern feature set.
This improvement was planned before IETF 109 but could not be implemented in time, so it will now be implemented for IETF 110.