Skip to main content

Incident Report 02-03-2022

Summary#

At 14:20 GMT a deployment to production was completed and users reported they were getting error messages. The root cause of the issue was that a new banner that was added to the page did not consistently contain a URL that was required for it to work. This meant that some users were unable to access residents' pages. The issue was fully resolved at 15:40 GMT and the application was fully functional again.

Timeline#

14:20 GMT - A deployment to production was completed

14:21 GMT - First report of the incident, reports of

{"error":"Error logging in"}

14:26 GMT - Reports of a different error message

“Error message: Cannot read properties of undefined (reading 'auth')”

14:30 GMT - Developers start mobbing to identify and resolve the issue

15:18 GMT - Communication was sent to the Children and Family Service (CFS) to explain an issue has occurred and a fix is being worked on

15:18 GMT - Communication was sent to the Adults Social Care (ASC) service to explain an issue has occurred and a fix is being worked on. (Alex noted that the initial comms bounced back so he had to contact them by an alternative method)

15:40 GMT - A fix was released to production that resolves the issue

15:44 GMT - Communication was sent to CFS to confirm the issue has been resolved

15:44 GMT - Communication was sent to ASC to confirm the issue has been resolved

Root Cause#

During a planned deployment to production, a preview banner was added to the resident view that gave the option to use a new resident view and provide feedback. The link to provide feedback was not consistently populating. Due to this, the resident view was not consistently rendered and some users were experiencing an error stating

“Error message: Cannot read properties of undefined (reading 'auth')”.

Technical Detail#

  • The preview banner that was added used a Link field along with an href attribute.

  • Link fields will attempt to prefetch the href that has been passed. Link fields use a formatUrl function to properly format the URL.

  • The first step of the formatUrl function is to destructure the passed in URL. If no URL is passed to the formatUrl function then the function is unable to destructure this and an error is thrown. This was the error that was seen by users during this incident.

Resolution and Recovery#

  • As an immediate resolution, the new preview banner was removed from the resident view. This resolved the issue for users.
  • This issue with the preview banner meant that certain pages were not loading however no data was lost due to this issue and recovery was not required.

Corrective and preventive measures#

  • The preview banner was reviewed further following the resolution of this incident.
  • The code was changed to use a different field to display the link to the feedback form rather than a Link, this provides a more robust way of showing a link.
  • To prevent a scenario where the feedback form URL was unavailable, the URL was hardcoded into the new field.

This was tested and deployed to production on 03/03/2022.

—---------------------------------------------------------------------------

Incident Commentary#

Setting: remote group mobbing

Screen presenter: Rachel Hutchinson - Made Tech

Scribe: Tony Griffin - Made Tech

—---------------------------------------------------------------------------

2:30 pm onwards

  • Dev team made aware of incident

  • Mobbed from this time onwards

  • Cloudwatch production account was checked for error warnings

  • Tuomo confirmed there were normal amount of errors appearing in Cloudwatch

  • Lawrence asks if we’ve tested the URL that we were sent, ie the “/people/[id]”

Error in console: “Cannot read properties of undefined ( reading ‘auth’ )At Object.t.formatURL ( )"

-- Question: What’s changed recently?

  • Identified newly deployed code has rolled to prod and coincides with the start of the incident (roughly 7 mins before devs were notified of the incident)

-- Question: Is it tested?

  • Lawrence asks if it was tested/has tests written?

Sentry: Check the Main Application

  • The difficulty here is to spot errors without a good source map being available in Sentry

AWS: Errors in lambda logs

Tuomo: Says he’s seeing 403 errors in the front end Lambda logs that he hasn’t seen before

“Lambda runtime failed to post handler success response 403”

-- Question: Has the API key been changed?

  • Lawrence asks if the API key has changed for any reason? - Apparently not

-- Question: Quickest fix a redeployment?

  • Rachel asked if the quickest fix is to redeploy the last working deployment? - Possibly yes

  • Lawrence confirms the error is definitely in the FE as there is no accompanying HTTP error.

  • The error appears due to some kind of object destructuring as per the console error.

  • Due to the “System Error” banner appearing on pages, the error never hits Sentry properly so we don’t get a stack trace of the error.

  • Jack notes that if you change “/people/[id]” to the newer URL of “/residents/[id]” then that works correctly.

  • Jaye says people and residents use the same API endpoint, so have the same data source.

  • He lets us know that 404 errors can be thrown by the app trying to extract worker’s names from their email (So they are expected errors in some cases)

Remove Banner component code

  • (Rachel removed the Banner code changes for the repo at this stage)

  • This is this code that we believe is generating the error

https://github.com/LBHackney-IT/lbh-social-care-frontend/pull/870

3:08 pm onwards

Same error reported previously

  • Lawrence notes that the same error has been noted before - the error is definitely due to the <Link> tag & how it does restructuring on the URL object.

(Happening in the pre-build phase when deploying to Vercel’s set up) in the static content build possibly?

https://github.com/vercel/next.js/issues/30028

-- Question: Need for FEEDBACK_URL?

  • Rachel asks if we need the NEXT_PUBLIC_FEEDBACK_URL? (To be decided)

  • At this point we know it's definitely one of the links on the pages

-- Question: Why does it work on Staging but not Production?

  • Jaye notes that this works on staging but not on prod.

  • Rachel notes that we don’t see this preview banner for an Adult on staging.

  • This leads to investigating the feature flag set up later

Error thrown from inside node modules

  • Lawrence identifies the line throwing an error
node_modules/next/dist/shared/lib/router/utils/format-url.js ( line:32 )
  • “urlObj" is undefined
  • NextJS’s router uses this in it’s LINK component on its “href” attribute

PreviewBanner component

  • PhaseBanner component uses a straight <a> tag

  • PreviewBanner component uses a <Link> tag and it doesn’t have a “href” set to true, so it tries to do a pre-fetch on the URL & if the URL is undefined it will fail and throw an error.

  • <Link> tags are meant to be used to link to resources internal to the application only, not external links.

This component was removed from the application as we believed it was the source of the issue.

Fix: Deployment to Staging

  • Rachel lined up some people on Staging to see if the deployment had fixed the problem

  • The following is an adults record:

Rachel checked:

“/residents/7” “/residents/7/details”“/residents/7/workflows”
  • Must wait till the deployment changes hit production to confirm the fix

Fix: Deployment to Production

3:35 pm

  • Person we checked on prod: “/residents/14”

  • Banner component was totally removed

  • Contacted Alex at this time to let him know.

Notes on types of people records checked

Adult: “/people/34”  Foo BarChild: “/people/19”   Cristobal Cawdell (I believe to be a Child  record, redirects to “/residents/19”) (Not sure on this?)

Location of error in the PreviewBanner component

/components/NewPersonView/PreviewBanner.tsx     (Line: 32)

Feature flag setup

  • We noted that we don’t see this preview banner for an Adult on staging.

  • This picks up on Rachel’s earlier comment and leads to investigating the feature flag & seeing that we can’t see the Banner component in Staging as the feature flag was set up to only be active on production.

Issues Identified after the Deployment

  • PreviewBanner component needs to be rewritten so it uses a straight a tag and removes the tag.

  • Work out why the env var NEXT_PUBLIC_FEEDBACK_URL isn’t set when the app is built?. Lawrence suggests making it a non NEXT_PUBLIC env var in CircleCi

  • Sort error boundaries within the Next’s app - to allow Sentry to retrieve errors

  • Tuomo suggests hardcoding the FEEDBACK_URL as the form is restricted to Hackney staff.

  • We confirmed that this was the case by having Jack try to access the FEEDBACK_URL form using his Made Tech account and was unable to.

  • We hardcoded the FEEDBACK_URL in the PhaseBanner and PreviewBanner components & remove reference to it in the serverless.yml file in the application.

  • Deleted NEXT_PUBLIC_FEEDBACK_URL from CircleCI & the AWS Staging and Prod environments

Follow up Commits

  • Re-added: Preview Banner - to “pages/people/[id]” & rewrote component to use an <a> tag.

https://github.com/LBHackney-IT/lbh-social-care-frontend/pull/871/files

  • Update feature flag - to show everywhere except for on production environment

Action points#

  • Ensure all code is supported by having tests written for it.
  • Breaking changes should show up when a piece of code has tests written around it.
  • Create a pull request and have another engineer look over the proposed changes
  • Before deploying - Always send a message in the relevant channels that you would like to do a deployment and check how this may affect the work of others.
  • The current deployment process will push all code that is ready to be deployed, not just yours.
  • Clear channels for communications to both the CSF and ASC services should be set up and verified to enable quick correspondence between us and the services should there be need to contact them quickly in the future.