The longload application in the reliability-review project fails to start. Diagnose and then fix the issue. The application needs 512 MiB of memory to work.

After you fix the issue, you can confirm that the application works by running the ~/DO180/labs/reliability-review/curl_loop.sh script that the lab command prepared. The script sends requests to the application in a loop. For each request, the script displays the pod name and the application status. Press Ctrl+C to quit the script.

Log in to the OpenShift cluster.
[student@workstation ~]$ oc login -u developer -p developer \
  https://api.ocp4.example.com:6443
Login successful.
...output omitted...

Set the reliability-review project as the active project.

[student@workstation ~]$ oc project reliability-review
...output omitted...

List the pods in the project. The pod is in the Pending status. The name of the pod on your system probably differs.

[student@workstation ~]$ oc get pods
NAME                        READY   STATUS    RESTARTS   AGE
longload-64bf8dd776-b6rkz   0/1     Pending   0          8m1s

Retrieve the events for the pod. No compute node has enough memory to accommodate the pod.
[student@workstation ~]$ oc describe pod longload-64bf8dd776-b6rkz
Name:         longload-64bf8dd776-b6rkz
Namespace:    reliability-review
...output omitted...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  8m    default-scheduler  0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Review the resource requests for memory. The longload deployment requests 8 GiB of memory.

[student@workstation ~]$ oc get deployment longload -o \
  jsonpath='{.spec.template.spec.containers[0].resources.requests.memory}{"\n"}'
8Gi

Set the memory requests to 512 MiB. Ignore the warning message.
[student@workstation ~]$ oc set resources deployment/longload \
  --requests memory=512Mi
deployment.apps/longload resource requirements updated
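For reference, the oc set resources command edits the resources stanza of the container in the deployment's pod template. A minimal sketch of the resulting section, assuming the container is named longload, looks like this:

spec:
  template:
    spec:
      containers:
      - name: longload        # assumed container name
        resources:
          requests:
            memory: 512Mi     # the scheduler now reserves 512 MiB instead of 8 GiB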
Wait for the pod to start. You might have to rerun the command several times for the pod to report a Running status. The name of the pod on your system probably differs.

[student@workstation ~]$ oc get pods
NAME                        READY   STATUS    RESTARTS   AGE
longload-5897c9558f-cx4gt   1/1     Running   0          86s

Run the ~/DO180/labs/reliability-review/curl_loop.sh script to confirm that the application works.

[student@workstation ~]$ ~/DO180/labs/reliability-review/curl_loop.sh
1 curl: (7) Failed to connect to master01.ocp4.example.com port 30372: Connection refused
2 longload-5897c9558f-cx4gt: app is still starting
3 longload-5897c9558f-cx4gt: app is still starting
4 longload-5897c9558f-cx4gt: app is still starting
5 longload-5897c9558f-cx4gt: Ok
6 longload-5897c9558f-cx4gt: Ok
7 longload-5897c9558f-cx4gt: Ok
8 longload-5897c9558f-cx4gt: Ok
...output omitted...

Press Ctrl+C to quit the script.
When the application scales up, your customers complain that some requests fail. To replicate the issue, manually scale up the longload application to three replicas, and run the ~/DO180/labs/reliability-review/curl_loop.sh script at the same time.

The application takes seven seconds to initialize. The application exposes the /health API endpoint on HTTP port 3000. Configure the longload deployment to use this endpoint, to ensure that the application is ready before serving client requests.

Open a new terminal window and run the ~/DO180/labs/reliability-review/curl_loop.sh script.

[student@workstation ~]$ ~/DO180/labs/reliability-review/curl_loop.sh
1 longload-5897c9558f-cx4gt: Ok
2 longload-5897c9558f-cx4gt: Ok
3 longload-5897c9558f-cx4gt: Ok
4 longload-5897c9558f-cx4gt: Ok
...output omitted...

Leave the script running and do not interrupt it.
Scale up the application to three replicas.
[student@workstation ~]$ oc scale deployment/longload --replicas 3
deployment.apps/longload scaled

Watch the output of the curl_loop.sh script in the second terminal. Some requests fail because OpenShift sends requests to the new pods before the application is ready.

...output omitted...
22 longload-5897c9558f-cx4gt: Ok
23 longload-5897c9558f-cx4gt: Ok
24 longload-5897c9558f-cx4gt: Ok
25 curl: (7) Failed to connect to master01.ocp4.example.com port 30372: Connection refused
26 curl: (7) Failed to connect to master01.ocp4.example.com port 30372: Connection refused
27 longload-5897c9558f-cx4gt: Ok
28 curl: (7) Failed to connect to master01.ocp4.example.com port 30372: Connection refused
29 longload-5897c9558f-cx4gt: Ok
30 curl: (7) Failed to connect to master01.ocp4.example.com port 30372: Connection refused
31 longload-5897c9558f-tpssf: app is still starting
32 longload-5897c9558f-kkvm5: app is still starting
33 longload-5897c9558f-cx4gt: Ok
34 longload-5897c9558f-tpssf: app is still starting
35 longload-5897c9558f-tpssf: app is still starting
36 longload-5897c9558f-tpssf: app is still starting
37 longload-5897c9558f-cx4gt: Ok
38 longload-5897c9558f-tpssf: app is still starting
39 longload-5897c9558f-cx4gt: Ok
40 longload-5897c9558f-cx4gt: Ok
...output omitted...
Leave the script running and do not interrupt it.
Add a readiness probe to the longload deployment. Ignore the warning message.

[student@workstation ~]$ oc set probe deployment/longload --readiness \
  --initial-delay-seconds 7 \
  --get-url http://:3000/health
deployment.apps/longload probes updated
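For reference, the oc set probe command adds a readinessProbe to the container in the deployment's pod template. A minimal sketch of the resulting section, assuming the container is named longload, looks like this:

spec:
  template:
    spec:
      containers:
      - name: longload              # assumed container name
        readinessProbe:
          httpGet:
            path: /health           # application health endpoint
            port: 3000
          initialDelaySeconds: 7    # matches the seven-second initialization time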
Scale down the application back to one pod.

[student@workstation ~]$ oc scale deployment/longload --replicas 1
deployment.apps/longload scaled

If scaling down breaks the curl_loop.sh script, then press Ctrl+C to stop the script in the second terminal. Then, restart the script.

To test your work, scale up the application to three replicas again.

[student@workstation ~]$ oc scale deployment/longload --replicas 3
deployment.apps/longload scaled

Watch the output of the curl_loop.sh script in the second terminal. No request fails.

...output omitted...
92 longload-7ddcc9b7fd-72dtm: Ok
93 longload-7ddcc9b7fd-72dtm: Ok
94 longload-7ddcc9b7fd-72dtm: Ok
95 longload-7ddcc9b7fd-qln95: Ok
96 longload-7ddcc9b7fd-wrxrb: Ok
97 longload-7ddcc9b7fd-qln95: Ok
98 longload-7ddcc9b7fd-wrxrb: Ok
99 longload-7ddcc9b7fd-72dtm: Ok
...output omitted...
Press Ctrl+C to quit the script.
Configure the application so that it automatically scales up when the average memory usage is above 60% of the memory requests value, and scales down when the usage is below this percentage. The minimum number of replicas must be one, and the maximum must be three. The resource that you create for scaling the application must be named longload.

The lab command provides the ~/DO180/labs/reliability-review/hpa.yml resource file as an example. Use the oc explain command to learn the valid parameters for the hpa.spec.metrics.resource.target attribute. Because the file is incomplete, you must update it first if you choose to use it.

To test your work, use the oc exec deploy/longload -- curl localhost:3000/leak command to send an HTTP request to the application /leak API endpoint. Each request consumes an additional 480 MiB of memory. To free this memory, you can use the ~/DO180/labs/reliability-review/free.sh script.

Before you create the horizontal pod autoscaler resource, scale down the application to one pod.

[student@workstation ~]$ oc scale deployment/longload --replicas 1
deployment.apps/longload scaled

Edit the ~/DO180/labs/reliability-review/hpa.yml resource file. You can retrieve the parameters for the resource attribute by using the oc explain hpa.spec.metrics.resource and oc explain hpa.spec.metrics.resource.target commands.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: longload
  labels:
    app: longload
spec:
  maxReplicas: 3
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: longload
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 60
Use the oc apply command to deploy the horizontal pod autoscaler.

[student@workstation ~]$ oc apply -f ~/DO180/labs/reliability-review/hpa.yml
horizontalpodautoscaler.autoscaling/longload created

In the second terminal, run the watch command to monitor the oc get hpa longload command. Wait for the longload horizontal pod autoscaler to report usage in the TARGETS column. The percentage on your system probably differs.

[student@workstation ~]$ watch oc get hpa longload
Every 2.0s: oc get hpa longload        workstation: Fri Mar 10 05:15:34 2023

NAME       REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
longload   Deployment/longload   13%/60%   1         3         1          75s

Leave the command running and do not interrupt it.
To test your work, run the oc exec deploy/longload -- curl localhost:3000/leak command in the first terminal for the application to allocate 480 MiB of memory.

[student@workstation ~]$ oc exec deploy/longload -- curl -s localhost:3000/leak
longload-7ddcc9b7fd-72dtm: consuming memory!

In the second terminal, after two minutes, the oc get hpa longload command shows the memory increase. The horizontal pod autoscaler scales up the application to more than one replica. The percentage on your system probably differs.

Every 2.0s: oc get hpa longload        workstation: Fri Mar 10 05:19:44 2023

NAME       REFERENCE             TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
longload   Deployment/longload   145%/60%   1         3         2          5m18s
To test your work, run the ~/DO180/labs/reliability-review/free.sh script in the first terminal for the application to release the memory. Ensure that the pod that frees the memory is the same pod that was consuming memory. Execute the free.sh script several times if necessary.

[student@workstation ~]$ ~/DO180/labs/reliability-review/free.sh
longload-7ddcc9b7fd-72dtm: releasing memory!

In the second terminal, after ten minutes, the oc get hpa longload command shows the memory decrease. The horizontal pod autoscaler scales down the application to one replica. The percentage on your system probably differs.

Every 2.0s: oc get hpa longload        workstation: Fri Mar 10 05:19:44 2023

NAME       REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
longload   Deployment/longload   12%/60%   1         3         1          15m28s

Press Ctrl+C to quit the watch command. Close the second terminal when done.