    <style>
    body {
      font-family: monospace;
    }
    </style>
    <style>
.shiki {
  padding: 12px 8px;
  border-radius: 8px;
}
</style>
    <pre>
  <p>Command to run test suite:
How often to report status/results: every 15 minutes or every 3 test suite passes</p>
<p>Goal: Stress-test the full project suite over 100 runs to surface fails and flakes, triage each one as it appears, and leave behind durable evidence and a ranked report — without contaminating the run&#39;s results.</p>
<p>Setup (do first):</p>
<p>Pin the run to the current commit; record the SHA, date, OS, and node/browser versions at the top of the report. The 100 runs must all execute against this one SHA.
Create a results directory. Maintain a live HTML results report there that you rewrite after every batch (don&#39;t rely on chat history — assume your context will be summarized mid-run). Create a failures/ subdir for per-failure evidence.
Capture a resource baseline (disk, free memory, FD count, inotify watches, TCP sockets in TIME_WAIT, process/child-process count, CPU temp/usage, IO). Sample these after each test pass to confirm that they return to a reasonable level and are not increasing monotonically across several passes. Also sample them on a fixed cadence throughout the run, and record the highest, lowest, average, and median values in a resource summary section of the report. Absolute metric values are typically not interesting enough to include in every status update, but they belong in the full report. If you find abnormal values or confirm unexpected monotonic growth, mention this up front in the status update, with details in the full report.
The loop: Run the full suite (&lt;command&gt;) 100 times, sequentially. Number each run. On any fail or flake, keep the loop going and capture evidence immediately into failures/&lt;run#&gt;-&lt;testname&gt;/: exact command, run #, timestamp, git SHA, browser/env, full stdout+stderr, and any screenshots/server logs the harness produced. Never discard a failure&#39;s raw logs in favor of only a summary.</p>
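<p>A minimal sketch of the loop, assuming Python is available on the runner. The function name run_suite_loop and the file names meta.txt, stdout.log, and stderr.log are illustrative and not part of any harness. The &quot;suite&quot; label stands in for the per-test name, which a real run would have to parse from the harness output.</p>

```python
import subprocess
import time
from pathlib import Path


def run_suite_loop(command, runs, results_dir, sha="unknown"):
    """Run `command` sequentially `runs` times and keep going after failures.

    `command` is an argument list; `sha` is the pinned commit recorded as evidence.
    Returns a list of (run number, passed) tuples.
    """
    failures_dir = Path(results_dir) / "failures"
    failures_dir.mkdir(parents=True, exist_ok=True)
    outcomes = []
    for run_no in range(1, runs + 1):
        started = time.strftime("%Y-%m-%dT%H:%M:%S%z")
        proc = subprocess.run(
            command, capture_output=True, text=True, errors="replace"
        )
        passed = proc.returncode == 0
        outcomes.append((run_no, passed))
        if not passed:
            # Assumption: "suite" is a placeholder for the failing test's name.
            evidence = failures_dir / f"{run_no:03d}-suite"
            evidence.mkdir(parents=True, exist_ok=True)
            (evidence / "meta.txt").write_text(
                f"command: {command}\n"
                f"run: {run_no}\n"
                f"timestamp: {started}\n"
                f"sha: {sha}\n"
                f"exit_code: {proc.returncode}\n",
                encoding="utf-8",
            )
            # Keep the raw logs in full, never only a summary.
            (evidence / "stdout.log").write_text(proc.stdout, encoding="utf-8")
            (evidence / "stderr.log").write_text(proc.stderr, encoding="utf-8")
    return outcomes
```

<p>The loop never stops on a failure; it writes the evidence first and then moves on to the next run, so a crash later in the run cannot lose earlier logs.</p>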
<p>Triage (in a separate git worktree so the main run stays on the pinned SHA): For each distinct fail/flake, investigate in parallel and assign one status:</p>
<p>FULLY_UNDERSTOOD_FIX_AVAILABLE: certain of cause and best fix.
CANDIDATE_FIX_AVAILABLE: likely cause and fix, but needs a targeted re-run to confirm (run that in the worktree, not the main loop).
NEEDS_INVESTIGATION: needs separate deeper analysis (e.g. instrument the code, add temporary delays to force timing bugs, raise reproduction rate).
ENVIRONMENT_ISSUE: likely caused by factors outside the tests/code under test.
OTHER: nothing fits; explain why it&#39;s hard to categorize.</p>
<p>Any code change you make to test a fix must be marked &quot;applied at run N&quot; and ideally tested out-of-band; don&#39;t silently mutate the suite mid-run.</p>
<p>Per-flake notes: quantify and look for patterns. For each flake, record pass/fail counts and rate, first-seen and last-seen run #, whether it is deterministic or intermittent, and any dimension that correlates with it: browser-specific (e.g. Firefox only), time-of-day (e.g. only during the UTC/CET date-boundary window), or onset (passed 100% for hours, then suddenly failed often; investigate what changed). If you have tools to search the bug tracker or remote repos not cloned locally, use them during investigation.</p>
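<p>The per-flake numbers can be derived mechanically from the run log. A minimal sketch, assuming each test&#39;s outcomes are kept as a mapping from run number to pass/fail; FlakeStats, summarize_test, and ranked_toplist are hypothetical names used only for illustration.</p>

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FlakeStats:
    name: str
    passes: int
    fails: int
    fail_rate: float
    first_seen: Optional[int]  # first run number that failed
    last_seen: Optional[int]   # last run number that failed
    kind: str                  # "stable", "deterministic", or "intermittent"


def summarize_test(name: str, results: Dict[int, bool]) -> FlakeStats:
    """Summarize one test; `results` maps run number to True (pass) or False (fail)."""
    failed_runs = sorted(run for run, ok in results.items() if not ok)
    fails = len(failed_runs)
    passes = len(results) - fails
    if fails == 0:
        kind = "stable"
    elif passes == 0:
        kind = "deterministic"  # failed on every recorded run
    else:
        kind = "intermittent"   # mixed passes and fails
    return FlakeStats(
        name=name,
        passes=passes,
        fails=fails,
        fail_rate=fails / len(results) if results else 0.0,
        first_seen=failed_runs[0] if failed_runs else None,
        last_seen=failed_runs[-1] if failed_runs else None,
        kind=kind,
    )


def ranked_toplist(stats: List[FlakeStats]) -> List[FlakeStats]:
    """Order failing tests by failure count, most frequent first."""
    failing = [s for s in stats if s.fails > 0]
    return sorted(failing, key=lambda s: s.fails, reverse=True)
```

<p>Correlating dimensions such as browser or time of day are not covered here; they would need extra fields recorded per run.</p>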
<p>Progress reports every &lt;N runs or M minutes&gt;: a ranked toplist of fails/flakes by frequency, each with its status and a one-line note, plus current resource readings if they are unexpected (changing monotonically with each test pass, or simply outside the normal, healthy range). Flag anything growing abnormally, especially leaked background processes that aren&#39;t cleaned up between runs (orphaned web/test servers, zombies); these are a known failure mode and need to be investigated and fixed, not just noted.</p>
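<p>One simple way to spot a metric that keeps growing between runs is to check whether its most recent post-run samples are strictly increasing. A minimal sketch; the function name grows_monotonically and the window of 4 samples are assumptions chosen for illustration, not values taken from these instructions.</p>

```python
def grows_monotonically(samples, min_points=4):
    """Return True if the last `min_points` samples are strictly increasing.

    `samples` holds one reading per completed run, oldest first
    (for example the process count measured after each test pass).
    """
    if min_points > len(samples):
        return False  # not enough data to call it a trend
    window = samples[-min_points:]
    return all(later > earlier for earlier, later in zip(window, window[1:]))
```

<p>A True result is only a trigger for investigation: a metric can rise for several runs and then settle, so confirm a suspected leak (for example by listing orphaned processes) before reporting it as one.</p>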
<p>Final deliverable: the HTML results report, containing the ranked toplist with per-issue status/notes/evidence paths, the resource summary and trend over the run (baseline → typical level between runs, highest/lowest/average/median resource metrics, with any leaks identified), and a prioritized list of recommended fixes. The report should also include the pinned SHA and environment details.</p>
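<p>The resource summary values can be computed per metric from the collected samples. A minimal sketch using only the Python standard library; summarize_metric is a hypothetical helper name, and the samples are assumed to be plain numbers in one consistent unit.</p>

```python
import statistics


def summarize_metric(samples):
    """Return highest, lowest, average, and median for one metric's samples."""
    if not samples:
        raise ValueError("no samples recorded for this metric")
    return {
        "highest": max(samples),
        "lowest": min(samples),
        "average": statistics.fmean(samples),
        "median": statistics.median(samples),
    }
```

<p>Reporting the median next to the average helps here because a single spike during one run can pull the average up while the typical level stays unchanged.</p>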