Running arbitrary user code in the browser sounds simple until you actually do it. The first version of this page handled JavaScript and shipped in an afternoon. Adding Python took two days. Adding Rust took another day, but only because Python had already forced the right abstractions. This is what those abstractions look like, why I picked them, and the three different cost models hiding under the same UI.

The shape of the problem

A user types code into a Monaco editor and clicks Run Tests. We need to:

  • execute their code in isolation from our origin,
  • run a series of expression-based test cases against it,
  • capture stdout/stderr,
  • kill the run if it loops forever, and
  • surface compile or runtime errors as line markers in the editor,

all for three languages with completely different execution stories.

The three execution backends:

  • JavaScript: sandboxed iframe with srcdoc, evaluated in-process.
  • Python: Pyodide (CPython compiled to WASM) loaded into an iframe, also in-process.
  • Rust: no in-browser Rust compiler is viable for a portfolio (compiling with a WASM build of rustc means multi-minute waits per run). Compilation and execution happen on play.rust-lang.org.

The naive read is "three completely different problems." The actual answer turned out to be one dispatcher, one output protocol, and one error-marker shape, with three runners that share applyDone and applyError.

typescript
function dispatchRun(mode: 'tests' | 'code') {
  if (!activeChallenge.value) return
  status.value = 'running'
  lastRunMode.value = mode
  resetOutputs()
  const lang = challengeLang(activeChallenge.value)
  if (lang === 'python') return runPython(mode)
  if (lang === 'rust') return runRust(mode)
  return runJs(mode)
}

Cost model #1: JavaScript, cheap iframes

Creating an iframe with srcdoc and posting a message is on the order of a millisecond. There's no good reason to reuse it across runs, and reuse would actively hurt: a previous run's function add would still be in scope when the next run defines a new one. So the JavaScript runner creates a fresh iframe per run, posts the code, listens for done or error, and tears down.

typescript
function runJs(mode: 'tests' | 'code') {
  const iframe = document.createElement('iframe')
  iframe.sandbox.add('allow-scripts')
  iframe.style.display = 'none'
  iframe.srcdoc = createJsSrcDoc()
  document.body.appendChild(iframe)

  let watchdog: ReturnType<typeof setTimeout> | null = null

  const cleanup = () => {
    if (watchdog) clearTimeout(watchdog)
    window.removeEventListener('message', onMsg)
    iframe.remove()
  }

  const onMsg = (ev: MessageEvent) => {
    if (ev.source !== iframe.contentWindow) return
    const { type, payload } = ev.data || {}
    if (type === 'ready') {
      watchdog = setTimeout(() => {
        applyError({ error: 'Execution timed out, likely an infinite loop.' })
        cleanup()
      }, 4000)
      iframe.contentWindow?.postMessage({ code: code.value, tests, mode }, '*')
    } else if (type === 'done') { applyDone(payload, mode); cleanup() }
    else if (type === 'error') { applyError(payload); cleanup() }
  }
  window.addEventListener('message', onMsg)
}

The watchdog deserves a note. It does not start when we create the iframe; it starts when the iframe sends ready. Iframe boot time is variable, and burning the user's 4-second budget on something they don't control would produce false timeouts. The runner waits for the iframe to confirm it's listening, then starts counting.
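For reference, here is a minimal sketch of what the srcdoc bootstrap could look like. The article doesn't show createJsSrcDoc's body, so everything below is an assumption, reduced to the handshake; the real harness would also run the test expressions and capture console output. The important line is the last one: ready is posted only after the message listener is installed.

```typescript
// Hypothetical sketch of createJsSrcDoc: only the ready -> run -> done/error
// handshake that the parent's watchdog relies on.
function createJsSrcDoc(): string {
  return `<!DOCTYPE html><script>
window.addEventListener('message', (ev) => {
  const { code } = ev.data
  try {
    // new Function keeps user code out of the bootstrap's lexical scope
    new Function(code)()
    parent.postMessage({ type: 'done', payload: { results: [] } }, '*')
  } catch (e) {
    parent.postMessage({ type: 'error', payload: { error: String(e) } }, '*')
  }
})
// posted last, so the parent's 4-second watchdog never counts iframe boot time
parent.postMessage({ type: 'ready' }, '*')
</script>`
}
```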

Cost model #2: Python, long-lived runtime

Pyodide is around 10MB compressed and takes 3–6 seconds to initialize on a cold cache. Per-run iframes would mean every Run Tests click forces a multi-second pause. Unusable. So the Python runner keeps a single iframe alive across runs, with a tiny piece of state (pyReady) tracking whether Pyodide has finished loading.

typescript
const pyIframe = ref<HTMLIFrameElement | null>(null)
const pyReady = ref(false)

function ensurePython(): Promise<HTMLIFrameElement> {
  if (pyIframe.value && pyReady.value) return Promise.resolve(pyIframe.value)
  // create iframe, wait for postMessage 'ready' from the Pyodide bootstrap
}

async function runPython(mode: 'tests' | 'code') {
  const iframe = await ensurePython()

  const watchdog = setTimeout(() => {
    applyError({ error: 'Execution timed out, likely an infinite loop.' })
    destroyPython()
    cleanup()
  }, 4000)

  iframe.contentWindow?.postMessage({ code: code.value, tests, mode }, '*')
}

The catch with reuse: module-level definitions in one run leak into the next. A user who first writes def add and then changes it to def subtract would see the wrong function unless we explicitly isolated namespaces. Pyodide's runPython(code, { globals: ns }) takes any dict as the globals dict; we make a fresh one per run.

typescript
window.addEventListener('message', async (ev) => {
  const { code, tests, mode } = ev.data
  const t0 = performance.now()
  const results: Array<{ ok: boolean; expectedStr: string; actualStr: string }> = []
  // fresh namespace per run: nothing leaks from the previous Run Tests click
  const userNs = pyodide.globals.get('dict')()
  pyodide.runPython(code, { globals: userNs, filename: '<exec>' })

  for (const t of tests) {
    const expectedPy = pyodide.toPy(t.expected)
    const actual = pyodide.runPython(t.code, { globals: userNs })
    const ok = pyEqual(actual, expectedPy)
    results.push({ ok, expectedStr: pyRepr(expectedPy), actualStr: pyRepr(actual) })
    actual?.destroy?.()
    expectedPy?.destroy?.()
  }
  userNs.destroy()

  send('done', { results, runtimeMs: Math.round(performance.now() - t0) })
})

Two details that cost me an hour each:

  • pyodide.toPy(jsValue) converts a JS value into its Python counterpart: null becomes None, arrays become lists, plain objects become dicts. This is what makes expected: null in JS land compare correctly against return None in Python land.
  • Equality goes through Python: pyodide.runPython('__a__ == __b__') uses Python's deep ==, so [1, 2, 3] equals [1, 2, 3] the way a Python developer expects. Doing the comparison on the JS side via JSON.stringify would give the wrong answer for sets, frozensets, and any custom __eq__.

When the watchdog fires for Python, we don't just post a "stop" message; we destroy the iframe. A wedged Pyodide can't be unwedged from outside. The next run pays the cold-start cost again. Acceptable; it only happens after a real infinite loop.

Cost model #3: Rust, remote compile

Compiling Rust in the browser is technically possible (rust-analyzer ships a WASM build, and rustc-codegen-cranelift exists) but the binaries are 50MB+ and per-compile latency is measured in tens of seconds at best. Not worth it. The Rust Playground at play.rust-lang.org is a public Docker-sandboxed compile-and-run service that accepts CORS requests. The runner POSTs source and parses stdout.

The challenge becomes: how do you run multiple tests against user code in a single process invocation? You generate a main() that runs each test and prints structured markers to stdout, then parse them back.

typescript
function createRustSource(userCode: string, tests: Array<{ code: string; expected: any }>): string {
  const harness = tests.map(t => `    {
        let __t = std::time::Instant::now();
        let __res = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| -> String {
            format!("{:?}", { ${t.code} })
        }));
        let __dur = __t.elapsed().as_secs_f64() * 1000.0;
        let __expected_repr: String = format!("{:?}", { ${t.expected} });
        let (ok, actual_repr) = match __res {
            Ok(s) => (s == __expected_repr, s),
            Err(e) => (false, downcast_panic(e)),
        };
        println!("__T__|{}|{}|{}|{}|{:.2}",
            ok,
            __cc_escape(${JSON.stringify(t.code)}),
            __cc_escape(&__expected_repr),
            __cc_escape(&actual_repr),
            __dur);
    }`).join('\n')

  return `${userCode}

fn __cc_escape(s: &str) -> String {
    s.replace('\\\\', "\\\\\\\\")
     .replace('\\n', "\\\\n")
     .replace('|', "\\\\p")
}

fn downcast_panic(e: Box<dyn std::any::Any + Send>) -> String {
    e.downcast_ref::<&str>().map(|s| s.to_string())
        .or_else(|| e.downcast_ref::<String>().cloned())
        .unwrap_or_else(|| "panic".to_string())
}

fn main() {
${harness}
    println!("__TR_END__");
}`
}

A few things going on here:

  • std::panic::catch_unwind wraps each test individually. If one assertion fails with a panic, the rest still run. AssertUnwindSafe is necessary because arbitrary user code probably isn't UnwindSafe; we accept the small risk that internal state could be inconsistent post-panic, since the next test gets a fresh stack frame either way.
  • format!("{:?}", value) on both sides means we compare strings. That sounds wrong (value equality should not go through string formatting), but it's actually the simplest cross-type compare available without making the user write trait bounds. vec![1, 2, 3] debug-formats to "[1, 2, 3]" on both sides. String::from("cba") formats to "\"cba\"" on both sides.
  • The watchdog is an AbortController on the fetch, with a generous 25-second timeout because compilation is part of the loop, not just execution.
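A sketch of that request path, assuming the runner targets the playground's /execute endpoint. The article doesn't show this code; the payload shape below matches the public playground's JSON API, and the function name plus the injectable fetchFn parameter are mine (the latter purely for testability).

```typescript
// Sketch: POST the generated source to the playground and return stdout.
// The AbortController doubles as the 25s watchdog, covering compile time too.
async function execOnPlayground(
  source: string,
  fetchFn: typeof fetch = fetch,
): Promise<string> {
  const ctrl = new AbortController()
  const watchdog = setTimeout(() => ctrl.abort(), 25_000)
  try {
    const res = await fetchFn('https://play.rust-lang.org/execute', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        channel: 'stable', mode: 'debug', edition: '2021',
        crateType: 'bin', tests: false, code: source,
      }),
      signal: ctrl.signal,
    })
    const data = await res.json()
    if (!data.success) throw new Error(data.stderr)  // compile or runtime failure
    return data.stdout
  } finally {
    clearTimeout(watchdog)
  }
}
```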

The escape that keeps the parser honest

The Rust harness prints lines like __T__|true|add(2, 3)|5|5|0.42. What if a test's expected value contains a pipe? Or a newline? Or a backslash? Without escaping, a vec!["a|b"] expected value would split into the wrong number of fields and corrupt the parse.

typescript
function unescapeCc(s: string): string {
  let out = ''
  for (let i = 0; i < s.length; i++) {
    if (s[i] === '\\' && i + 1 < s.length) {
      const nx = s[i + 1]
      if (nx === '\\') { out += '\\'; i++; continue }
      if (nx === 'n') { out += '\n'; i++; continue }
      if (nx === 'p') { out += '|'; i++; continue }
    }
    out += s[i]
  }
  return out
}

I considered hand-rolling JSON serialization on the Rust side instead. It would have worked, but it would have pulled in serde_json as a dependency on the playground (allowed, but it slows compilation by a few seconds), and the structured-marker format is easier to eyeball when debugging compilation failures. Three escapes, one decoder, no crate dependencies.
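The decode side of the protocol never appears above, so here is a sketch of what the stdout parser might look like (parseRustOutput is my name, not the article's; the inline unescape collapses unescapeCc into one regex). Splitting on | is safe precisely because the harness escaped every literal pipe before printing.

```typescript
// Parse the __T__|ok|code|expected|actual|ms lines back into results.
function parseRustOutput(stdout: string) {
  // same three escapes as unescapeCc: \\ -> \, \n -> newline, \p -> |
  const unescape = (s: string) =>
    s.replace(/\\(.)/g, (_, c) => (c === 'n' ? '\n' : c === 'p' ? '|' : c))
  const results: Array<{
    ok: boolean; code: string; expectedStr: string; actualStr: string; runtimeMs: number
  }> = []
  for (const line of stdout.split('\n')) {
    if (line === '__TR_END__') break            // everything after is noise
    if (!line.startsWith('__T__|')) continue    // the user's own println! output
    const [, ok, code, expected, actual, ms] = line.split('|')
    results.push({
      ok: ok === 'true',
      code: unescape(code),
      expectedStr: unescape(expected),
      actualStr: unescape(actual),
      runtimeMs: parseFloat(ms),
    })
  }
  return results
}
```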

Three error formats, one editor marker

The most surprising win was that all three error stories collapsed to the same Monaco marker shape. Each runner parses its own diagnostic format and feeds the same struct.

typescript
function applyError(payload: any) {
  status.value = 'fail'
  errorMessage.value = payload.error
  if (payload.loc?.line) {
    markers.value = [{
      startLineNumber: payload.loc.line,
      endLineNumber: payload.loc.line,
      message: payload.error,
      severity: 'error',
    }]
  }
}

The user sees a red squiggle on the offending line of their code, regardless of whether the error came from a JavaScript ReferenceError, a Python NameError, or a Rust error[E0425]: cannot find value. They didn't have to learn three different mental models; we did the work once.
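As one concrete example of that parsing, rustc diagnostics carry a --> src/main.rs:line:col span, and because createRustSource places the user's code first in the generated file, a reported line inside the user's code maps straight onto an editor line with no offset. A hedged sketch (rustErrorLoc is my name; the article doesn't show this helper):

```typescript
// Pull the first source location out of rustc stderr for the Monaco marker.
function rustErrorLoc(stderr: string): { line: number; col: number } | null {
  const m = stderr.match(/--> src\/main\.rs:(\d+):(\d+)/)
  return m ? { line: parseInt(m[1], 10), col: parseInt(m[2], 10) } : null
}
```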

What I'd do differently

  • Move Python into a Web Worker, not just an iframe. The current setup gets origin isolation but not thread isolation. A runaway Python loop blocks the iframe's main thread, which is fine because we kill the iframe, but a Worker would let us terminate() without losing the Pyodide runtime warmup for the rest of the page.
  • Self-host a Rust sandbox if traffic ever justifies it. The Playground API is free and supports CORS, but it's rate-limited and undocumented as a stable contract. For a portfolio it's perfect; for a real product I'd run a Docker-based compiler service behind a queue.
  • Move the test schema to per-language test definitions instead of one shared array. Right now tests[i].expected is "JS value for JS/Python, Rust expression string for Rust", distinguishable by the parent challenge's language field. It works, but a RustChallenge | JsChallenge | PythonChallenge discriminated union would be more honest.
  • Add a "compare-by-JSON" mode for Rust as an alternative to Debug-format string equality, for cases where two values are equal but format differently (e.g. HashMap iteration order). Not needed for the seeded challenges, but it'd come up.

The takeaway

The interesting thing isn't any single piece; iframes, Pyodide, and the Playground API are all well-known. The interesting thing is that three completely different execution stories ended up sharing one output protocol, one watchdog interface, one error-marker shape, and one status pill in the UI. The user clicks Run Tests, sees a result, and doesn't have to know whether their code ran in an iframe, in WASM, or on a server in a different country. That's what makes it feel like a single feature instead of three.