Hugo Fund Formation Platform
14 Jun, 03:53 CET

Parser

Current ISL/MSL parser split, queue semantics, and review boundary.

Parser modules

  • src/services/ingest.ts (source upload + queue dispatch)
  • src/queues/parse-queue.ts (queue entrypoint)
  • src/services/parse.ts (orchestrator, claim, attempt lifecycle)
  • src/services/document-parser.ts (DOCX text extraction + single-pass classifier + prompt plumbing)
  • src/services/isl-classifier.ts (one-call classifier wrapper + error normalization)
  • src/services/isl-parse-context.ts (LP/fund context and coverage math)
  • src/services/parse-persistence.ts (draft writes + parse-state transition)
  • src/services/review.ts (human confirmation and live promotion)

Important boundary

The parser output is draft-only. It does not create operative side-letter applicability. Review confirmation is the boundary that writes live clauses and updates commitment_clause_assignments.
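
A minimal sketch of that boundary, using in-memory arrays in place of the real tables. The function name promoteConfirmedDrafts, the field names, and the live-<draftId> id scheme are hypothetical and only illustrate that live clauses and assignments are written at confirmation time, never by the parser.

```typescript
// Hypothetical shapes; the real columns are not documented here.
interface DraftClause {
  draftId: string;
  commitmentId: string;
  text: string;
  reviewed: boolean;
}

interface LiveClause {
  clauseId: string;
  text: string;
}

interface ClauseAssignment {
  commitmentId: string;
  clauseId: string;
}

// Stand-in for the live clause table and commitment_clause_assignments.
interface LiveStore {
  clauses: LiveClause[];
  assignments: ClauseAssignment[];
}

// Promote only drafts the reviewer confirmed; skip ones already reviewed.
// Returns the number of drafts promoted by this call.
function promoteConfirmedDrafts(
  drafts: DraftClause[],
  confirmedIds: string[],
  live: LiveStore
): number {
  let promoted = 0;
  for (const draft of drafts) {
    const confirmed = confirmedIds.indexOf(draft.draftId) !== -1;
    if (!confirmed || draft.reviewed) continue;
    const clauseId = "live-" + draft.draftId; // assumed id scheme
    live.clauses.push({ clauseId: clauseId, text: draft.text });
    live.assignments.push({ commitmentId: draft.commitmentId, clauseId: clauseId });
    draft.reviewed = true;
    promoted++;
  }
  return promoted;
}
```

Calling it twice with the same confirmed ids promotes nothing the second time, which mirrors the idea that a draft crosses the boundary once.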

Current pipeline

| # | Stage | What happens |
| --- | --- | --- |
| 1 | Upload | POST /funds/:fundSlug/upload accepts a DOCX via upload-handler and normalizes file metadata. |
| 2 | Ingest | ingestUpload validates hash/duplicates, writes a source row with parse_state='queued', puts bytes to R2, and sends PARSE_QUEUE. |
| 3 | Queue | handleParseQueue consumes the message, runs parseDocumentJob, and retries transient failures. |
| 4 | Attempt claim | Parser opens document_parse_attempts (append-like log), claims source row by swapping parse_state from queued to parsing, and stamps parse_started_at. |
| 5 | Cleanup | Previous draft artifacts for the same source doc are deleted idempotently before running the classifier. |
| 6 | Extract + classify | R2 payload is extracted and passed through the single-pass classifier in document-parser.ts. |
| 7 | Context | resolveIslParsedContext resolves LP/fund hints, computes match fields, and computes uncovered paragraph count. |
| 8 | Persist draft | Draft clauses are written in clause_intake_drafts with source spans in clause_intake_sources and rows in document_extracted_paragraphs. |
| 9 | Attempt close | Attempt row is closed as success, permanent_failure, or transient_failure. Source row transitions to parsed only on success. |
| 10 | Review | Review routes expose draft rows and call confirmDraftClauses only on submit, then set reviewed rows and promote to live. |
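
The attempt-claim stage is a compare-and-swap on parse_state: the claim succeeds only if the row is still queued. The sketch below shows that idea against an in-memory record; claimSource and the camelCase field names are hypothetical, and a real implementation would presumably do the swap in a single conditional UPDATE.

```typescript
type ParseState = "queued" | "parsing" | "parsed" | "parse_failed" | "dead_lettered";

interface SourceRow {
  id: string;
  parseState: ParseState;
  parseStartedAt: number | null; // epoch milliseconds, null until claimed
}

// Claim a source row for parsing. Returns true only for the caller that
// observes parse_state = 'queued' and flips it to 'parsing'.
function claimSource(
  rows: Record<string, SourceRow | undefined>,
  id: string,
  now: number
): boolean {
  const row = rows[id];
  if (!row || row.parseState !== "queued") {
    return false; // missing, or already claimed / finished by someone else
  }
  row.parseState = "parsing";
  row.parseStartedAt = now;
  return true;
}
```

A duplicate queue delivery for the same source would see parsing rather than queued and back off, which is what makes the claim safe to attempt more than once.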

Failure model

| Failure class | Examples | Result |
| --- | --- | --- |
| Permanent | Missing R2 object, malformed DOCX, classifier syntax violations, FK violation | source_row.parse_state='parse_failed', attempt marked permanent_failure; queue returns success (no further automatic retry). |
| Transient | AI timeout/network, runtime interruption, non-FK parser exceptions | source_row.parse_state='queued', attempt marked transient_failure; queue retries or DLQ depending on retry policy. |
| DLQ | Repeated transient failures | Dead-letter messages set source state to dead_lettered and close open attempts. |
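
The permanent/transient split above can be sketched as a small mapping from an error kind to the resulting attempt status, source state, and retry decision. The ErrorKind labels and the outcomeForFailure helper are hypothetical names chosen for illustration; only the resulting states come from the failure model.

```typescript
// Hypothetical labels for the example failures listed in the failure model.
type ErrorKind =
  | "missing_r2_object"
  | "malformed_docx"
  | "classifier_syntax_violation"
  | "fk_violation"
  | "ai_timeout"
  | "network"
  | "runtime_interruption"
  | "other_parser_exception";

interface FailureOutcome {
  attemptStatus: "permanent_failure" | "transient_failure";
  sourceState: "parse_failed" | "queued";
  retry: boolean; // false means the queue handler reports success (no retry)
}

const PERMANENT_KINDS: ErrorKind[] = [
  "missing_r2_object",
  "malformed_docx",
  "classifier_syntax_violation",
  "fk_violation",
];

// Anything not explicitly permanent is treated as transient, matching the
// "non-FK parser exceptions" row of the failure model.
function outcomeForFailure(kind: ErrorKind): FailureOutcome {
  if (PERMANENT_KINDS.indexOf(kind) !== -1) {
    return { attemptStatus: "permanent_failure", sourceState: "parse_failed", retry: false };
  }
  return { attemptStatus: "transient_failure", sourceState: "queued", retry: true };
}
```

Defaulting unknown errors to transient leans on the retry policy and the DLQ as the backstop, instead of marking a document parse_failed on the first unexpected exception.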

Scheduled recovery thresholds

  • Rows stuck in queued for more than 15 minutes are re-enqueued (src/queues/scheduled-sweep.ts).
  • Rows stuck in parsing with parse_started_at older than 30 minutes are moved back to queued and re-enqueued.
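
A rough sketch of how a sweep could select rows under these two thresholds. The planSweep helper, the action labels, and the queuedAt field are assumptions: the text does not say which timestamp measures the age of a queued row, so a hypothetical one stands in here.

```typescript
const QUEUED_STALE_MS = 15 * 60 * 1000;  // 15 minutes
const PARSING_STALE_MS = 30 * 60 * 1000; // 30 minutes

interface SweepCandidate {
  id: string;
  parseState: string;
  queuedAt: number;              // assumed timestamp, epoch milliseconds
  parseStartedAt: number | null; // stamped when the row is claimed
}

interface SweepAction {
  id: string;
  action: "reenqueue" | "reset_to_queued_and_reenqueue";
}

// Decide what the scheduled sweep should do for each row at time `now`.
function planSweep(rows: SweepCandidate[], now: number): SweepAction[] {
  const actions: SweepAction[] = [];
  for (const row of rows) {
    if (row.parseState === "queued" && now - row.queuedAt > QUEUED_STALE_MS) {
      actions.push({ id: row.id, action: "reenqueue" });
    } else if (
      row.parseState === "parsing" &&
      row.parseStartedAt !== null &&
      now - row.parseStartedAt > PARSING_STALE_MS
    ) {
      actions.push({ id: row.id, action: "reset_to_queued_and_reenqueue" });
    }
  }
  return actions;
}
```

Keeping the decision in a pure function separates threshold logic from the database and queue calls, so the thresholds can be checked without either.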

Classifier prompt shape

The parser still uses a one-pass classification flow built in code via buildIslSinglePassSystemPrompt / buildMslSinglePassSystemPrompt in document-parser.ts. The generated prompt, plus catalogue context, is applied to each queued source upload.
