Task 1 Ambiguity Handoff Addendum
Goal
Resolve the Task 1 blocker by moving ambiguous GitLab and SourceForge URL handling out of pure taxonomy heuristics and into provider-aware resolution.
Problem Restatement
The blocker is not that the classifier is missing a few more path rules.
The blocker is that some provider-hosted URL shapes do not carry enough information to determine final install semantics from path shape alone.
Two cases are responsible for the review churn:
- GitLab deep paths where a segment may be either a subgroup slug or a resource-like segment
- SourceForge files/.../download paths where the same suffix can represent either a concrete file download or a folder-style endpoint
Trying to settle those cases in resolve_query(...) forces the code into a false choice:
- accept ambiguous inputs too early and misclassify them
- reject provider-owned inputs too early and lose useful context
Design Decision
Adopt an ambiguity handoff model.
That means:
- the classifier remains authoritative only for cases it can determine with high confidence
- ambiguous provider-hosted inputs are preserved as provider-owned candidates rather than flattened into Unsupported
- provider adapters become the layer that decides whether an ambiguous input is:
- a supported repository or project source
- a supported exact download form
- a supported source with no installable artifact
- truly unsupported for that provider
Contract Boundary
Classification policy
The classifier should use a strict positive-matching contract.
Each input shape must land in exactly one of three buckets:
- accept as a definite supported source
- accept as an explicit provider-owned candidate
- reject as unsupported
This means the classifier should prefer a small allowlist of accepted shapes over an expanding catalog of bespoke rejection rules.
Negative rules are still allowed when needed to protect a known false-positive family, but they are defensive exceptions, not the main design strategy.
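A minimal sketch of the positive-matching contract, assuming the host has already been matched as GitLab. The `Bucket` enum and `classify_gitlab_path` function are hypothetical names, and treating `<group>/<repo>` as the only definite shape is an assumption made for illustration:

```rust
/// Hypothetical three-bucket result; names are illustrative only.
#[derive(Debug, PartialEq)]
enum Bucket {
    /// Accept as a definite supported source.
    Definite,
    /// Accept as an explicit provider-owned candidate.
    Candidate,
    /// Reject as unsupported.
    Unsupported,
}

/// Classify a GitLab URL path by matching against an allowlist of shapes.
/// No network access and no guessing: unknown shapes fall through to rejection.
fn classify_gitlab_path(path: &str) -> Bucket {
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    match segments.as_slice() {
        // Allowlisted concrete shape (assumed): <group>/<repo>
        [_group, _repo] => Bucket::Definite,
        // Allowlisted candidate shape: <group>/<subgroup>/releases/<repo>
        [_group, _subgroup, "releases", _repo] => Bucket::Candidate,
        // Everything else is rejected without a bespoke rejection rule.
        _ => Bucket::Unsupported,
    }
}
```

The default arm is what keeps the rule set small: a new unsupported page family needs no new code, only a new accepted shape does.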
Classification must do
- identify definite GitHub, GitLab, SourceForge, direct URL, and file inputs
- accept only explicitly enumerated concrete shapes or explicitly enumerated candidate shapes
- preserve canonical locator hints when they are certain
- preserve enough raw path context for later provider-specific disambiguation
- continue classifying concrete artifact URLs as DirectUrl when the classifier can say so confidently
Classification must not do
- grow by accumulating one-off rejection rules for every unsupported provider page family
- guess whether a GitLab deep path is a subgroup path or a resource page when the path shape is ambiguous
- guess whether a SourceForge nested files/.../download path is a file or folder endpoint when the path shape is ambiguous
- perform provider-specific network discovery
Resolver layer must do
- own final interpretation of ambiguous provider-hosted inputs
- return structured outcomes through the adapter contract
- keep UnsupportedSource reserved for sources the adapter genuinely does not own
- use NoInstallableArtifact for provider-owned inputs that are valid but not installable under current scope
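The outcome rules above can be sketched as a small decision function. The variant names follow the outcomes this addendum mentions. The `AdapterOutcome` type, its `locator` field, and the `outcome_for` helper are assumptions for illustration, not the existing adapter contract:

```rust
/// Hypothetical structured outcome returned through the adapter contract.
#[derive(Debug, PartialEq)]
enum AdapterOutcome {
    /// The adapter owns the source and found something installable.
    Resolved { locator: String },
    /// The adapter owns the source, but nothing is installable under current scope.
    NoInstallableArtifact,
    /// The adapter does not own the source at all.
    UnsupportedSource,
}

/// Map ownership and artifact availability to an outcome.
/// Ownership is checked first so that owned inputs never degrade to UnsupportedSource.
fn outcome_for(adapter_owns_source: bool, artifact: Option<String>) -> AdapterOutcome {
    match (adapter_owns_source, artifact) {
        (false, _) => AdapterOutcome::UnsupportedSource,
        (true, Some(locator)) => AdapterOutcome::Resolved { locator },
        (true, None) => AdapterOutcome::NoInstallableArtifact,
    }
}
```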
Proposed Source Model Adjustment
Introduce an explicit handoff shape for ambiguous provider-owned inputs.
The minimal acceptable form is:
- preserve the original locator
- preserve provider ownership
- preserve any canonical parts that are certain
- add a signal that provider resolution is still required before install semantics are known
This can be modeled either as:
- a dedicated ambiguity marker on SourceRef
- additional normalized kinds representing provider-owned unresolved candidates
The preferred direction is additional normalized kinds, because they keep the ambiguity visible in tests and logs without adding a free-form boolean that can drift.
Illustrative shapes:
- NormalizedSourceKind::GitLabCandidate
- NormalizedSourceKind::SourceForgeCandidate
The exact enum names are secondary. The important part is making unresolved provider ownership explicit.
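As a rough sketch of the preferred direction, the candidate kinds could sit alongside the definite kinds. Only `NormalizedSourceKind`, `SourceRef`, and the two candidate variants come from this addendum. The remaining variant names, the `SourceRef` fields, and `needs_provider_resolution` are hypothetical:

```rust
/// Normalized kinds, with unresolved provider ownership made explicit.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NormalizedSourceKind {
    GitHub,
    GitLab,
    SourceForge,
    DirectUrl,
    File,
    /// GitLab-owned, but install semantics are not yet known.
    GitLabCandidate,
    /// SourceForge-owned, but install semantics are not yet known.
    SourceForgeCandidate,
}

impl NormalizedSourceKind {
    /// True when a provider adapter must still interpret the input.
    fn needs_provider_resolution(self) -> bool {
        matches!(self, Self::GitLabCandidate | Self::SourceForgeCandidate)
    }
}

/// Hypothetical field layout: keeps the original locator plus any certain canonical parts.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SourceRef {
    kind: NormalizedSourceKind,
    original_locator: String,
    canonical_locator: Option<String>,
}
```

Because the handoff signal is derived from the kind, it cannot disagree with the kind the way a separate boolean field could.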
Provider Responsibilities
GitLab
GitLab adapter logic should decide whether a GitLab-owned ambiguous input is:
- a valid repository locator
- a release-like source with concrete version semantics
- a provider-owned but non-installable resource page
- unsupported because it does not fit the adapter's supported contract
Initial scope should stay narrow:
- keep current definite repository and release-like support
- add only one or two ambiguous deep-path cases as a first expansion slice
- do not try to solve every GitLab resource URL family at once
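One way to picture the GitLab decision is an adapter function that receives provider knowledge as an injected check. Everything here is an assumption for illustration: the `GitLabOutcome` type, the `is_project` closure standing in for whatever provider lookup the adapter uses, and the choice of the joined path as the canonical locator:

```rust
/// Hypothetical outcome of resolving a GitLab-owned candidate.
#[derive(Debug, PartialEq)]
enum GitLabOutcome {
    /// The deep path names a repository.
    Repository { canonical: String },
    /// Provider-owned resource page with nothing installable.
    NoInstallableArtifact,
    /// Does not fit the adapter's supported contract.
    Unsupported,
}

/// Resolve a candidate path. `is_project` is a stand-in for provider-side
/// knowledge; this decision lives in the adapter, never in the classifier.
fn resolve_gitlab_candidate(path: &str, is_project: impl Fn(&str) -> bool) -> GitLabOutcome {
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segments.len() < 2 {
        // Too short to name a namespace and a project.
        return GitLabOutcome::Unsupported;
    }
    let full_path = segments.join("/");
    if is_project(&full_path) {
        GitLabOutcome::Repository { canonical: full_path }
    } else {
        GitLabOutcome::NoInstallableArtifact
    }
}
```

Injecting the check keeps the adapter testable without network access, which suits the narrow first slice described above.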
SourceForge
SourceForge adapter logic should decide whether a SourceForge-owned ambiguous input is:
- a concrete latest-download install source
- a concrete direct artifact URL
- a provider-owned project or folder view with no installable artifact
- unsupported for current source scope
Initial scope should stay narrow:
- keep bare project URLs as provider-owned and non-installable
- keep files/latest/download as the first concrete repository-backed install source
- add exactly one nested files/.../download ambiguity case to the adapter decision path
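The SourceForge decision path can be sketched with the narrow path rules recorded in this addendum's progress update. The `SourceForgeOutcome` type and `resolve_sourceforge` function are hypothetical, and matching `v*` as "starts with v" is an assumption about how that rule is meant:

```rust
/// Hypothetical outcome of SourceForge adapter resolution.
#[derive(Debug, PartialEq)]
enum SourceForgeOutcome {
    /// A latest-download install source for the named project.
    LatestDownload { project: String },
    /// Provider-owned project or folder view with nothing installable.
    NoInstallableArtifact,
    /// Outside current source scope.
    Unsupported,
}

/// Decide install semantics for a SourceForge-owned path.
fn resolve_sourceforge(path: &str) -> SourceForgeOutcome {
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    match segments.as_slice() {
        // Bare project URL: provider-owned, non-installable.
        ["projects", _project] => SourceForgeOutcome::NoInstallableArtifact,
        // Concrete latest-download form, and the stable-folder candidate
        // that resolves the same way.
        ["projects", project, "files", "latest", "download"]
        | ["projects", project, "files", "releases", "stable", "download"] => {
            SourceForgeOutcome::LatestDownload { project: (*project).to_string() }
        }
        // Version-folder candidate: owned, but no installable artifact.
        ["projects", _project, "files", "releases", version, "download"]
            if version.starts_with('v') =>
        {
            SourceForgeOutcome::NoInstallableArtifact
        }
        _ => SourceForgeOutcome::Unsupported,
    }
}
```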
Testing Strategy
The blocker should be resolved by shifting assertions to the right layer.
Classification tests
Update query_resolution coverage so ambiguous cases assert provider ownership and handoff state instead of asserting final install semantics.
Coverage should be organized around accepted-shape allowlists:
- accepted concrete shapes
- accepted candidate shapes
- a small number of representative false-positive guards
Examples:
- a concrete SourceForge artifact download still classifies as DirectUrl
- a definite GitLab repository form still classifies as GitLab
- an ambiguous GitLab deep path becomes a GitLab-owned candidate, not Unsupported
- an ambiguous SourceForge nested download path becomes a SourceForge-owned candidate, not prematurely direct or unsupported
Adapter contract tests
Add tests that assert adapters make the final decision for ambiguous handoff inputs.
Examples:
- GitLab candidate path resolves to supported repository semantics
- GitLab candidate path resolves to NoInstallableArtifact
- SourceForge candidate path resolves to Resolved
- SourceForge candidate path resolves to NoInstallableArtifact
Install and failure tests
Keep install-flow tests focused on supported concrete outcomes.
Keep failure tests focused on the distinction between:
- unsupported query
- provider-owned source with no installable artifact
- runtime install or transport failure
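To keep the three failure classes distinguishable in assertions, they could be modeled as separate variants with distinct messages. This is a sketch only: the `InstallFailure` type, its fields, and the message wording are hypothetical.

```rust
use std::fmt;

/// Hypothetical failure taxonomy mirroring the three cases above.
#[derive(Debug, PartialEq)]
enum InstallFailure {
    /// No classifier shape or adapter claims the query.
    UnsupportedQuery { query: String },
    /// A provider owns the source, but nothing is installable.
    NoInstallableArtifact { provider: &'static str, locator: String },
    /// Download, transport, or install step failed at runtime.
    Runtime { detail: String },
}

impl fmt::Display for InstallFailure {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::UnsupportedQuery { query } => write!(f, "unsupported query: {}", query),
            Self::NoInstallableArtifact { provider, locator } => {
                write!(f, "{} source has no installable artifact: {}", provider, locator)
            }
            Self::Runtime { detail } => write!(f, "install failed: {}", detail),
        }
    }
}
```

A failure test can then match on the variant rather than on message text, so wording changes do not blur the distinction.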
Incremental Execution Plan
Phase 1: Lock the boundary
- update the design docs to state that classification only decides what it can know with certainty
- record that ambiguous provider-hosted inputs are a resolver concern
Phase 2: Add handoff representation
- extend the source model with explicit provider-candidate semantics
- thread that representation through the query classifier
Phase 3: Shift one GitLab ambiguity case
- add a failing classification test for an ambiguous GitLab deep path
- classify it as a GitLab-owned candidate
- add adapter contract coverage for the GitLab decision
Phase 4: Shift one SourceForge ambiguity case
- add a failing classification test for a nested files/.../download ambiguity case
- classify it as a SourceForge-owned candidate
- add adapter contract coverage for the SourceForge decision
Phase 5: Tighten error reporting
- make sure ambiguous provider-owned inputs that do not yield installable artifacts surface as NoInstallableArtifact
- avoid regressing them into unsupported-query failures
Progress Update
Current implementation status in this branch:
- Phase 1 is complete. The classifier-versus-adapter boundary is now documented explicitly in this addendum.
- Phase 2 is complete. GitLabCandidate and SourceForgeCandidate now exist in the source model and are produced by classification for the narrow ambiguity cases under test.
- Phase 3 is complete for the first GitLab slice. https://gitlab.com/<group>/<subgroup>/releases/<repo> remains a classified candidate, but the GitLab adapter now resolves it as repository semantics with a derived canonical locator.
- Phase 4 is complete for two SourceForge slices. https://sourceforge.net/projects/<project>/files/releases/stable/download remains a classified candidate and now resolves as a provider-owned latest-download source. https://sourceforge.net/projects/<project>/files/releases/v*/download is now preserved as a provider-owned candidate and surfaces as NoInstallableArtifact.
- Phase 5 is partially complete. Provider-owned ambiguous inputs now distinguish unsupported-query failures from no-artifact outcomes, and both GitLab and SourceForge have at least one adapter-owned positive resolution path.
The current intended classifier contract is:
- accept explicit supported shapes
- accept explicit candidate shapes
- reject everything else
That contract is intentionally stricter than heuristic best-effort classification and intentionally narrower than provider resolution.
What remains intentionally out of scope for this slice:
- additional GitLab candidate families beyond the first repository-style deep path
- broader SourceForge folder and version-path families beyond the releases/stable/download and narrow releases/v*/download rules
- any network-backed provider discovery in classification
Success Criteria
This blocker is considered resolved when:
- query_resolution no longer oscillates over ambiguous provider-owned shapes
- ambiguous provider-hosted URLs are no longer forced into final install semantics during classification
- adapters are the only place where ambiguous provider paths are fully interpreted
- failure reporting distinguishes unsupported inputs from provider-owned non-installable inputs
Non-Goals
- solving every ambiguous GitLab deep-path variant in one pass
- solving every SourceForge nested folder or version path in one pass
- introducing network discovery into the pure query classifier
- expanding current supported source scope beyond what the adapter tests can defend clearly