Deduplicate Before Ingestion

The bulk-publishing question every curator runs into eventually: "how do I avoid creating duplicate entities when I re-run my publish script?" The answer is to query the target space first, build a name → ID map, and skip records whose name already exists.

This recipe walks through the full pattern: paginate every entity of a type in the space, build the dedup map, then use it in your ingest loop.

Step 1 — Query everything of the type, paginated

query ListAllOfType($spaceId: UUID!, $typeId: UUID!, $after: Cursor) {
  entitiesConnection(
    spaceId: $spaceId
    typeId: $typeId
    first: 1000
    after: $after
  ) {
    totalCount
    pageInfo { hasNextPage endCursor }
    nodes { id name }
  }
}
Variables:
{
  "spaceId": "41e851610e13a19441c4d980f2f2ce6b",
  "typeId": "484a18c5030a499cb0f2ef588ff16d50"
}
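The snippets below send this query through a small gql helper. A minimal sketch, assuming a plain GraphQL-over-HTTP endpoint and a fetch-capable runtime (Node 18+); the URL is a placeholder:

// Minimal GraphQL-over-HTTP helper. GEO_API_URL is a placeholder,
// substitute your actual endpoint.
const GEO_API_URL = "https://example.com/graphql";

async function gql(query: string, variables: Record<string, unknown>) {
  const res = await fetch(GEO_API_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query, variables }),
  });
  if (!res.ok) throw new Error(`GraphQL request failed: ${res.status}`);
  const { data, errors } = await res.json();
  if (errors?.length) throw new Error(errors[0].message);
  return data;
}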

Step 2 — Build the dedup map

async function buildDedupMap(
  spaceId: string,
  typeId: string,
): Promise<Map<string, string>> {
  const existing = new Map<string, string>(); // normalized name → entity id
  let after: string | null = null;

  do {
    const data = await gql(query, { spaceId, typeId, after });
    for (const e of data.entitiesConnection.nodes) {
      if (!e.name) continue; // skip nameless block entities
      existing.set(e.name.trim().toLowerCase(), e.id);
    }
    after = data.entitiesConnection.pageInfo.hasNextPage
      ? data.entitiesConnection.pageInfo.endCursor
      : null;
  } while (after);

  return existing;
}

The trim().toLowerCase() step matters — it treats "OpenAI", " OpenAI", and "openai" as the same entity. Decide your normalization rule before publishing and apply it at both build time and lookup time.
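The easiest way to keep the two in sync is a single shared helper. A minimal sketch (the name normalizeName is ours, not an SDK export):

// One normalization rule, used both when building the map and when looking up.
function normalizeName(name: string): string {
  return name.trim().toLowerCase();
}

// In buildDedupMap:   existing.set(normalizeName(e.name), e.id);
// In the ingest loop: const key = normalizeName(record.name);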

Step 3 — Use the map in the ingest loop

const existing = await buildDedupMap(SPACE_ID, PROJECT_TYPE_ID);
console.log(`Found ${existing.size} existing projects in target space`);

const ops: Op[] = [];
let created = 0, skipped = 0;

for (const record of incomingRecords) {
  const key = record.name.trim().toLowerCase();
  if (existing.has(key)) {
    skipped++;
    continue; // already there — don't republish
  }
  const { id, ops: createOps } = Graph.createEntity({
    name: record.name,
    types: [PROJECT_TYPE_ID],
    values: extractValues(record),
  });
  ops.push(...createOps);
  existing.set(key, id); // important: update map within batch
  created++;
}

console.log(`Created: ${created}, Skipped: ${skipped}`);
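One detail the loop leaves open: the accumulated ops still have to be submitted as a single edit after the loop finishes. How you do that depends on your publishing setup, so this sketch leans on a hypothetical publishOps helper rather than any specific SDK call:

// publishOps is hypothetical. Substitute your pipeline's actual
// edit-submission call (SDK, CLI wrapper, etc.).
if (ops.length > 0) {
  await publishOps({ name: "Bulk project import", ops });
} else {
  console.log("Nothing new to publish");
}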

Why this matters

Without dedup, every re-run of your publish script creates fresh duplicate entities. After a few runs the space has 3-5x as many "Stable Diffusion" entities as it should, none of them linked to each other, and queries return all of them. The fix is cheap (one paginated query at the start) but easy to forget.

Notes

  • Updating instead of skipping: if the incoming record has new values not yet on the existing entity, use Graph.updateEntity({ id: existing.get(key)!, values: ... }) instead of skipping. The dedup map gives you the existing ID; the update path adds only what's missing. A sketch follows after these notes.
  • Names can collide across types: you can have a Project named "Apple" and an Organization named "Apple". Always pass typeId to entitiesConnection so your map is type-scoped.
  • Cross-space dedup is harder: the recipe above is space-scoped. If you need "is this entity already published anywhere in Geo?", you have to either query each candidate space or use the global search query (fuzzy, untyped). For most curator workflows, space-scoped is the right answer.
  • trim().toLowerCase() is a starting normalization — for serious dedup also strip punctuation, fold accents (é → e), and collapse internal whitespace. Pick a rule, document it, and apply it consistently; an extended sketch also follows after these notes.
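For the first note, a sketch of the update branch. It assumes Graph.updateEntity returns ops the same way Graph.createEntity does; check your SDK version before relying on that:

// Inside the ingest loop, in place of the bare `continue`:
const existingId = existing.get(key)!;
// Assumption: Graph.updateEntity mirrors Graph.createEntity's return shape.
const { ops: updateOps } = Graph.updateEntity({
  id: existingId,
  values: extractValues(record), // ideally only the values that are missing
});
ops.push(...updateOps);
skipped++; // updated, not newly created
continue;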
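For the last note, a sketch of an extended rule: accent folding via NFD decomposition, punctuation stripping, and whitespace collapsing. The exact character classes are a judgment call; the point is to pick one rule and use it everywhere:

// Extended normalization: fold accents, drop punctuation, collapse whitespace.
function normalizeNameStrict(name: string): string {
  return name
    .normalize("NFD")                  // é → e + combining accent
    .replace(/[\u0300-\u036f]/g, "")   // drop the combining accents
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "")  // drop punctuation, keep letters/digits
    .replace(/\s+/g, " ")              // collapse internal whitespace
    .trim();
}

// normalizeNameStrict("  Café (Déjà Vu!)  ")  →  "cafe deja vu"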