How is positional embedding added here?

How exactly is spatial correspondence maintained? I see that the masks are fed through the cross-attention mechanism in the U-Net layers, but I do not see any positional embedding being applied. How is this handled, or am I missing something?