My first thought was to have the LLM identify the exact applications shown in the screenshot. On further consideration, it may be possible to have the LLM infer the user's intention from the screenshot instead, for example by asking "What was I doing in this screenshot?" It's a direction worth exploring.
It depends on what this app is. If it's an efficiency or productivity tool, it may not need to know the exact applications; it just needs to know what the user is doing.