When I run the object detection task, I find that semantically similar input prompts, such as "building" and "house", produce very different detection results. Do you have any suggestions for choosing prompts that give more reliable detections? Do I need to know your open-vocabulary glossary?