GSOC 2026: Project 1 -> Build a GUI Agent with local LLM/VLM and OpenVINO #34830
Replies: 12 comments 3 replies
-
Hi @Hmm-1224, nice to see you. It looks like you already contributed to OpenVINO last year? If so, no need to worry about the prerequisite tasks; please focus on the project itself.
-
From my understanding, this project is about creating a GUI-based agent that interacts with the computer interface the way a human operates a system. The core idea is to bridge natural-language instructions with real-world UI actions. An agentic workflow is crucial here: the system will not simply perform a task from a single command, but will plan steps, execute actions, observe the results, and adjust its behaviour accordingly. The work is distributed and performed iteratively, which makes the system more robust and closer to how a human actually interacts with a UI.

Though I have not completely gone through the reference projects, projects like UI-TARS-desktop and MobileAgent suggest that the system will rely on screen understanding, structured reasoning, and sequential execution. My interpretation is that this project extends similar ideas to a desktop environment with a strong focus on local inference and usability. The above is my understanding; please clarify if I have misunderstood or missed something.

My proposal is given below. I plan to design the system as a modular pipeline with the following components:
Expected Deliverables: I am very open to your suggestions on areas of improvement. Please have a look and let me know whether this works or needs changes. Also, is there anything specific you would like to see in a strong proposal?
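To make the iterative plan-act-observe idea concrete, here is a minimal sketch of such an agent loop. Every function and step name below is an illustrative placeholder I made up for this sketch, not part of any real agent or OpenVINO API:

```python
# Minimal plan -> act -> observe loop. All names here are illustrative
# placeholders (assumptions), not an existing API.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    history: list = field(default_factory=list)
    done: bool = False

def plan_next_action(state):
    # Placeholder for the LLM planner: choose the next step given the goal
    # and the actions taken so far. Here it is a scripted demo plan.
    steps = ["focus_search_bar", "type_query", "press_enter", "click_first_result"]
    for step in steps:
        if step not in state.history:
            return step
    return None        # nothing left to do -> goal reached

def execute_action(action):
    # Placeholder for real UI execution (e.g. via pyautogui).
    return f"executed:{action}"

def run_agent(goal, max_steps=10):
    state = AgentState(goal=goal)
    for _ in range(max_steps):
        action = plan_next_action(state)      # plan
        if action is None:
            state.done = True
            break
        observation = execute_action(action)  # act; in a real agent this
        state.history.append(action)          # observation feeds the next plan
    return state
```

The point of the loop is exactly the iterative behaviour described above: the planner is consulted again after every action rather than emitting a full script up front.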
-
Hi @openvino-dev-samples,
-
@openvino-dev-samples and @zhou wu While exploring this further, I also looked into some specific models that could be used here. For the LLM, I came across lightweight models like Phi-2 / Phi-3 and some LLaMA-based variants, which seem suitable for local deployment and could be used mainly for structured planning rather than heavy conversational reasoning. For the VLM, I explored models like LLaVA, but considering performance and complexity, I am thinking of initially going with a simpler hybrid approach: Tesseract for extracting on-screen text plus some basic UI heuristics (approximate positions and layout understanding), extending it to a proper VLM for more advanced perception later if required. Overall, this should reduce latency and improve performance. For OpenVINO integration, I plan to deploy at least one model locally using OpenVINO and explore optimizations like model conversion and quantization to make inference efficient on CPU. Along with this, I am also planning to build a small prototype pipeline (screenshot -> OCR -> action -> execution -> observe) to validate the approach early. Please let me know if I have misunderstood something or if you would suggest a different direction or improvements.
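As a rough illustration of the prototype pipeline mentioned above (screenshot -> OCR -> action -> execution -> observe), here is a skeleton with the screenshot, Tesseract, and pyautogui calls stubbed out. Every helper name and the sample bounding boxes are assumptions for illustration only:

```python
# Skeleton of the screenshot -> OCR -> plan -> execute -> observe loop.
# Real screenshot/OCR/click calls are stubbed; all names are illustrative.

def take_screenshot():
    # Stand-in for e.g. PIL.ImageGrab.grab().
    return "<screenshot>"

def run_ocr(image):
    # Stand-in for Tesseract: return (text, bounding_box) pairs.
    return [("Search", (100, 40, 300, 70)), ("Play", (50, 500, 120, 540))]

def find_element(elements, label):
    # Basic UI heuristic: match by text, aim at the center of the box.
    for text, (x1, y1, x2, y2) in elements:
        if text.lower() == label.lower():
            return ((x1 + x2) // 2, (y1 + y2) // 2)
    return None

def step(target_label):
    image = take_screenshot()          # observe the screen
    elements = run_ocr(image)          # extract text + positions
    point = find_element(elements, target_label)
    if point is None:                  # element not found -> caller retries
        return {"ok": False, "action": None}
    # Stand-in for pyautogui.click(*point).
    return {"ok": True, "action": ("click", point)}
```

Swapping the stubs for real `pytesseract` and `pyautogui` calls would turn this into the early prototype described above, which is why it is useful for validating the loop before any model work.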
-
@openvino-dev-samples and @zhuo-yoyowz Modular pipeline design:
2. Planning and reasoning module (LLM)
3. Screen understanding module (OCR + heuristics / later VLM)
5. Agent Workflow Optimization: I will focus on making the agent fast and efficient on local hardware through optimizations in three key areas.
1. OCR optimization: Instead of scanning the entire screen every time, which is quite heavy, I will crop only the relevant areas, like the search bar and player controls, where the actual action happens. This alone should cut processing down significantly. On top of that, I will convert the OCR model to INT8 using OpenVINO's post-training optimization tooling, which should give around 2-3x faster inference. I am also planning to cache the positions of stable UI elements, like the play button and search bar, for about 30 seconds, so that I don't have to keep calling OCR for the same thing. Combined, I am hoping to bring OCR latency down from around 500 ms to roughly 150 ms.
2. LLM planning optimization: I will reduce planning latency by using structured prompt templates with placeholders instead of verbose natural language, which lowers the token count per request. I will also enable key-value caching in OpenVINO to reuse computed states across sequential planning calls, speeding up repeated inferences. My target is to bring planning latency down from roughly 800 ms to about 300 ms.
3. Workflow optimization: I will implement state-change detection by comparing pixel hashes of target areas before and after actions, eliminating the need to re-run OCR after most successful actions. For retries, I will use exponential backoff with intervals of 0.5, 1, and 2 seconds instead of fixed waiting periods. Combined with UI layout caching, these optimizations should reduce the number of OCR calls per task from 8-10 to just 2-4, cutting end-to-end task time from 12-15 seconds down to 6-8 seconds.
OpenVINO integration: I will deploy at least one model locally using OpenVINO, applying FP16 or INT8 quantization to optimize for CPU inference.
I will benchmark inference times before and after optimization to clearly show the performance improvements achieved. I believe this scenario is a concrete use case, but if you would like me to target a different scenario, I will be happy to do so.
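A minimal sketch of the three workflow optimizations described above: a ~30 s TTL cache for stable UI positions, pixel-hash state-change detection, and 0.5/1/2 s exponential backoff. The numeric values mirror the targets in the plan, and all helper names are illustrative assumptions:

```python
# Sketch of the workflow optimizations; helper names are illustrative.
import hashlib
import time

class UIPositionCache:
    """Remember stable UI element positions for `ttl` seconds so repeated
    actions can skip OCR for elements that rarely move."""
    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self._store = {}   # label -> (position, timestamp)

    def get(self, label):
        entry = self._store.get(label)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        return None        # missing or expired -> caller re-runs OCR

    def put(self, label, position):
        self._store[label] = (position, time.monotonic())

def region_hash(pixels):
    # Hash the pixels of a target region; comparing the hash before and
    # after an action detects a state change without re-running OCR.
    return hashlib.sha256(bytes(pixels)).hexdigest()

def retry_with_backoff(action, changed, delays=(0.5, 1.0, 2.0)):
    # Run `action`; while the UI has not changed, retry after 0.5, 1, 2 s.
    action()
    for delay in delays:
        if changed():
            return True
        time.sleep(delay)
        action()
    return changed()
```

In the agent loop, `changed` would compare `region_hash` values of the target area before and after the click, and a cache miss (not an expired timer alone) is what triggers a fresh OCR pass.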
-
@openvino-dev-samples and @zhuo-yoyowz prototype.mp4
-
Hi @Hmm-1224, your prototype has impressively fast processing speed. What optimizations did you apply to increase inference speed?
-
Hi @KarSri7694, thanks! The prototype is currently running on CPU without OpenVINO or CUDA. The fast speed comes from workflow optimizations: hardcoded stable UI positions, minimal delays in pyautogui, and selective waiting instead of full OCR or heavy model inference.
-
Hi @openvino-dev-samples and @zhuo-yoyowz, Regards,
-
@openvino-dev-samples and @zhuo-yoyowz I want to clarify one point: since I'm using a combination of LLM + OCR, I'm not fully confident that OCR alone can handle all cases reliably. That's why I've added a VLM as an optional component, though my initial focus will remain on OCR. I've chosen models that are easier to optimize and integrate with OpenVINO for efficient local inference. I have emailed my proposal from 23b0910@iitb.ac.in.
-
@openvino-dev-samples, I also made a change to the VLM model. I previously chose BLIP-2, but after going through its pros and cons, it seems very heavy and not ideal for local deployment; it also has a very high inference cost, which makes it impractical for my use case. That's why I switched to MiniCPM-V, which is lightweight and easier to optimize with INT8/FP16 quantization for CPU inference. Since the VLM is optional in my design, MiniCPM-V fits better as a fallback without increasing scope unnecessarily.
-
@openvino-dev-samples and @zhuo-yoyowz,
-
Hello @openvino-dev-samples and zhou Wu,
I am happy to see you again. I would like to work on this project. I am very comfortable in Python and quite familiar with the OpenVINO toolkit. I have worked on some PRs as well, of which two got merged. I also worked on gesture control last year, where I gained useful experience. To be honest, I don't have a strong background in prompt engineering or agentic workflows, but I am open to learning and enjoy picking up new things.
I would love to get started. Could you please suggest any prerequisite tasks where I can begin contributing?