GSOC 2026: Project 1 -> Build a GUI Agent with local LLM/VLM and OpenVINO #34830
Replies: 12 comments 3 replies
-
Hi @Hmm-1224, nice to see you. It looks like you already contributed to OpenVINO last year? If so, no need to worry about the prerequisite tasks; please focus on the project itself.
-
From my understanding, this project is about creating a GUI-based agent that interacts with the computer interface the way a human operates a system. The core idea is to bridge natural-language instructions with real-world UI actions. An agentic workflow is crucial here: the system will not simply perform a task from a single command, but will plan steps, execute actions, observe the results, and adjust its behaviour accordingly. The work is distributed and performed iteratively, which makes the system more robust and closer to how a human actually interacts with a UI.

Though I have not completely gone through the reference projects, projects like UI-TARS-desktop and MobileAgent suggest that the system will rely on screen understanding, structured reasoning, and sequential execution. My interpretation is that this project extends similar ideas to a desktop environment with a strong focus on local inference and usability. The above is my understanding; please clarify if I have misunderstood or missed something.

My proposal is given below. I plan to design the system as a modular pipeline with the following components:
Expected Deliverables: I am very open to your suggestions on areas of improvement. Please have a look and let me know whether this works or needs changes. Also, is there anything specific you would like to see in a strong proposal?
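To make the iterative plan-act-observe idea concrete, here is a minimal sketch of such an agent loop. Every function and step name below is an illustrative placeholder I made up for this sketch, not part of any real agent or OpenVINO API:

```python
# Minimal plan -> act -> observe loop. All names here are illustrative
# placeholders (assumptions), not an existing API.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    history: list = field(default_factory=list)
    done: bool = False

def plan_next_action(state):
    # Placeholder for the LLM planner: choose the next step given the goal
    # and the actions taken so far. Here it is a scripted demo plan.
    steps = ["focus_search_bar", "type_query", "press_enter", "click_first_result"]
    for step in steps:
        if step not in state.history:
            return step
    return None        # nothing left to do -> goal reached

def execute_action(action):
    # Placeholder for real UI execution (e.g. via pyautogui).
    return f"executed:{action}"

def run_agent(goal, max_steps=10):
    state = AgentState(goal=goal)
    for _ in range(max_steps):
        action = plan_next_action(state)      # plan
        if action is None:
            state.done = True
            break
        observation = execute_action(action)  # act; in a real agent this
        state.history.append(action)          # observation feeds the next plan
    return state
```

The point of the loop is exactly the iterative behaviour described above: the planner is consulted again after every action rather than emitting a full script up front.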
-
Hi @openvino-dev-samples,
-
@openvino-dev-samples and @zhou wu While exploring this further, I also looked into some specific models that could be used here. For the LLM, I came across lightweight models like Phi-2 / Phi-3 and some LLaMA-based variants, which seem suitable for local deployment and could be used mainly for structured planning rather than heavy conversational reasoning. For the VLM, I explored models like LLaVA, but considering performance and complexity, I am thinking of initially going with a simpler hybrid approach: Tesseract for extracting on-screen text plus some basic UI heuristics (approximate positions and layout understanding), extending it to a proper VLM for more advanced perception later if required. Overall, this should reduce latency and improve performance. For OpenVINO integration, I plan to deploy at least one model locally using OpenVINO and explore optimizations like model conversion and quantization to make inference efficient on CPU. Along with this, I am also planning to build a small prototype pipeline (screenshot -> OCR -> action -> execution -> observe) to validate the approach early. Please let me know if I have misunderstood something or if you would suggest a different direction or improvements.
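As a rough illustration of the prototype pipeline mentioned above (screenshot -> OCR -> action -> execution -> observe), here is a skeleton with the screenshot, Tesseract, and pyautogui calls stubbed out. Every helper name and the sample bounding boxes are assumptions for illustration only:

```python
# Skeleton of the screenshot -> OCR -> plan -> execute -> observe loop.
# Real screenshot/OCR/click calls are stubbed; all names are illustrative.

def take_screenshot():
    # Stand-in for e.g. PIL.ImageGrab.grab().
    return "<screenshot>"

def run_ocr(image):
    # Stand-in for Tesseract: return (text, bounding_box) pairs.
    return [("Search", (100, 40, 300, 70)), ("Play", (50, 500, 120, 540))]

def find_element(elements, label):
    # Basic UI heuristic: match by text, aim at the center of the box.
    for text, (x1, y1, x2, y2) in elements:
        if text.lower() == label.lower():
            return ((x1 + x2) // 2, (y1 + y2) // 2)
    return None

def step(target_label):
    image = take_screenshot()          # observe the screen
    elements = run_ocr(image)          # extract text + positions
    point = find_element(elements, target_label)
    if point is None:                  # element not found -> caller retries
        return {"ok": False, "action": None}
    # Stand-in for pyautogui.click(*point).
    return {"ok": True, "action": ("click", point)}
```

Swapping the stubs for real `pytesseract` and `pyautogui` calls would turn this into the early prototype described above, which is why it is useful for validating the loop before any model work.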
-
@openvino-dev-samples and @zhuo-yoyowz Modular pipeline design:
2. Planning and reasoning module (LLM)
3. Screen understanding module (OCR + heuristics / later VLM)
5. Agent Workflow Optimization: I will focus on making the agent fast and efficient on local hardware through optimizations in three key areas.
1. OCR optimization: Instead of scanning the entire screen every time, which is quite heavy, I will crop only the relevant areas, like the search bar and player controls, where the actual action happens. This alone should cut processing down significantly. On top of that, I will convert the OCR model to INT8 using OpenVINO's post-training optimization tooling, which should give around 2-3x faster inference. I am also planning to cache the positions of stable UI elements, like the play button and search bar, for about 30 seconds, so that I don't have to keep calling OCR for the same thing. Combined, I am hoping to bring OCR latency down from around 500 ms to roughly 150 ms.
2. LLM planning optimization: I will reduce planning latency by using structured prompt templates with placeholders instead of verbose natural language, which lowers the token count per request. I will also enable key-value caching in OpenVINO to reuse computed states across sequential planning calls, speeding up repeated inferences. My target is to bring planning latency down from roughly 800 ms to about 300 ms.
3. Workflow optimization: I will implement state-change detection by comparing pixel hashes of target areas before and after actions, eliminating the need to re-run OCR after most successful actions. For retries, I will use exponential backoff with intervals of 0.5, 1, and 2 seconds instead of fixed waiting periods. Combined with UI layout caching, these optimizations should reduce the number of OCR calls per task from 8-10 to just 2-4, cutting end-to-end task time from 12-15 seconds down to 6-8 seconds.
OpenVINO integration: I will deploy at least one model locally using OpenVINO, applying FP16 or INT8 quantization to optimize for CPU inference.
I will benchmark inference times before and after optimization to clearly show the performance improvements achieved. I believe this scenario is a concrete use case, but if you would like me to target a different scenario, I will be happy to do so.
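A minimal sketch of the three workflow optimizations described above: a ~30 s TTL cache for stable UI positions, pixel-hash state-change detection, and 0.5/1/2 s exponential backoff. The numeric values mirror the targets in the plan, and all helper names are illustrative assumptions:

```python
# Sketch of the workflow optimizations; helper names are illustrative.
import hashlib
import time

class UIPositionCache:
    """Remember stable UI element positions for `ttl` seconds so repeated
    actions can skip OCR for elements that rarely move."""
    def __init__(self, ttl=30.0):
        self.ttl = ttl
        self._store = {}   # label -> (position, timestamp)

    def get(self, label):
        entry = self._store.get(label)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        return None        # missing or expired -> caller re-runs OCR

    def put(self, label, position):
        self._store[label] = (position, time.monotonic())

def region_hash(pixels):
    # Hash the pixels of a target region; comparing the hash before and
    # after an action detects a state change without re-running OCR.
    return hashlib.sha256(bytes(pixels)).hexdigest()

def retry_with_backoff(action, changed, delays=(0.5, 1.0, 2.0)):
    # Run `action`; while the UI has not changed, retry after 0.5, 1, 2 s.
    action()
    for delay in delays:
        if changed():
            return True
        time.sleep(delay)
        action()
    return changed()
```

In the agent loop, `changed` would compare `region_hash` values of the target area before and after the click, and a cache miss (not an expired timer alone) is what triggers a fresh OCR pass.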
-
@openvino-dev-samples and @zhuo-yoyowz prototype.mp4
-
Hi @Hmm-1224, your prototype has impressively fast processing speed. What optimizations did you apply to increase inference speed?
-
Hi @KarSri7694, thanks! The prototype is currently running on CPU without OpenVINO or CUDA. The fast speed comes from workflow optimizations: hardcoded stable UI positions, minimal delays in pyautogui, and selective waiting instead of full OCR or heavy model inference.
-
Hi @openvino-dev-samples and @zhuo-yoyowz, Regards,
-
@openvino-dev-samples and @zhuo-yoyowz I want to clarify one point: since I'm using a combination of LLM + OCR, I'm not fully confident that OCR alone can handle all cases reliably. That's why I've added a VLM as an optional component, though my initial focus will remain on OCR. I've chosen models that are easier to optimize and integrate with OpenVINO for efficient local inference. I have emailed my proposal from 23b0910@iitb.ac.in.
-
@openvino-dev-samples, I also made a change to the VLM model. I previously chose BLIP-2, but after going through its pros and cons, it seems very heavy and not ideal for local deployment; it also has a very high inference cost, which makes it impractical for my use case. That's why I switched to MiniCPM-V, which is lightweight and easier to optimize with INT8/FP16 quantization for CPU inference. Since the VLM is optional in my design, MiniCPM-V fits better as a fallback without increasing scope unnecessarily.
-
@openvino-dev-samples and @zhuo-yoyowz,
-
Hello @openvino-dev-samples and zhou Wu,
I am happy to see you again. I would like to work on this project. I am very comfortable in Python and quite familiar with the OpenVINO toolkit. I have worked on some PRs as well, of which two got merged. I also worked on gesture control last year, where I gained useful experience. To be honest, I don't have a strong background in prompt engineering or agentic workflows, but I am open to learning and enjoy picking up new things.
I would love to get started. Could you please suggest any prerequisite tasks where I can begin contributing?