[model-gateway] add phi4 vision image processor #14430
Conversation
Summary of Changes

Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request integrates support for the Phi4-Vision (Phi-4-multimodal) model by adding a dedicated image processor. This new processor handles the unique preprocessing steps required by Phi4, including its specific dynamic HD transform, normalization, and attention mask generation. The changes ensure that the model gateway can correctly process visual inputs for the Phi4 model, expanding the range of supported multimodal models.

Highlights
Code Review
This pull request introduces support for the Phi4-Vision image processor, including the core Rust implementation, a Python script for generating test data, and comprehensive golden tests. The implementation is well-structured and correctly follows the logic of the reference implementation. My review focuses on improving code clarity, maintainability, and performance by leveraging more idiomatic Rust and library features, particularly from ndarray and image, to replace manual loops with more efficient slicing and high-level operations.
// Global mask is all ones
for y in 0..mask_res {
    for x in 0..mask_res {
        combined_mask[[0, y, x]] = 1;
    }
}

// Tile attention masks
for h_idx in 0..h_crops {
    for w_idx in 0..w_crops {
        let tile_idx = h_idx * w_crops + w_idx + 1; // +1 for global
        let mask_y_start = h_idx * mask_res;
        let mask_x_start = w_idx * mask_res;

        for y in 0..mask_res {
            for x in 0..mask_res {
                combined_mask[[tile_idx, y, x]] =
                    attention_mask[[mask_y_start + y, mask_x_start + x]];
            }
        }
    }
}
The construction of the combined_mask can be simplified by using ndarray's slicing and filling capabilities, which is more idiomatic and can improve performance by avoiding manual loops.
// Global mask is all ones
combined_mask.slice_mut(s![0, .., ..]).fill(1);

// Tile attention masks
for h_idx in 0..h_crops {
    for w_idx in 0..w_crops {
        let tile_idx = h_idx * w_crops + w_idx + 1; // +1 for global
        let mask_y_start = h_idx * mask_res;
        let mask_x_start = w_idx * mask_res;

        let tile_mask = attention_mask.slice(s![
            mask_y_start..mask_y_start + mask_res,
            mask_x_start..mask_x_start + mask_res
        ]);
        combined_mask
            .slice_mut(s![tile_idx, .., ..])
            .assign(&tile_mask);
    }
}

for t in 0..num_crops {
    for c in 0..3 {
        for y in 0..base {
            for x in 0..base {
                pixel_values[[b, t, c, y, x]] = output[[t, c, y, x]];
            }
        }
    }
    for y in 0..mask_res {
        for x in 0..mask_res {
            attention_masks[[b, t, y, x]] = mask[[t, y, x]];
        }
    }
}
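The same slice-and-assign pattern could also replace these batched copy loops. A minimal sketch, assuming pixel_values, attention_masks, output, and mask are ndarray arrays whose shapes match the loop bounds above, and that the s! macro is in scope; the slices on the right-hand side match the loop bound in case output or mask carries more than num_crops tiles:

// Copy all crop tensors for batch entry `b` with one assignment each,
// rather than element by element (sketch, not part of the diff).
pixel_values
    .slice_mut(s![b, ..num_crops, .., .., ..])
    .assign(&output.slice(s![..num_crops, .., .., ..]));
attention_masks
    .slice_mut(s![b, ..num_crops, .., ..])
    .assign(&mask.slice(s![..num_crops, .., ..]));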
There was a problem hiding this comment.
let rust_image_sizes: Vec<(u32, u32)> = match result.model_specific.get("image_sizes") {
    Some(ModelSpecificValue::UintTensor { data, shape }) => {
        let num_images = shape[0];
        (0..num_images)
            .map(|i| (data[i * 2], data[i * 2 + 1]))
            .collect()
    }
    _ => panic!("Expected image_sizes in model_specific"),
};
The logic to reconstruct image_sizes from a flat vector can be made more concise and idiomatic by using chunks_exact instead of manual indexing.
let rust_image_sizes: Vec<(u32, u32)> = match result.model_specific.get("image_sizes") {
    Some(ModelSpecificValue::UintTensor { data, .. }) => {
        data.chunks_exact(2).map(|s| (s[0], s[1])).collect()
    }
    _ => panic!("Expected image_sizes in model_specific"),
};
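One caveat with chunks_exact: it silently drops a trailing remainder instead of panicking, so a malformed buffer would shrink the result rather than fail the test. A defensive length check (an illustrative addition, not part of the suggestion) keeps such mismatches loud:

// chunks_exact(2) yields only complete pairs and ignores a trailing odd
// element; assert the length up front so a bad buffer fails the test visibly.
assert_eq!(data.len() % 2, 0, "image_sizes must be flat (h, w) pairs");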
Checklist