Changes for Transformers Uplift v5.2.0 in tt-xla by ssaliceTT · Pull Request #529 · tenstorrent/tt-forge-models

ssaliceTT · 2026-03-17T05:04:52Z

Ticket

Problem description

Transformers is being uplifted from 4.57.1 to 5.2.0. This major-version uplift broke several tests, and many changes were required to fix them.

What's changed

- FeatureExtractor → ImageProcessor (detr, maskformer, yolos_small): replaced the deprecated DetrFeatureExtractor, MaskFormerFeatureExtractor, and YolosFeatureExtractor with their ImageProcessor equivalents.
- encode_plus() → tokenizer() (huggyllama, mistral, roberta): replaced tokenizer.encode_plus(...) with the direct tokenizer(...) call.
- trust_remote_code=True removed for phi3: phi3 is now upstream in transformers; removed the flag from AutoTokenizer, AutoConfig, and model_kwargs across phi3/causal_lm, phi3/phi_3_5, phi3/seq_cls, and phi3/token_cls.
- VLM sub-module path fix for pixtral: model.language_model / model.vision_tower are no longer directly exposed on the top-level model in 5.x; added _get_language_model() / _get_vision_tower() helpers that check both paths.
- tie_weights() signature fix (openvla/pytorch/src/modeling_prismatic.py): updated the override to accept and forward **kwargs to match the new PreTrainedModel.tie_weights(**kwargs) signature.
- AutoProcessor(trust_remote_code=True) → local processor for openvla_oft: added processing_prismatic.py (copied from the openvla source) and replaced AutoProcessor.from_pretrained(..., trust_remote_code=True) with explicit PrismaticImageProcessor + PrismaticProcessor instantiation.
- Sentencizer, XLMRobertaSelfAttentionWithAdapters rewritten: XLMRobertaSdpaSelfAttention was removed in 5.x (consolidated into the unified dispatch); rewrote the adapter attention class to use eager_attention_forward from the new unified API (~170 lines of old attention code replaced).
- HfFolder.get_token() → HfApi().token (sentencizer/pytorch/src/utils.py): HfFolder was removed from huggingface_hub.
- is_torch_fx_available / is_torch_greater_or_equal_than_1_13 removed (deepseek/deepseek_ocr/pytorch/src/modeling_deepseekv2.py): removed the guards and left the torch.fx.wrap call unconditional, since PyTorch >= 2.1 always has torch.fx.
- EasyDel JAX models pinned to transformers==4.57.1: added per-model requirements.txt pinning transformers==4.57.1 for falcon, gpt2, llama, phi1, phi1_5, phi2, phi3, qwen_2_5, qwen_2_5_coder, qwen_3, and whisper (all JAX/EasyDel variants). EasyDel requires the older transformers API.
- Module-level → method-level imports for JAX loaders (falcon/jax and mistral/causal_lm/jax): moved transformers imports inside the method body so they are not imported before the per-model pip install (which sets the pinned version) has run.
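The move from module-level to method-level imports can be illustrated with a minimal sketch. The `ExampleJaxLoader` class is hypothetical, and `json` stands in for `transformers` so the snippet runs without third-party packages; the real falcon/jax and mistral/causal_lm/jax loaders are not reproduced here.

```python
class ExampleJaxLoader:
    """Hypothetical loader used only to illustrate import placement."""

    def load_model(self):
        # Method-level import: Python resolves the module when load_model()
        # runs, not when this file is first imported. `json` is a stand-in
        # for `transformers` (an assumption made so the sketch is runnable).
        import json as pinned_dependency

        return pinned_dependency.loads('{"loaded": true}')


loader = ExampleJaxLoader()
result = loader.load_model()
```

Because the import statement executes on each call rather than at file import time, a version installed between importing the loader file and calling `load_model()` is the one that gets resolved, provided no earlier copy is still cached in `sys.modules`.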

Checklist

New/Existing tests provide coverage for changes


### Ticket

N/A

### Problem description

Uplift the transformers library from `4.57.1` to `5.2.0` to broaden model support and enable new models such as GLM-5 to run on our stack. Transformers 5.x is a major version with several breaking changes that required fixes across both tt-xla and tt-forge-models.

### What's changed

#### Transformers 5.x breaking changes and how we addressed them

**Flax/JAX backend removed (transformers 5.0, [PR #40760](huggingface/transformers#40760))**

All `FlaxXxx` model classes were removed from the library. As a result:
- All JAX tests backed by `FlaxPreTrainedModel` are now marked `NOT_SUPPORTED_SKIP` (82 test entries updated in `test_config_inference_single_device.yaml`). Affected model families: albert, bart, beit, bert/masked_lm, longt5, mt5, t5, regnet, resnet, vit, dinov2, bloom, clip, distilbert, electra, gpt_j, gpt_neo, gpt_sw3, mistral, opt, roberta, roformer, squeezebert, wav2vec2, whisper, xglm, xlm_roberta, marian_mt, mbart50, bigbird, pegasus, vision_text_dual_encoder
- Removed `FlaxPreTrainedModel` from the `Model` type alias in `types.py` and from `isinstance` checks and parameter handling in `jax_model_tester.py` and `dynamic_jax_model_tester.py`
- Four mamba tensor-parallel test entries removed from `test_config_inference_tensor_parallel.yaml` (Flax mamba model class was removed)
- EasyDel-based JAX models (falcon, phi1, phi1_5, phi2, phi3, gpt2, qwen 2.5/coder/3, llama, whisper) remain functional and are pinned to `transformers==4.57.1` via per-model `requirements.txt` in tt-forge-models, since EasyDel itself requires the older transformers API

**Legacy cache format removed (transformers 5.0–5.2, [PR #41378](huggingface/transformers#41378), [PR #43168](huggingface/transformers#43168))**

`to_legacy_cache()`, `from_legacy_cache()`, `get_usable_length()`, and all deprecated `Cache` subclasses were removed. Changes made:
- Updated `kimi_k2/modeling_deepseek.py`: replaced `DynamicCache.from_legacy_cache()` with a manual layer-by-layer construction, replaced `to_legacy_cache()` with a manual tuple, and replaced `get_usable_length()` with `get_seq_length()`
- Updated `kimi_k2/test_kimi_k2.py`: replaced tuple-indexed shard spec keys (`args[3][0][0]`) with the new layer attribute API (`args[3].layers[0].compressed_kv`), and added `lazy_initialization()` calls for `StaticCache` layers

**Unified attention interface (transformers 5.x)**

Attention modules no longer return `attn_weights` when using the unified SDPA/flash/eager dispatch path, and require `_attn_implementation` to be set explicitly on the config. Updated Gemma and Mistral attention tests to:
- Set `config._attn_implementation = "sdpa"` before constructing attention modules
- Drop `attn_weights` from the return value of the inner attention call

**`XXXFeatureExtractor` classes removed (transformers 5.0, [PR #41174](huggingface/transformers#41174))**

All legacy vision `FeatureExtractor` classes were replaced by `ImageProcessor` equivalents. Updated in tt-forge-models:
- `detr`: `DetrFeatureExtractor` → `DetrImageProcessor`
- `maskformer`: `MaskFormerFeatureExtractor` → `MaskFormerImageProcessor`
- `yolos_small`: `YolosFeatureExtractor` → `YolosImageProcessor`

**`encode_plus()` / `batch_encode_plus()` removed in favour of `__call__()` (transformers 5.0)**

The legacy tokenizer encoding methods were formally removed. Changes made:
- tt-forge-models (`huggyllama`, `mistral`, `roberta`): `tokenizer.encode_plus(...)` → `tokenizer(...)`
- `examples/pytorch/sdxl-pipeline.py`: `tokenizer.batch_encode_plus(...)` → `tokenizer(...)`
- `tests/torch/models/llama3/test_llama_step_n300.py`: `tokenizer.encode_plus(...)` → `tokenizer._encode_plus(...)` (private method still present in 5.x as the internal implementation; should ideally be `tokenizer(...)`)
- `tests/torch/quality/image_gen/sdxl/pipeline.py`: replaced the private `tokenizer._encode_plus(...)` call (which broke in 5.x for list inputs with `padding="max_length"`) with the public `tokenizer(...)` interface with explicit `padding="max_length"`, `truncation=True`, and `return_tensors="pt"`. The old code produced mismatched sequence lengths for conditioned vs unconditioned tokens, causing a `torch.cat` shape mismatch error.

**`trust_remote_code` no longer needed for phi3 (transformers 5.x)**

The phi3 model was upstreamed into the official transformers library and `trust_remote_code=True` is now unnecessary. Removed from `AutoTokenizer.from_pretrained`, `AutoConfig.from_pretrained`, and `model_kwargs` in the phi3 loader.

**`torch.fx` support dropped (transformers 5.0, [PR #41683](huggingface/transformers#41683))**

`is_torch_fx_available()`, `is_torch_greater_or_equal_than_1_13`, and all `torch.fx` tracing guards were removed. Updated:
- `deepseek_r1` (deepseekv2) loader in tt-forge-models
- `kimi_k2/modeling_deepseek.py`: removed `is_torch_fx_available` import and the `_prepare_4d_causal_attention_mask` FX wrap block; replaced `rope_scaling["type"]` dict access with `.get()` to guard against missing keys in newer config formats

**VLM sub-module path changed (transformers 5.x, [PR #42156](huggingface/transformers#42156))**

Vision-language models no longer expose `model.language_model` directly at the top level; it is now accessed via `model.model.language_model`.
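
A loader that has to work against both layouts can probe each path in turn. The sketch below is an assumption about how such a lookup might be written, not the code from the pixtral loader: `get_submodule` is a hypothetical name, and `SimpleNamespace` objects stand in for real transformers models so the snippet runs without the library.

```python
from types import SimpleNamespace


def get_submodule(model, name):
    """Return `model.<name>` if present, else `model.model.<name>`.

    Hypothetical helper; the real `_get_language_model()` and
    `_get_vision_tower()` helpers may be implemented differently.
    """
    # Layout where the sub-module sits directly on the top-level model.
    if hasattr(model, name):
        return getattr(model, name)
    # Layout where the sub-module is nested one level down, under `.model`.
    inner = getattr(model, "model", None)
    if inner is not None and hasattr(inner, name):
        return getattr(inner, name)
    raise AttributeError(f"{name!r} not found on model or model.model")


# Stand-in objects mimicking the two layouts (not real transformers models).
old_layout = SimpleNamespace(language_model="lm-top-level")
new_layout = SimpleNamespace(model=SimpleNamespace(language_model="lm-nested"))

top_level_result = get_submodule(old_layout, "language_model")
nested_result = get_submodule(new_layout, "language_model")
```

Raising `AttributeError` when neither path exists keeps a future layout change visible instead of silently producing an empty shard spec.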
Updated `mistral/pixtral` loader to add `_get_language_model()` and `_get_vision_tower()` helpers that handle both paths when building shard specs.

**`AutoProcessor` with `trust_remote_code` removed for custom processors (transformers 5.x)**

`AutoProcessor.from_pretrained(trust_remote_code=True)` no longer works for models with custom processing classes not registered in the transformers auto-mapping. Updated `openvla_oft` to explicitly instantiate `PrismaticImageProcessor` and `PrismaticProcessor` from the local `openvla/pytorch/src/` source.

**`tie_weights()` signature changed (transformers 5.x)**

`PreTrainedModel.tie_weights()` now passes through `**kwargs`. Updated the `tie_weights` override in `openvla/pytorch/src/modeling_prismatic.py` to accept and forward `**kwargs` to avoid a `TypeError` on model init.

**`XLMRobertaSdpaSelfAttention` removed (transformers 5.x)**

The separate SDPA attention class was consolidated into the unified attention dispatch. Rewrote `XLMRobertaSelfAttentionWithAdapters` in `sentencizer/pytorch/src/adapter_utils.py` to conform to the new `forward()` signature using `eager_attention_forward` from transformers.

**`HfFolder.get_token()` removed (huggingface_hub)**

`HfFolder` was removed in recent `huggingface_hub` versions. Updated `sentencizer/pytorch/src/utils.py` to use `HfApi().token` instead.

**mamba2 JAX loader removed**

`mamba2/causal_lm/jax` was removed as it was non-functional and incompatible with the pinned EasyDel version used by other JAX models.

#### tt-xla infrastructure changes

- **`transformers` removed from `_JAX_PURGE_SKIP`** (`tests/runner/requirements.py`): `transformers` was previously excluded from the `sys.modules` purge that `RequirementsManager` performs after a per-model pip install. This meant that when an EasyDel model installed `transformers==4.57.1`, the venv's 5.2.0 stayed cached in memory and the newly installed version was never visible to imports. Removing `transformers` from the skip list (keeping only `flax`, which has genuine module-level imports in JAX infra) ensures the installed version is correctly used. All JAX infra files were audited to confirm none hold module-level `transformers` references.
- **Sparse MLP router output fix** (`python_package/tt_torch/sparse_mlp.py`): `GptOssTopKRouter` was updated to return a 3-tuple `(router_logits, router_scores, router_indices)` instead of 2. Updated all three MoE dispatch paths (`SparseMLP`, `A2aSparseMLP`, `A2aSparseStackedMlp`) to unpack accordingly and simplified the weighted-sum logic to use the compact scores tensor directly, removing a workaround that used `torch.gather` / one-hot einsum.
- **Performance benchmark matrix** (`.github/workflows/perf-bench-matrix.json`): Updated all PyTorch benchmark entries from `transformers==4.57.1` to `transformers==5.2.0`. The `resnet_jax` and `bge_m3_encode` entries are intentionally kept at `transformers==4.57.1`: `FlaxResNetForImageClassification` was removed in 5.x, and `FlagEmbedding` (used by bge_m3) is not yet compatible with 5.x.
- **LLM benchmark version check** (`tests/benchmark/benchmarks/llm_benchmark.py`): Updated `check_transformers_version()` to require exactly `5.2.0` instead of `<= 4.57.1`. Also removed the now-unnecessary `check_transformers_version()` guard from `examples/pytorch/llama.py`.
- **Resnet codegen examples skipped** (`tests/examples/test_examples.py`): Added XFAIL entries for `jax/codegen/cpp/resnet.py` and `jax/codegen/python/resnet.py` since `FlaxResNetModel` was removed in transformers 5.x.
- **`surya-ocr` unpinned** (`venv/requirements-dev.txt`): Removed the `surya-ocr==0.17.0` version pin.
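
The `sys.modules` purge behind the `_JAX_PURGE_SKIP` change can be sketched roughly as follows. This is an illustration under assumptions, not the `RequirementsManager` implementation: `purge_modules` and its parameters are hypothetical names, and fake module entries are used so that nothing real is unloaded.

```python
import sys
import types


def purge_modules(package_names, skip=frozenset({"flax"})):
    """Drop cached modules of the given top-level packages from sys.modules.

    Packages listed in `skip` are left cached. Returns the removed names.
    """
    removed = []
    for name in list(sys.modules):  # copy keys: the dict is mutated below
        top_level = name.split(".")[0]
        if top_level in package_names and top_level not in skip:
            del sys.modules[name]
            removed.append(name)
    return sorted(removed)


# Register fake modules so the demonstration does not touch real packages.
sys.modules["demo_pkg"] = types.ModuleType("demo_pkg")
sys.modules["demo_pkg.sub"] = types.ModuleType("demo_pkg.sub")
sys.modules["demo_keep"] = types.ModuleType("demo_keep")

removed = purge_modules({"demo_pkg", "demo_keep"}, skip={"demo_keep"})
still_cached = "demo_keep" in sys.modules
```

A package on the skip list behaves like `demo_keep` here: its cached module object survives the purge, so a later `import` returns the already loaded version rather than the freshly installed one.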
#### tt-forge models PR:

tenstorrent/tt-forge-models#529

### CI tests for reference:

Manual Release test: https://github.com/tenstorrent/tt-xla/actions/runs/23179435697

Manual Manylinux release test: https://github.com/tenstorrent/tt-xla/actions/runs/23179426382

### Checklist

- [x] Fix `gpt_oss` failure
- [x] Fix JAX-only CI workflows

---------

Co-authored-by: Vladimir Zeljkovic <vzeljkovic@tenstorrent.com>

ssaliceTT requested review from AleksKnezevic, kmabeeTT, mrakitaTT, nvukobratTT, ppadjinTT, vkovinicTT and vzeljkovicTT as code owners March 17, 2026 05:04

ssaliceTT mentioned this pull request Mar 17, 2026

Transformers v5.2.0 Uplift tenstorrent/tt-xla#3371

Merged

2 tasks

ssaliceTT enabled auto-merge (squash) March 17, 2026 05:17

ssaliceTT disabled auto-merge March 17, 2026 05:52

ssaliceTT enabled auto-merge (squash) March 17, 2026 12:02

AleksKnezevic and others added 9 commits March 17, 2026 16:31

Changes for transformers uplift

5fd3aa4

Add file

2976567

fixed import errors for yolo_small, detr, maskformer and deepseek OCR…

f86a13e

…. Sentencizer had bigger rewrite done. Need to see if it works in CI.

fixed bert issue. Pre-compute 4D attention mask to bypass sdpa_mask()…

10323b3

… which creates non-splat constant tensors that Shardy cannot shard.

added per model requirements for EasyDel models. Set transformers ver…

c8a3f0a

…sion to the previous one.

moving to method-level imports

b96fc1e

removed bert fix added previously as tt-mlir change covers it.

a978b99

modified mistral loader for jax to have FlaxPretrainedModel import la…

ba745e0

…ter for just the variant that needs it.

ran pre-commit

776b940

AleksKnezevic approved these changes Mar 17, 2026

View reviewed changes

ssaliceTT force-pushed the aknezevic/hf_uplift branch from fb03e38 to 776b940 Compare March 17, 2026 16:31

ssaliceTT disabled auto-merge March 17, 2026 16:32

ssaliceTT merged commit a106f38 into main Mar 17, 2026
2 checks passed

ssaliceTT deleted the aknezevic/hf_uplift branch March 17, 2026 16:33

ssaliceTT merged 9 commits into main from aknezevic/hf_uplift
