Modify Muon optimizer by pass-lin · Pull Request #21885 · keras-team/keras

pass-lin · 2025-12-02T04:37:01Z

In this PR, we have introduced three improvements to Muon:

1. In the Muon optimizer, a subset of variables is usually designated to be optimized with Adam. Since different optimizers should not be assumed to share the same weight decay, we added an adam_weight_decay parameter.

2. The current implementation of Muon mainly follows the KellerJordan version, but the Moonlight version is now widely regarded as the better choice. Compared to the KellerJordan version, the Moonlight version changes the learning rate adjustment from max(d_out/d_in, 1)**0.5 to max(d_out, d_in)**0.5 * rate. The KellerJordan version has to assume which dimension of the matrix is the output dimension and which is the input dimension. As a general-purpose optimizer, we should not make such assumptions.
Additionally, the Moonlight version allows Muon and Adam to share the same weight decay and learning rate. We have added an rms_rate parameter to enable this feature, with a default value of 0.2. The feature can be disabled by setting rms_rate to None. We have also adjusted some default parameters based on the Moonlight version.

3. When we initially submitted the Muon optimizer, our understanding of Muon was not deep enough. As our research progressed, we found that Muon was designed with the assumption that the model is a Transformer. For 3D weights, one dimension must be treated as d_in and the remaining dimensions reshaped into d_out. However, unlike the 2D case, the 3D case does not always have a clear distinction between d_in and d_out. Therefore, out of caution, we use the Adam optimizer for everything other than 2D variables.
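
The rescaling described in item 2 can be sketched in plain Python. This is a minimal illustration, not the code from this PR: `moonlight_scale` is a hypothetical helper, and it assumes the Moonlight factor is rms_rate * sqrt(max(d_in, d_out)), with None meaning "no rescaling".

```python
import math


def moonlight_scale(shape, rms_rate=0.2):
    """Hypothetical helper: Moonlight-style multiplier for a 2D update.

    Assumes the factor rms_rate * sqrt(max(d_in, d_out)). Returns 1.0 when
    rms_rate is None, i.e. when the rescaling is disabled.
    """
    if rms_rate is None:
        return 1.0
    return rms_rate * math.sqrt(max(shape[0], shape[1]))


# The factor only looks at the larger dimension, so it does not matter
# whether the matrix is stored as (d_in, d_out) or (d_out, d_in).
print(moonlight_scale((256, 1024)))
print(moonlight_scale((1024, 256)))
print(moonlight_scale((256, 1024), rms_rate=None))
```

Because the factor is symmetric in the two dimensions, no assumption about the layout of the weight matrix is needed.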

gemini-code-assist · 2025-12-02T04:37:17Z

Summary of Changes

This pull request significantly refines the Muon optimizer by enhancing its flexibility and robustness. It introduces a dedicated weight decay parameter for Adam-optimized variables, updates the learning rate adjustment logic to a more widely recognized 'Moonlight version' for improved stability, and clarifies the optimizer's application scope to strictly 2D variables, delegating other dimensions to Adam. These changes aim to make Muon a more versatile and performant optimizer, especially in diverse model architectures.

Highlights

Separate Adam Weight Decay: Introduced an adam_weight_decay parameter to allow for distinct weight decay settings when the Adam optimizer is used for a subset of variables, addressing the issue of different optimizers potentially requiring different weight decay values.
Muon Learning Rate Adjustment (Moonlight Version): Updated the Muon optimizer's learning rate adjustment to align with the 'Moonlight version', which changes the scaling factor from max(d_out/d_in, 1)**0.5 to max(d_out, d_in)**0.5 * rate. A new rms_rate parameter (default 0.2) enables this feature, allowing Muon and Adam to share the same learning rate and weight decay; setting it to None disables it.
Strict 2D Variable Optimization for Muon: Refined the application scope of the Muon optimizer to strictly apply only to 2D variables. For any variables that are not 2D (e.g., 0D, 1D, 3D, or higher dimensions), the Adam optimizer will now be used, based on the understanding that Muon was originally designed with Transformer models and 2D weights in mind.
Default Parameter Adjustments: Adjusted several default parameters within the Muon optimizer, including adam_lr_ratio (from 0.1 to 1), weight_decay (from 0.1 to 0.004), and ns_steps (from 6 to 5), to better align with the 'Moonlight version' and improved practices.

pass-lin · 2025-12-02T04:38:48Z

@fchollet @hertschuh
In this PR, we removed the TF bug fix from #21859, because #21797 provides a more accurate fix. This PR focuses on improving the Muon algorithm, as the current Keras Muon optimizer still has many issues, which stem from our initially insufficient understanding of Muon.

gemini-code-assist

Code Review

This pull request introduces several valuable improvements to the Muon optimizer, including adding a separate weight decay for Adam, aligning with the Moonlight implementation, and restricting Muon updates to 2D variables for better stability. The changes are well-motivated and correctly implemented. I've provided a few suggestions to enhance docstring clarity and improve code readability, in line with the repository's style guide.

keras/src/optimizers/muon.py

codecov-commenter · 2025-12-02T04:43:16Z

Codecov Report

❌ Patch coverage is 80.95238% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.56%. Comparing base (a0004ee) to head (4e4f375).
⚠️ Report is 71 commits behind head on master.

Files with missing lines	Patch %	Lines
keras/src/optimizers/muon.py	80.95%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #21885      +/-   ##
==========================================
- Coverage   82.63%   82.56%   -0.07%     
==========================================
  Files         577      578       +1     
  Lines       59415    59804     +389     
  Branches     9313     9389      +76     
==========================================
+ Hits        49097    49377     +280     
- Misses       7913     8004      +91     
- Partials     2405     2423      +18

Flag	Coverage Δ
keras	`82.37% <80.95%> (-0.08%)`	⬇️
keras-jax	`62.89% <80.95%> (-0.44%)`	⬇️
keras-numpy	`57.46% <80.95%> (-0.11%)`	⬇️
keras-openvino	`34.33% <9.52%> (+0.02%)`	⬆️
keras-tensorflow	`64.42% <80.95%> (+0.30%)`	⬆️
keras-torch	`63.60% <80.95%> (-0.03%)`	⬇️

hertschuh

Have you tested this end-to-end with model.fit to see that it trains as expected?

Is there a way to compare with the original implementation?

hertschuh · 2025-12-02T19:59:22Z

keras/src/optimizers/muon.py

+        if len(variable.shape) != 2:
            return True


I'm not following this change. In the Moonlight implementation the criteria for using Muon is that the ndim >= 2: https://github.com/MoonshotAI/Moonlight/blob/master/examples/toy_train.py#L296

So for AdamW, the criteria would be ndim < 2.

The optimization target of Muon is matrices. In the 3D case, reshaping into matrices is necessary for effective optimization. However, this involves too many assumptions, and introducing it would only unnecessarily increase complexity. In fact, Muon never considered the case of CNNs. It was designed with only 1D-Transformer scenarios in mind.

In the original Moonlight implementation, the authors could rely on the optimization target being a PyTorch-based Transformer model. In the Keras implementation, we cannot guarantee this. For example, with the PyTorch backend, if a model mixes keras.layers.Dense and torch.nn.Linear, the optimization targets would include variables of shape [d_in, d_out] and [d_out, d_in] at the same time.

Similarly, if the optimization target is a 3D CNN model, the parameter meanings for the CNN model differ between the "channels_last" and "channels_first" formats. We lack reasonable assumptions to perform reshaping in such cases.

The Muon optimizer in Keras should be a general-purpose optimizer, and a general-purpose optimizer should not rely on too many assumptions. Therefore, we can only use the most conservative approach: we do not optimize anything other than matrices.

This is also why we do not use the Keller Jordan version: it needs to know whether the optimized matrix is laid out as [d_out, d_in] or [d_in, d_out], while the Moonlight version requires no such assumption.
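
The conservative routing rule discussed in this thread can be sketched as follows. `should_use_adam` is a hypothetical stand-in that covers only the rank check from the diff; any other exclusion rules the optimizer applies are left out, and the variable names and shapes are made up for illustration.

```python
def should_use_adam(shape):
    """Hypothetical stand-in for the rank check: only 2D variables go to Muon."""
    return len(shape) != 2


# Illustrative variables (names and shapes are assumptions).
variables = {
    "dense/kernel": (512, 2048),
    "dense/bias": (2048,),
    "conv2d/kernel": (3, 3, 64, 128),
    "conv3d/kernel": (3, 3, 3, 64, 128),
}

for name, shape in variables.items():
    optimizer = "Adam" if should_use_adam(shape) else "Muon"
    print(f"{name}: {optimizer}")
```

Under this rule the convolution kernels fall back to Adam, so no reshaping convention for "channels_last" or "channels_first" has to be assumed.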

Thanks for the explanation.

hertschuh · 2025-12-02T20:02:57Z

keras/src/optimizers/muon.py

        self.assign_sub(
            variable,
-            lr
-            * self.zeropower_via_newtonschulz5(g, self.ns_steps)
-            * max(1, shape[0] / shape[1]) ** 0.5,
+            self.lr_adjust(lr * update),
        )


nitpick: this can be on a single line now.

hertschuh · 2025-12-02T20:05:52Z

keras/src/optimizers/muon.py

+                continue
+            wd = ops.cast(weight_decay_value, variable.dtype)
+            lr = ops.cast(self.learning_rate, variable.dtype)
+            variable.assign(variable - variable * wd * lr)


Use self.assign(variable, variable - variable * wd * lr)

keras/keras/src/optimizers/base_optimizer.py

Line 967 in 846a297

variable.assign(variable - variable * wd * lr)

Here, I maintain consistency with the existing weight decay implementation.
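
For reference, the decay step quoted above amounts to the following arithmetic. This is a plain-Python sketch on lists of floats, not the optimizer code; `pick_weight_decay` is a hypothetical helper showing how a separate adam_weight_decay could be selected per variable.

```python
def apply_weight_decay(values, weight_decay, learning_rate):
    """One decoupled decay step: v <- v - v * wd * lr, element by element."""
    return [v - v * weight_decay * learning_rate for v in values]


def pick_weight_decay(uses_adam, weight_decay, adam_weight_decay):
    """Hypothetical selection rule: Adam-routed variables get their own decay."""
    return adam_weight_decay if uses_adam else weight_decay


muon_wd = pick_weight_decay(False, weight_decay=0.1, adam_weight_decay=0.01)
adam_wd = pick_weight_decay(True, weight_decay=0.1, adam_weight_decay=0.01)
print(apply_weight_decay([1.0, -2.0], muon_wd, learning_rate=0.5))
print(apply_weight_decay([1.0, -2.0], adam_wd, learning_rate=0.5))
```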

pass-lin · 2025-12-03T03:35:48Z

This is one of my pre-training tasks. Each epoch consists of 512 steps. It can be observed that Moonlight Muon not only trains normally, but also has a more stable loss than Adam.
@hertschuh

pass-lin · 2025-12-03T03:40:18Z

@hertschuh
I'd like to recommend a blog post to you. It is written by Su Jianlin, the author of Moonlight and RoPE. Since the original is in Chinese, I have translated it into English with Gemini.

A Guide to the Muon Optimizer: Quick Start and Key Details

During this period, I believe many readers have already come across news about the Muon optimizer. Muon was first proposed around last October by Keller Jordan on Twitter, which means it has been around for just over a year. However, in this single year, Muon has already endured the training scrutiny of models with billions, hundreds of billions, and even trillions of parameters, which is enough to prove that it is a highly competitive optimizer.

Muon is now built into training frameworks like Torch and Keras, and even large-scale frameworks like Megatron are gradually starting to support it, indicating that it has gained general acceptance in the industry. However, for readers who are only familiar with Adam, how to quickly and effectively switch to Muon may still be a confusing matter. Therefore, this article attempts to provide a quick start guide.

Brief Introduction

The official proposer of Muon is Keller Jordan, who currently works at OpenAI. As mentioned earlier, Muon was first published on Twitter, and to this day Jordan has written only a blog post, "Muon: An optimizer for hidden layers in neural networks," rather than a paper. His view is that "whether or not it is written as a Paper has nothing to do with whether the optimizer is effective [original quote]."

Muon is an optimizer specifically customized for matrix parameters. There are some related works with similar characteristics, such as Shampoo, and the earlier Stochastic Spectral Descent, etc. Many works can be associated with Muon to a greater or lesser extent, but none completely cover Muon, so the author considers Muon to be a completely new work.

In China, the earliest article to popularize Muon was probably this author's blog post, "Muon Optimizer Appreciation: An Essential Leap from Vector to Matrix," and the first large-scale model to verify Muon was our Moonlight, released in February. The Moonlight version of Muon proposed there was later used in the trillion-parameter K2 model. Following K2, GLM-4.5 also used this Muon variant.

As Jeremy Bernstein, one of Muon's authors, put it in his blog post "Deriving Muon," and as this author also believes, what makes Muon unique is that it can be derived from more fundamental optimization principles while also being effective in practice. In contrast, although Adam is also very effective, it is more like a heuristic solution.

Four Versions

This article does not intend to introduce the mathematical details or the implementation of Muon but focuses primarily on the technical details and precautions for switching from Adam to Muon. As mentioned, Muon is specifically for matrix parameter optimization and uses a non-element-wise update rule, which can be confusing for new users.

Furthermore, as far as the author knows, there are currently at least four slightly different versions of Muon, and this multi-version phenomenon exacerbates the confusion. If users don't understand the details, they might get poor results by setting the wrong hyperparameters (especially the learning rate). The following section will clarify these details. First, for a matrix $W \in \mathbb{R}^{d_{in} \times d_{out}}$ with gradient $G$, the four Muon variants are:

$$ M_t = \beta M_{t-1} + G_t $$

Naive Version:

$$W_t = W_{t-1} - \eta_t (\mathrm{msign}(M_t) + \lambda W_{t-1})$$

Keller Jordan Version:

$$W_t = W_{t-1} - \eta_t \left( \sqrt{\max(1, d_{out}/d_{in})} \mathrm{msign}(M_t) + \lambda W_{t-1} \right)$$

MuP Version:

$$W_t = W_{t-1} - \eta_t \left( \sqrt{d_{out}/d_{in}} \mathrm{msign}(M_t) + \lambda W_{t-1} \right)$$

Moonlight Version:

$$W_t = W_{t-1} - \eta_t \left( 0.2 \times \sqrt{\max(d_{out}, d_{in})} \mathrm{msign}(M_t) + \lambda W_{t-1} \right)$$

To enable Nesterov momentum, replace $\text{msign}(M_t)$ with $\text{msign}(\beta M_t + G_t)$. The $\text{msign}$ operation is usually named zeropower_via_newtonschulz in implementation, but ordinary users do not need to worry about the specific implementation details.
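
For readers who do want to see what $\text{msign}$ computes, here is a dependency-free sketch. It uses the classic cubic Newton-Schulz iteration $X \leftarrow 1.5X - 0.5XX^\top X$ rather than the tuned quintic polynomial found in zeropower_via_newtonschulz5, so it illustrates the idea and is not that routine. The helper names, the step count, and the default $\beta = 0.95$ are assumptions.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def msign(m, steps=30):
    """Approximate msign(M) = U V^T, where M = U S V^T is the SVD of M."""
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    norm = math.sqrt(sum(x * x for row in m for x in row)) + 1e-12
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x


def muon_direction(momentum, grad, beta=0.95, nesterov=True):
    """Return (new_momentum, msign of the momentum or its Nesterov variant)."""
    m = [[beta * a + b for a, b in zip(rm, rg)] for rm, rg in zip(momentum, grad)]
    if nesterov:
        inp = [[beta * a + b for a, b in zip(rm, rg)] for rm, rg in zip(m, grad)]
    else:
        inp = m
    return m, msign(inp)


grad = [[3.0, 0.0], [0.0, -2.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
new_momentum, direction = muon_direction(zeros, grad)
print(direction)
```

All singular values of the result are pushed towards 1, which is why the update is non-element-wise: it depends on the matrix as a whole, not on each entry separately.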

The only difference between the four versions is the scaling factor before $\text{msign}$. The "Keller Jordan Version" and the "MuP Version" are largely similar, while the "Moonlight Version" is slightly more unique. Keras has only implemented the "Keller Jordan Version," while Torch has implemented the "Keller Jordan Version" and the "Moonlight Version." The Naive Version seems to be relatively uncommon. The author of this article frequently uses their self-written "MuP Version."
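
The four scaling factors from the formulas above can be written side by side. This is a small sketch for comparison only; `muon_scale` and the version labels are hypothetical names.

```python
import math


def muon_scale(version, d_in, d_out):
    """Scaling factor placed in front of msign for each variant."""
    if version == "naive":
        return 1.0
    if version == "keller_jordan":
        return math.sqrt(max(1.0, d_out / d_in))
    if version == "mup":
        return math.sqrt(d_out / d_in)
    if version == "moonlight":
        return 0.2 * math.sqrt(max(d_out, d_in))
    raise ValueError(f"unknown version: {version}")


for version in ("naive", "keller_jordan", "mup", "moonlight"):
    # A square matrix and a matrix that shrinks the dimension by 4x.
    print(version, muon_scale(version, 1024, 1024), muon_scale(version, 1024, 256))
```

For a square matrix the first three factors coincide at 1, while the Moonlight factor grows with the matrix size.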

The Two Dimensions

Here, we must pay attention to an important detail: the "Keller Jordan Version" and the "MuP Version" are sensitive to the order of $d_{in}$ and $d_{out}$. Therefore, the first step is to clarify the meaning of $d_{in}$ and $d_{out}$; it is not the case that the first dimension of the matrix is always $d_{in}$ and the second dimension is $d_{out}$.

$d_{in}$ and $d_{out}$ refer to the input and output dimensions of the linear layer, respectively. Determining which is $d_{in}$ and which is $d_{out}$ depends on the specific implementation of the linear layer. For example, in Keras's Dense layer, the implementation is $xW$, so the matrix $W$'s first dimension is $d_{in}$ and the second is $d_{out}$. However, Torch's Linear layer implements $xW^\top$, so the matrix $W$'s second dimension is $d_{in}$ and the first is $d_{out}$.

Therefore, to implement the "Keller Jordan Version" of Muon, the scaling factor for Torch's Linear layer should be $\sqrt{\max(1, W\text{.shape}[0]/W\text{.shape}[1])}$, while for Keras, it should be $\sqrt{\max(1, W\text{.shape}[1]/W\text{.shape}[0])}$. Consequently, the current Keras Muon implementation is actually incorrect because it copied Torch's scaling factor implementation.
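
A short sketch makes the layout issue concrete. `keller_jordan_scale` and its `layout` argument are hypothetical; the two layouts follow the $xW$ and $xW^\top$ conventions described above.

```python
import math


def keller_jordan_scale(shape, layout):
    """Keller Jordan factor for a 2D weight stored in the given layout.

    layout="keras": weight stored as (d_in, d_out), as in y = x W.
    layout="torch": weight stored as (d_out, d_in), as in y = x W^T.
    """
    if layout == "keras":
        d_in, d_out = shape
    elif layout == "torch":
        d_out, d_in = shape
    else:
        raise ValueError(f"unknown layout: {layout}")
    return math.sqrt(max(1.0, d_out / d_in))


# The same layer (256 inputs, 1024 outputs) stored in both conventions.
keras_kernel_shape = (256, 1024)
torch_weight_shape = (1024, 256)

print(keller_jordan_scale(keras_kernel_shape, "keras"))
print(keller_jordan_scale(torch_weight_shape, "torch"))
# Reading the Keras kernel with Torch's index order changes the factor.
print(keller_jordan_scale(keras_kernel_shape, "torch"))
```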

If you write your own model, you need to judge carefully based on your own code. Of course, if you find figuring this out too troublesome, you can consider using the "Moonlight Version," whose scaling factor is symmetric with respect to $d_{in}$ and $d_{out}$.

Hyperparameter Settings

After clarifying $d_{in}$ and $d_{out}$, the remaining issue is how to set the learning rate $\eta_t$ and the weight decay coefficient $\lambda$. The assumption here is that the user already has experience tuning Adam, has achieved good results with Adam, and wants to quickly switch to Muon for a trial run.

Let's look at the "Moonlight Version" first. Its scaling factor is derived by aligning with the Update RMS of Adam. Simply put, the "Moonlight Version" Muon aligns with Adam's update magnitude, so the simplest way to migrate from Adam is: don't change anything; just use the same $\eta_t$ and $\lambda$ as Adam.

Next, consider the remaining three versions. Mainstream models usually have a `hidden_size` (denoted as $d$), and the shapes of the model's matrices mostly do not deviate significantly from $d \times d$. We can therefore approximate by setting $d_{in} = d_{out} = d$. In this case, these three versions are identical and lack the factor of $0.2 \sqrt{d}$ compared to the "Moonlight Version." Since the "Moonlight Version" aligns with Adam's update magnitude without changing hyperparameters, the learning rate for these three versions should be scaled up by a factor of $0.2 \sqrt{d}$ to align with Adam's update magnitude. Correspondingly, $\lambda$ should be divided by $0.2 \sqrt{d}$.

Substituting $d=1024, 2048, 4096$, the results for $0.2 \sqrt{d}$ are approximately $6.4, 9, 12.8$. If you can't remember $0.2 \sqrt{d}$, you can simply remember that if we use the other three versions of Muon, we should generally multiply the Adam learning rate by $10$ to use as the Muon learning rate. If you directly plug the Adam learning rate into Muon, you will get the conclusion that Muon is far inferior to Adam due to underfitting. As far as the author knows, some negative reviews of Muon stem from this.
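
The conversion rule above is simple enough to write down as a helper. This is a sketch under the $d_{in} = d_{out} = d$ approximation; `adam_to_muon_hparams` is a hypothetical name, and the Adam values in the example are arbitrary.

```python
import math


def adam_to_muon_hparams(adam_lr, adam_wd, hidden_size, rms_rate=0.2):
    """Convert Adam hyperparameters for the non-Moonlight Muon versions.

    The learning rate is multiplied by rms_rate * sqrt(d) and the weight
    decay is divided by the same factor, so lr * wd stays unchanged.
    """
    factor = rms_rate * math.sqrt(hidden_size)
    return adam_lr * factor, adam_wd / factor


for d in (1024, 2048, 4096):
    lr, wd = adam_to_muon_hparams(adam_lr=1e-3, adam_wd=0.1, hidden_size=d)
    print(d, lr, wd)
```

With the "Moonlight Version" no conversion is needed, since the factor is already built into the update.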

Does this mean the "Moonlight Version" is easier to use? The "Moonlight Version" indeed has good practical results, but saying it's better is evaluating it from the perspective of Adam. The advantage of the "MuP Version" or "Keller Jordan Version" is learning rate transferability, meaning that a learning rate tuned on a small model often works well when applied directly to a large model.

Other Parameters

If Muon only handles matrix parameters, what about the other parameters? For example, the Bias term of linear layers or the $\gamma$ term of RMSNorm are 1-dimensional parameters; and convolution layers might have 3-dimensional or 4-dimensional array parameters.

Let me first correct myself: Muon does not just handle matrix parameters; Muon only handles "matrix parameters of densely-input linear layers." If the reader finds this confusing, just remember that the matrix parameters of the Embedding layer and the final classification layer (including the GPT's LM Head) should not use Muon, or the effect will be noticeably worse. For these matrix parameters that cannot use Muon, as well as 1D, 3D, and higher-dimensional parameters, if the reader doesn't want to overthink it, they can just use Adam. Muon implementations are basically a mix with Adam, allowing users to select certain layers to use Adam.

If the reader is willing to tinker, then 3D or 4D parameters, such as those in convolution layers, can also use Muon. Taking Conv2D as an example, the kernel shape is usually $(w, h, d_{in}, d_{out})$. Its equivalent implementation is to flatten the $(w, h, d_{in})$ Patch input into a $w \times h \times d_{in}$ vector, and then reshape the kernel to $(w \times h \times d_{in}, d_{out})$ before performing matrix multiplication. So, to use Muon, you must first reshape the momentum to $(w \times h \times d_{in}, d_{out})$, calculate $\text{msign}$, and then reshape it back for the update.
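
The reshape round trip can be sketched with nested lists. This assumes the kernel is stored in $(w, h, d_{in}, d_{out})$ order; the function names are hypothetical, and a real implementation would use the backend's reshape op instead.

```python
def kernel_to_matrix(kernel):
    """Flatten a (w, h, d_in, d_out) nested list into (w*h*d_in, d_out) rows."""
    return [list(row) for plane in kernel for line in plane for row in line]


def matrix_to_kernel(matrix, w, h, d_in):
    """Inverse of kernel_to_matrix: regroup the rows into (w, h, d_in, d_out)."""
    if len(matrix) != w * h * d_in:
        raise ValueError("matrix has the wrong number of rows")
    rows = iter(matrix)
    return [[[list(next(rows)) for _ in range(d_in)] for _ in range(h)]
            for _ in range(w)]


# A small kernel with w=2, h=3, d_in=4, d_out=5 and distinct entries.
w, h, d_in, d_out = 2, 3, 4, 5
kernel = [[[[float(((i * h + j) * d_in + k) * d_out + o) for o in range(d_out)]
            for k in range(d_in)] for j in range(h)] for i in range(w)]

matrix = kernel_to_matrix(kernel)
print(len(matrix), len(matrix[0]))
# msign would be applied to `matrix` here, then the result reshaped back.
restored = matrix_to_kernel(matrix, w, h, d_in)
print(restored == kernel)
```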

Similarly, the $\gamma$ parameter of RMSNorm can be viewed as multiplication by a diagonal matrix. By treating its momentum as a diagonal matrix, $\text{msign}$ can also be calculated, which is equivalent to SignSGDM. The Embedding layer can be viewed as multiple $(1, d)$ matrices for $\text{msign}$ calculation, resulting in Normalized SGDM. If you want to go further, such as with Multi-Head Attention, you might consider whether the projection matrix for each Head can be individually isolated to calculate $\text{msign}$...
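
Both special cases have closed forms, which the sketch below spells out: for a diagonal matrix $\text{msign}$ reduces to an element-wise sign, and for a $(1, d)$ matrix it reduces to dividing the row by its Euclidean norm. The function names are hypothetical.

```python
import math


def msign_diagonal(gamma_momentum):
    """msign of diag(m) is diag(sign(m)): an element-wise sign (SignSGDM)."""
    return [math.copysign(1.0, m) if m != 0 else 0.0 for m in gamma_momentum]


def msign_row(row_momentum):
    """msign of a (1, d) matrix is the row divided by its Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in row_momentum))
    if norm == 0:
        return list(row_momentum)
    return [x / norm for x in row_momentum]


print(msign_diagonal([0.3, -2.0, 0.0]))
print(msign_row([3.0, 4.0]))
```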

Never stop tinkering!

Expected Results

Finally, if the user follows the instructions above, sets everything up correctly, and starts running, then they can begin praying for the arrival of the goddess of luck.

What kind of result should we expect? If there are no anomalies such as gradient explosion, Muon will usually be slightly better than Adam in most cases. Of course, it is not ruled out that Muon may be slightly worse in some situations, but in any case, the difference between them will not be very large. If one is significantly better than the other, it might be necessary to reconsider if there was a problem with the settings on either side.

However, this is not absolute. For example, under certain extreme settings, it is possible for Muon to be much better than Adam, with Adam failing to converge no matter how it is tuned. Conversely, the opposite is rarely encountered. In summary, I wish you good luck. If you encounter any interesting phenomena, you are welcome to discuss and analyze them together.

Source Link: https://kexue.fm/archives/11416

hertschuh · 2025-12-06T17:22:34Z

keras/src/optimizers/muon.py

+        if len(variable.shape) != 2:
            return True


Thanks for the explanation.

(keras-team#22057) * Fix discretization symbolic one hot (keras-team#22048) * Implement complete Keras-Orbax checkpoint integration (keras-team#22002) * Increase test coverage for StringLookup preprocessing layer (keras-team#22056) * Set mutable to True by default in nnx_metadata (keras-team#22074) * Adds Asymmetric INT4 Sub-Channel Quantization Support (keras-team#22007) * Allow passing variables to a function with `@custom_gradient`. (keras-team#22069) * Disallow TFSMLayer deserialization in safe_mode to prevent external S… * Remove redundant global seed initialization code. (keras-team#22084) * Add `Muon` to the list of all optimizer classes. (keras-team#22083) * Implement tile function for openvino backend (keras-team#22071) * implement nansum ops for openvino backend (keras-team#22078) * Remove `testing.uses_cpu()` and re-implement for JAX. (keras-team#22087) * benchmarks: add RandomRotation tf.data performance benchmark (keras-team#21986) * Fix arctan2 NaN propagation in OpenVINO backend (keras-team#22064) * Validate positive height and width in image resize (keras-team#22079) * Don't skip some JAX linalg tests on JAX. (keras-team#22091) * Implement nanprod function in keras.ops (keras-team#22089) * Increase test coverage for TextVectorization layer (keras-team#22066) * Bump the github-actions group with 2 updates (keras-team#22093) * fix: pytorch onnx export symbolic test (keras-team#22086) * Improvements to `*_uses_gpu` and `*_uses_tpu`. (keras-team#22088) * Implement cross product operation for OpenVINO backend (keras-team#22096) * Fail fast on invalid convolution output shapes during symbolic execut… * Fix Normalization broadcasting for scalar and multidim mean and varia… * Standardize the way tests are skipped based on backend and accelerato… * Don't call `pythonify_logs` within `get_metrics_result`. (keras-team#22107) * Fix gaussian_blur padding calculation for even kernel sizes (keras-team#22054) * Adjust JAX variable initializer jitting criteria. 
(keras-team#22116) * Exclude conv transpose tests on TPU. (keras-team#22117) * Remove incorrect but dead code in `BaseOptimizer.stateless_apply`. (#… * Implement tensordot operation for OpenVINO backend (keras-team#22098) * Fix bounding box docstring references (keras-team#22110) * feat: add depth_to_space and space_to_depth ops (keras-team#22112) * Fix sparse reshape test with Numpy 2.4. (keras-team#22141) * Fix vocabulary reload corruption caused by trailing newline handling … * Add support for dynamic dimensions in `ops.slice.compute_output_spec`… * Revamp graph validation in `Function.__init__`. (keras-team#22153) * Fix: draw_bounding_boxes float32 to uint8 conversion (keras-team#22129) * Implement dstack function across all backends (keras-team#22120) * Add exp2 operation to OpenVINO backend (keras-team#22131) * Add trunc operation to OpenVINO backend (keras-team#22134) * Fix: add missing validation for output padding < strides (keras-team#22130) * docs: Add guide on resuming training from weight-only checkpoints (#2… * feat(openvino): upgrade opset to opset15 (keras-team#22159) * Fix order-dependent float16/bfloat16 promotion in cast_to_common_dtyp… * Fix TrackedDict constructor to support iterable (key, value) inputs (… * Implement numpy.gcd using Euclidean algorithm for OpenVINO backend (#… * [Keras 3] Refactor ExportArchive to be a dispatcher for different exp… * [Keras 3] Refactor ExportArchive to be a dispatcher for different exp…

modify muon.

22b37a4

google-ml-butler bot added the size:M label Dec 2, 2025

google-ml-butler bot assigned gbaned Dec 2, 2025

gemini-code-assist bot reviewed Dec 2, 2025

4 review comments on keras/src/optimizers/muon.py (outdated, resolved)

modify gemini review.

0737fc4

pass-lin mentioned this pull request Dec 2, 2025

Modify Muon optimizer #21859

Closed

pass-lin changed the title ~~modify muon.~~ Modify Muon optimizer Dec 2, 2025

hertschuh reviewed Dec 2, 2025

hertschuh added the stat:awaiting response from contributor label Dec 3, 2025

google-ml-butler bot removed the stat:awaiting response from contributor label Dec 3, 2025

modify

4e4f375

hertschuh approved these changes Dec 6, 2025

google-ml-butler bot added the kokoro:force-run and ready to pull labels Dec 6, 2025

kokoro-team removed the kokoro:force-run label Dec 6, 2025

hertschuh merged commit d130816 into keras-team:master Dec 7, 2025
13 of 15 checks passed

pass-lin commented Dec 3, 2025 • edited

A Guide to the Muon Optimizer: Quick Start and Key Details

Brief Introduction

Four Versions

The Two Dimensions

Hyperparameter Settings

Other Parameters

Expected Results
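
For readers comparing the two learning-rate adjustments discussed in this PR, here is a minimal sketch in plain Python. It is not the Keras implementation. The function names are hypothetical, and the square root in the Moonlight-style rule is an assumption taken from the Moonlight report's `0.2 * sqrt(max(A, B))` form (the PR description writes the rule without it). Treating `rms_rate=None` as "no adjustment" and routing non-2D weights to Adam follow the PR description.

```python
import math


def keller_jordan_scale(d_in, d_out):
    """KellerJordan-style factor: max(d_out / d_in, 1) ** 0.5.

    Depends on which axis is treated as the input dimension.
    """
    return max(d_out / d_in, 1.0) ** 0.5


def moonlight_scale(shape, rms_rate=0.2):
    """Moonlight-style factor (assumed form): rms_rate * sqrt(max(shape)).

    Symmetric in the two dimensions, so no d_in/d_out assumption is needed.
    Returns 1.0 (no adjustment) when rms_rate is None.
    """
    if rms_rate is None:
        return 1.0
    return rms_rate * math.sqrt(max(shape))


def uses_muon(shape):
    """Per the PR description, only 2D weights go to Muon; the rest use Adam."""
    return len(shape) == 2


if __name__ == "__main__":
    for shape in [(1024, 4096), (4096, 1024)]:
        d_in, d_out = shape
        print(shape, keller_jordan_scale(d_in, d_out), moonlight_scale(shape))
```

The KellerJordan-style factor changes when the two axes are swapped, while the Moonlight-style factor does not, which is the motivation given in the PR description for not assuming an axis order.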

5 participants
