Conversation
Summary of ChangesHello @pass-lin, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refines the Muon optimizer by enhancing its flexibility and robustness. It introduces a dedicated weight decay parameter for Adam-optimized variables, updates the learning rate adjustment logic to a more widely recognized 'Moonlight version' for improved stability, and clarifies the optimizer's application scope to strictly 2D variables, delegating other dimensions to Adam. These changes aim to make Muon a more versatile and performant optimizer, especially in diverse model architectures. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
@fchollet @hertschuh |
There was a problem hiding this comment.
Code Review
This pull request introduces several valuable improvements to the Muon optimizer, including adding a separate weight decay for Adam, aligning with the Moonlight implementation, and restricting Muon updates to 2D variables for better stability. The changes are well-motivated and correctly implemented. I've provided a few suggestions to enhance docstring clarity and improve code readability, in line with the repository's style guide.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #21885 +/- ##
==========================================
- Coverage 82.63% 82.56% -0.07%
==========================================
Files 577 578 +1
Lines 59415 59804 +389
Branches 9313 9389 +76
==========================================
+ Hits 49097 49377 +280
- Misses 7913 8004 +91
- Partials 2405 2423 +18
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
hertschuh
left a comment
There was a problem hiding this comment.
Have you tested this end-to-end with model.fit to see that it trains as expected?
Is there a way to compare with the original implementation?
| if len(variable.shape) != 2: | ||
| return True |
There was a problem hiding this comment.
I'm not following this change. In the Moonlight implementation the criteria for using Muon is that the ndim >= 2: https://github.com/MoonshotAI/Moonlight/blob/master/examples/toy_train.py#L296
So for AdamW, the criteria would be ndim < 2.
There was a problem hiding this comment.
I'm not following this change. In the Moonlight implementation the criteria for using Muon is that the : https://github.com/MoonshotAI/Moonlight/blob/master/examples/toy_train.py#L296`ndim >= 2`
So for AdamW, the criteria would be .
ndim < 2
The optimization target of Muon is matrices. In the 3D case, reshaping into matrices is necessary for effective optimization. However, this involves too many assumptions, and introducing it would only unnecessarily increase complexity. In fact, Muon never considered the case of CNNs. It was designed with only 1D-Transformer scenarios in mind.
There was a problem hiding this comment.
I'm not following this change. In the Moonlight implementation the criteria for using Muon is that the : https://github.com/MoonshotAI/Moonlight/blob/master/examples/toy_train.py#L296`ndim >= 2`
So for AdamW, the criteria would be .
ndim < 2
In the original implementation of MoonLight, they could ensure that the optimization target is a Transformer model based on PyTorch. However, in the Keras implementation, we cannot guarantee this. For example, in a typical case with the PyTorch backend, if we mix keras.layers.Dense and torch.nn.Linear, then the optimization targets would simultaneously include variables of shape [d_out, d_in] and [d_in, d_out].
Similarly, if the optimization target is a 3D CNN model, the parameter meanings for the CNN model differ between the "channels_last" and "channels_first" formats. We lack reasonable assumptions to perform reshaping in such cases.
The Muon optimizer in Keras should be a general-purpose optimizer, and a general-purpose optimizer should not rely on too many assumptions. Therefore, we can only use the most conservative approach: we do not optimize anything other than matrices.
This is also the reason why we do not use the Keller Jordan Version. The Keller Jordan Version assumes that the optimized matrix must be either [d_out, d_in] or [d_in, d_out], while MoonLight does not require such assumptions.
There was a problem hiding this comment.
Thanks for the explanation.
keras/src/optimizers/muon.py
Outdated
| self.assign_sub( | ||
| variable, | ||
| lr | ||
| * self.zeropower_via_newtonschulz5(g, self.ns_steps) | ||
| * max(1, shape[0] / shape[1]) ** 0.5, | ||
| self.lr_adjust(lr * update), | ||
| ) |
There was a problem hiding this comment.
nitpick: this can be on a single line now.
| continue | ||
| wd = ops.cast(weight_decay_value, variable.dtype) | ||
| lr = ops.cast(self.learning_rate, variable.dtype) | ||
| variable.assign(variable - variable * wd * lr) |
There was a problem hiding this comment.
Use self.assign(variable, variable - variable * wd * lr)
There was a problem hiding this comment.
Use
self.assign(variable, variable - variable * wd * lr)
keras/keras/src/optimizers/base_optimizer.py
Line 967 in 846a297
Here, I maintain consistency with the existing weight decay implementation.
This is one of my pre-training tasks. Each epoch consists of 512 steps. It can be observed that MoonLight Muon not only trains normally, but also has a more stable loss compared to Adam. |
|
@hertschuh A Guide to the Muon Optimizer: Quick Start and Key DetailsDuring this period, I believe many readers have already come across news about the Muon optimizer. Muon was first proposed around last October by Keller Jordan on Twitter, which means it has been around for just over a year. However, in this single year, Muon has already endured the training scrutiny of models with billions, hundreds of billions, and even trillions of parameters, which is enough to prove that it is a highly competitive optimizer. Muon is now built into training frameworks like Torch and Keras, and even large-scale frameworks like Megatron are gradually starting to support it, indicating that it has gained general acceptance in the industry. However, for readers who are only familiar with Adam, how to quickly and effectively switch to Muon may still be a confusing matter. Therefore, this article attempts to provide a quick start guide. Brief IntroductionThe official proposer of Muon is Keller Jordan, who currently works at OpenAI. As mentioned earlier, Muon was first published on Twitter, and to this day, the author has only written a blog post, ["Muon: An optimizer for hidden layers in neural networks,"](https://www.youtube.com/watch?v=dQw4w9WgXcQ) instead of a paper. The author's view is that "whether or not it is written as a Paper has nothing to do with whether the optimizer is effective [original quote]." Muon is an optimizer specifically customized for matrix parameters. There are some related works with similar characteristics, such as Shampoo, and the earlier Stochastic Spectral Descent, etc. Many works can be associated with Muon to a greater or lesser extent, but none completely cover Muon, so the author considers Muon to be a completely new work. In China, the earliest article to popularize Muon was probably the author's blog post, Muon Optimizer Appreciation: An Essential Leap from Vector to Matrix, and the first large-scale model to verify Muon was our Moonlight, released in February, whose proposed Moonlight version of Muon was used in the subsequent trillion-parameter K2 model. Following K2, GLM-4.5 also used this Muon variant. As Jeremy Bernstein, one of Muon's authors, stated in his blog post Deriving Muon, for the author of this article, Muon's uniqueness lies in the fact that it can be derived from more fundamental optimization principles and is effective in practice. In contrast, although Adam is also very effective, it is more like a heuristic solution. Four VersionsThis article does not intend to introduce the mathematical details or the implementation of Muon but focuses primarily on the technical details and precautions for switching from Adam to Muon. As mentioned, Muon is specifically for matrix parameter optimization and uses a non-element-wise update rule, which can be confusing for new users. Furthermore, as far as the author knows, there are currently at least four slightly different versions of Muon, and this multi-version phenomenon exacerbates the confusion. If users don't understand the details, they might get poor results by setting the wrong hyperparameters (especially the learning rate). The following section will clarify these details. First, for a matrix Naive Version (朴素版): Keller Jordan Version (Keller Jordan 版): MuP Version (MuP 版): Moonlight Version (Moonlight 版): To enable Nesterov momentum, replace The only difference between the four versions is the scaling factor before The Two DimensionsHere, we must pay attention to an important detail: the "Keller Jordan Version" and the "MuP Version" are sensitive to the order of Therefore, to implement the "Keller Jordan Version" of Muon, the scaling factor for Torch's Linear layer should be If you write your own model, you need to judge carefully based on your own code. Of course, if you find figuring this out too troublesome, you can consider using the "Moonlight Version," whose scaling factor is symmetric with respect to Hyperparameter SettingsAfter clarifying Let's look at the "Moonlight Version" first. Its scaling factor is derived by aligning with the Update RMS of Adam. Simply put, the "Moonlight Version" Muon aligns with Adam's update magnitude, so the simplest way to migrate from Adam is: don't change anything; just use the same Next, consider the remaining three versions. We know that mainstream models usually have a Substituting Does this mean the "Moonlight Version" is easier to use? The "Moonlight Version" indeed has good practical results, but saying it's better is evaluating it from the perspective of Adam. The advantage of the "MuP Version" or "Keller Jordan Version" is learning rate transferability, meaning that a learning rate tuned on a small model often works well when applied directly to a large model. Other ParametersIf Muon only handles matrix parameters, what about the other parameters? For example, the Bias term of linear layers or the Let me first correct myself: Muon does not just handle matrix parameters; Muon only handles "matrix parameters of densely-input linear layers." If the reader finds this confusing, just remember that the matrix parameters of the Embedding layer and the final classification layer (including the GPT's LM Head) should not use Muon, or the effect will be noticeably worse. For these matrix parameters that cannot use Muon, as well as 1D, 3D, and higher-dimensional parameters, if the reader doesn't want to overthink it, they can just use Adam. Muon implementations are basically a mix with Adam, allowing users to select certain layers to use Adam. If the reader is willing to tinker, then 3D or 4D parameters, such as those in convolution layers, can also use Muon. Taking Conv2D as an example, the kernel shape is usually Similarly, the Never stop tinkering! Expected ResultsFinally, if the user follows the instructions above, sets everything up correctly, and starts running, then they can begin praying for the arrival of the goddess of luck. What kind of result should we expect? If there are no anomalies such as gradient explosion, Muon will usually be slightly better than Adam in most cases. Of course, it is not ruled out that Muon may be slightly worse in some situations, but in any case, the difference between them will not be very large. If one is significantly better than the other, it might be necessary to reconsider if there was a problem with the settings on either side. However, this is not absolute. For example, under certain extreme settings, it is possible for Muon to be much better than Adam, with Adam failing to converge no matter how it is tuned. Conversely, the opposite is rarely encountered. In summary, I wish you good luck. If you encounter any interesting phenomena, you are welcome to discuss and analyze them together. Source Link: https://kexue.fm/archives/11416 |
| if len(variable.shape) != 2: | ||
| return True |
There was a problem hiding this comment.
Thanks for the explanation.
* Implement logaddexp2 function in keras.ops (keras-team#21691) * [Keras 3 OpenVINO Backend]: Support numpy.sort (keras-team#21687) * [Keras 3 OpenVINO Backend]: Support numpy.median operation (keras-team#21667) * Fix deadlock in `CallbackList`. (keras-team#21701) * [OpenVINO backend] solve randomuniform issue (keras-team#21670) * Bug fixes with variable handling in `LossScaleOptimizer`. (keras-team#21706) * Do not use backend ops in `ProgBar`. (keras-team#21709) * Fix the Doc of the combination relation in func `keras.layers.Normali… * Remove reliance on `__jax_array__` to unwrap variables. (keras-team#21719) * Bump the github-actions group with 6 updates (keras-team#21705) * Add linspace and logspace implementations in OpenVINO NumPy backend (… * Add jvp op (keras-team#21720) * Add unfold op (keras-team#21685) * Add the description that `0` should not in the arg `axes` in `keras.l… * Add daily Python 3.13 CPU-only tests to nightly workflow (keras-team#21566) * Fix histogram op for symbolic inputs (keras-team#21729) * Relax tolerance for svd test (keras-team#21731) * Use jax.enable_x64 in place of jax.experimental.disable_x64 (keras-team#21734) * Refactor variable serialization. (keras-team#21713) * Ensure keras.ops.eye behavior is consistent across backends. (keras-team#21738) * Add `eye` support for OpenVINO backend (keras-team#21739) * Update Torch and Tensorflow versions in cuda requirements files. (keras-team#21… * Implement isreal function in keras.ops (keras-team#21740) * Remove the unused jax `enable_x64`. (keras-team#21737) * Correct implementation for several OpenVINO operations (keras-team#21746) * Sets `is_gptq_calibrated` flag when deserializing GPTQ models (keras-team#21748) * Correct implementation for several OpenVINO operations (keras-team#21752) * Fix the Bug in func `preprocess_input` when `x` in 3D and `data_forma… * Update Torch to 2.9.0 on GPU. (keras-team#21756) * `StringLookup` & `IntegerLookup` now save vocabulary loaded from file… * Implement trapezoid function in keras.ops (keras-team#21757) * Upstream `ReversibleEmbedding` from KerasHub. (keras-team#21753) * Raise exception on batch_size mismatch for stateful RNNs (keras-team#21742) * Propose a method for handling datasets which doesn't explicitly requi… * Use `filter="data"` option of `TarFile.extractall`. (keras-team#21760) * Add Distillation API to Keras (keras-team#21572) * removes unnecessary try-catch blocks and guard conditions (keras-team#21767) * cleanup distillation loss names (keras-team#21766) * Document that `set_backend` requires re-importing keras. (keras-team#21764) * Fix discretization discrepancy (keras-team#21769) * fix sas metrics in jax `fit` (keras-team#21765) * Support for extracting volume patches (keras-team#21759) * Fix negative index handling in MultiHeadAttention attention_axes (keras-team#21… * Make confusion metrics compilable. (keras-team#21775) * Suport keras.op.view() to view the same data bitwise at a new dtype … * Fix: `keras.ops.quantile` works with tf graph execution (keras-team#21782) * Fix typo in Distiller docstring * Add warning to `set_backend` and more detailed example. (keras-team#21787) * Don't fail `Variable.__repr__` if the value cannot be retrieved. (keras-team#21… * Update Keras backend installation instructions * Fix: Support 'jpg' format in keras.utils.save_img() (keras-team#21683) * Fix tf dataset detection logic. (keras-team#21794) * update test after jax.config.jax_vjp3 is enabled (keras-team#21776) * Add keras.ops.array_split for Tensor Parallelism Support (keras-team#21697) * Adding get_device_count function to the distribution_lib (keras-team#21791) * Fix: use raw string for CALIBRATION_TEXT (keras-team#21790) * Add backend compatibility table to documentation (keras-team#21733) * More OpenVINO Operations (keras-team#21774) * Support scalar view for tf backend. (keras-team#21802) * Address bug with convolution using Tensorflow, Numpy, Jax backends (#… * Fix bug with correlate for tensorflow (keras-team#21778) * Pass optional field in a few places to fix None input error. (keras-team#21818) * Fix(backend/torch): Resolve MPS broadcast crash in binary_crossentrop… * Fix broken example indentation in Keras io (keras-team#21807) * Add missing `convert_to_tensor` to `take_along_axis` on JAX. (keras-team#21825) * Added numpy.digitize support for OPENVINO backend (keras-team#21824) * Bump the github-actions group with 4 updates (keras-team#21809) * Fix typo in CONTRIBUTING.md (keras-team#21812) * Fix `Progbar.update` when receiving list, np arrays, and tensors. (#2… * Fix CosineDecay documentation to clarify alpha is a multiplier (keras-team#21827) * Fix noise_shape validation in keras.layers.Dropout (keras-team#21819) * Fix typos in some files (keras-team#21830) * Fix failing sklearn tests following release of pytest 9.0. (keras-team#21843) * Implement empty_like function in keras.ops (keras-team#21840) * Run tests on TPU (keras-team#21425) * Fix typo in variable name 'embeding' to 'embedding' (keras-team#21845) * Fix name_scope_stack AttributeError and IndexError in __exit__ (keras-team#21834) * Update keras3 Softmax mask handling to be more numerically robust. (#… * Support jax2tf in JaxLayer for tf backend (keras-team#21842) * Fix assigning a value to a variable within an autocast scope. (keras-team#21864) * Add note about label noise in CIFAR-10 dataset documentation (keras-team#21855) * Allow None inputs in `Layer.build`. (keras-team#21866) * `standardize_shape` normalizes the dimensions and tuple. (keras-team#21867) * Improve error message when layer/model input validation fails. (keras-team#21869) * Add verbose logging when ModelCheckpoint callback is done saving ... … * [OpenVINO backend] Remove deprecated openvino.runtime import (keras-team#21826) * Fix Torch output_padding constraint for ConvTranspose layers (keras-team#21852) * Support PyDataset in Normalization layer `adapt` methods (keras-team#21817) * Fix test failures when nnx is enabled (keras-team#21875) * Implement ldexp function in keras.ops (keras-team#21863) * Added OrbaxCheckpoint for keras 3.0 for Data centric saving and resto… * Add raise_error option to TerminateOnNaN for immediate termination on… * Fix NNX tests (keras-team#21884) * `keras.utils.set_random_seed` clear the global `SeedGenerator`. (keras-team#21874) * fix tpu test (keras-team#21893) * Model Export to liteRT (keras-team#21674) * Fix: torch layer losses keyword arguments in rematscope (keras-team#21865) * Add label to trigger TPU tests manually. (keras-team#21897) * Support tpu tests allowing tpu precision for matmul (keras-team#21887) * remove log (keras-team#21901) * Introduces layer filtering for quantization and fixes GPTQ dependency… * Replace `np.reshape(x, newshape=y)` with `np.reshape(x, y)`. (keras-team#21899) * Modified Dense layer documentation for use_bias with batch normalizat… * [OpenVINO Backend] Support np.diag (keras-team#20967) * Modify Muon optimizer (keras-team#21885) * Disables implicit GPTQ quantization using dtype_policy setter (keras-team#21895) * Dense: validate units argument (keras-team#21902) * Pin `ai-edge-litert` version to fix CI (keras-team#21912) * Increase JAX GPU tests timeout to 2 hours (keras-team#21915) * Fix TPU tests - for splash attention (keras-team#21891) * Support various filtering functions in OpenVINO (keras-team#21836) * OpenVINO NN Module Functions (keras-team#21803) * fix XLA dynamic shape output of ops.diag (keras-team#21906) * Fix: Remove redundant epsilon in loss mask weight calculation (keras-team#21908) * Implement vander function in keras.ops (keras-team#21882) * Fix Muon optimizer with TensorFlow backend. (keras-team#21924) * OpenVino `device_scope` and data adapters tests (keras-team#21922) * Fix fake quant gradient output shape and use `jax.grad` for tests. (#… * Introduces QuantizationConfig for fine-grained quantization control (… * Extended fix OOM Issue keras-team#21634 on Keras side (keras-team#21755) * Fix ops.tile shape inference issue on TensorFlow backend (keras-team#21860) * Add adaptive pooling (1D, 2D, 3D) support across JAX, NumPy, TensorFl… * More OpenVINO Numpy Operations (keras-team#21925) * Adds Serialization Support for QuantizationConfig based quantized mod… * Refactors AbsMaxQuantizer to accept axis in __call__ (keras-team#21931) * Speed up unit tests on JAX and TensorFlow. (keras-team#21933) * update dev version number (keras-team#21921) * Always use `run_tpu_tests` label to run the TPU tests. (keras-team#21900) * Revert "Always use `run_tpu_tests` label to run the TPU tests. (keras-team#2190… * Forward-fix for JAX API changes (keras-team#21938) * Remove nightly tests with Python 3.13. (keras-team#21943) * Do no always make batch size dynamic during export. (keras-team#21944) * Fix `numpy.mean` with dynamic shape on OpenVino. (keras-team#21947) * Remove NumPy warning with NumPy >= 2. (keras-team#21949) * Always use `run_tpu_tests` label to run the TPU tests. (keras-team#21950) * [OpenVINO backend] Support np.vander, np.trapezoid, np.corrcoef, np.c… * Fixed a bug in _keras_mask (keras-team#21946) * Fix handling of symbolic Tensor in RNN (keras-team#21945) * Add example for arctanh (keras-team#21951) * Fix DoS via malicious HDF5 dataset metadata in KerasFileEditor (keras-team#21880) * Implement nextafter function in keras.ops (keras-team#21960) * fix image.extract_patches strides handling (keras-team#21959) * [OpenVINO backend] Support numpy.flip (keras-team#21963) * Bump the github-actions group with 4 updates (keras-team#21968) * Fix CUDNN flash attention for JAX > 0.6.2. (keras-team#21970) * Skip `PyDataset` tests on TPU. (keras-team#21964) * Add missing `name` to `SeedGenerator.get_config`. (keras-team#21975) * Use `subprocess.run` in `pip_build.py` to escape wheel path. (keras-team#21976) * Update dependencies and `dependabot.yml`. (keras-team#21974) * Use `kokoro:force-run` label for TPU tests too. (keras-team#21956) * Add simple example for keras.layers.Resizing (keras-team#21966) * [OpenVINO backend] Support numpy.diagonal (keras-team#21965) * Bump actions/checkout from 5.0.1 to 6.0.1 in the github-actions group… * Fix ReversibleEmbedding mask error when using reverse=True (keras-team#21961) * Update feature_space.py (keras-team#21935) * Clarify Tracker docstring wording (keras-team#21985) * Remove semi-colon after email in SECURITY.md (keras-team#21993) * Implement cbrt function for OpenVINO backend (keras-team#21987) * Fix config keys for chain depth and num chains (keras-team#21979) * Implement hypot and trace function for OpenVINO backend (keras-team#21991) * Implement ptp function in keras.ops (keras-team#21990) * Orbax Loading and Sharding Support feature (keras-team#21903) * Add usage examples to loss docstrings (keras-team#21989) * Unify extract_patches to support both 2D and 3D patches (keras-team#21980) * Fix ndim to support tf.RaggedTensor by using shape.rank (keras-team#21999) * Implement size and swapaxes function for OpenVINO backend. (keras-team#21995) * Implement kron function for OpenVINO backend (keras-team#22000) * Adds support for AWQ (keras-team#21992) * Trigger TPU tests on kokoro label removal rather than addition. (keras-team#22001) * Document complex dtype limitation in ops.correlate (keras-team#21984) * [OpenVINO backend] Fix and enable numpy.rot90 (keras-team#21967) * Only skip TPU excluded tests on TPU. (keras-team#22008) * Improvements to `JaxLayer` and `FlaxLayer` related to RNG handling an… * Fix typo in contrast adjustment method (keras-team#22012) * Fix typo and improve docstring formatting (keras-team#22017) * Implement nansum function in keras.ops (keras-team#21996) * Fix unreliable Orbax checkpoint detection with custom implementation … * Unpin as many Python packages versions as possible. (keras-team#22023) * Allow `CenterCrop` layer to handle dynamic image sizes. (keras-team#22020) * TPU tests now verify that we can detect TPUs and fails it not. (keras-team#22019) * Refactor ExtractPatches to handle both 2D and 3D (keras-team#22013) * Implement argpartition function for OpenVINO backend (keras-team#22025) * Implement logaddexp2 function for OpenVINO backend (keras-team#22026) * Implement nanmin function in keras.ops (keras-team#22040) * Increase test coverage for IntegerLookup layer (keras-team#22022) * feat: Add documentation examples for image preprocessing augmentation… * Fix: activity regularizer not normalized by batch size (keras-team#22021) * Implement ldexp and select ops for OpenVINO backend (keras-team#22042) * Fix: convert deque to list before tf.transpose in keras.ops.quantile … * Fix timedistributed mask validation (keras-team#22039) * Torch backend: allow explicit device selection and guard DirectML usa… * Implement nanmax function in keras.ops (keras-team#22043) * Add bias support for torch's `dot_product_attention`. (keras-team#22045) * Fix incorrect example in `ops.associative_scan` docstring (keras-team#22051) * Add Batch Renormalisation (keras-team#22047) * Implement round and divide_no_nan ops for OpenVINO backend (keras-team#22052) * Add dynamic shape support for torch backend export (keras-team#22041) * Implement vstack func for OpenVINO backend (keras-team#22059) * Implement ptp function for OpenVINO backend (keras-team#22060) * Implement nanmean function in keras.ops (keras-team#22055) * Do not allow external links in HDF5 files. (keras-team#22057) * Fix discretization symbolic one hot (keras-team#22048) * Implement complete Keras-Orbax checkpoint integration (keras-team#22002) * Increase test coverage for StringLookup preprocessing layer (keras-team#22056) * Set mutable to True by default in nnx_metadata (keras-team#22074) * Adds Asymmetric INT4 Sub-Channel Quantization Support (keras-team#22007) * Allow passing variables to a function with `@custom_gradient`. (keras-team#22069) * Disallow TFSMLayer deserialization in safe_mode to prevent external S… * Remove redundant global seed initialization code. (keras-team#22084) * Add `Muon` to the list of all optimizer classes. (keras-team#22083) * Implement tile function for openvino backend (keras-team#22071) * implement nansum ops for openvino backend (keras-team#22078) * Remove `testing.uses_cpu()` and re-implement for JAX. (keras-team#22087) * benchmarks: add RandomRotation tf.data performance benchmark (keras-team#21986) * Fix arctan2 NaN propagation in OpenVINO backend (keras-team#22064) * Validate positive height and width in image resize (keras-team#22079) * Don't skip some JAX linalg tests on JAX. (keras-team#22091) * Implement nanprod function in keras.ops (keras-team#22089) * Increase test coverage for TextVectorization layer (keras-team#22066) * Bump the github-actions group with 2 updates (keras-team#22093) * fix: pytorch onnx export symbolic test (keras-team#22086) * Improvements to `*_uses_gpu` and `*_uses_tpu`. (keras-team#22088) * Implement cross product operation for OpenVINO backend (keras-team#22096) * Fail fast on invalid convolution output shapes during symbolic execut… * Fix Normalization broadcasting for scalar and multidim mean and varia… * Standardize the way tests are skipped based on backend and accelerato… * Don't call `pythonify_logs` within `get_metrics_result`. (keras-team#22107) * Fix gaussian_blur padding calculation for even kernel sizes (keras-team#22054) * Adjust JAX variable initializer jitting criteria. (keras-team#22116) * Exclude conv transpose tests on TPU. (keras-team#22117) * Remove incorrect but dead code in `BaseOptimizer.stateless_apply`. (#… * Implement tensordot operation for OpenVINO backend (keras-team#22098) * Fix bounding box docstring references (keras-team#22110) * feat: add depth_to_space and space_to_depth ops (keras-team#22112) * Fix sparse reshape test with Numpy 2.4. (keras-team#22141) * Fix vocabulary reload corruption caused by trailing newline handling … * Add support for dynamic dimensions in `ops.slice.compute_output_spec`… * Revamp graph validation in `Function.__init__`. (keras-team#22153) * Fix: draw_bounding_boxes float32 to uint8 conversion (keras-team#22129) * Implement dstack function across all backends (keras-team#22120) * Add exp2 operation to OpenVINO backend (keras-team#22131) * Add trunc operation to OpenVINO backend (keras-team#22134) * Fix: add missing validation for output padding < strides (keras-team#22130) * docs: Add guide on resuming training from weight-only checkpoints (#2… * feat(openvino): upgrade opset to opset15 (keras-team#22159) * Fix order-dependent float16/bfloat16 promotion in cast_to_common_dtyp… * Fix TrackedDict constructor to support iterable (key, value) inputs (… * Implement numpy.gcd using Euclidean algorithm for OpenVINO backend (#… * [Keras 3] Refactor ExportArchive to be a dispatcher for different exp… * [Keras 3] Refactor ExportArchive to be a dispatcher for different exp…

In this PR, we have introduced three improvements to Muon:
1.In the Muon optimizer, we often designate a subset of variables to be optimized with Adam. However, since different optimizers should not be assumed to have the same weight decay parameter, we addressed this by adding an adam_weight_decay parameter.
2.The current implementation of Muon mainly references the KellerJordan version. However, the Moonlight version is now widely recognized as superior. Compared to the KellerJordan version, the Moonlight version adjusts the learning rate from max(d_out/d_in, 1)**0.5 to max(d_out, d_in) * rate. The KellerJordan version assumes that the second dimension is the output dimension and the first dimension is the input dimension. As a general-purpose optimizer, we should not make such assumptions.
Additionally, the Moonlight version allows Muon and Adam to maintain the same weight decay and learning rate. We have added an rms_rate parameter to enable this feature, with a default value of 0.2. This parameter can be disabled by setting it to None. We have also adjusted some default parameters based on the Moonlight version.
3.When we initially submitted Muon optimizer, our understanding of Muon was not deep enough. As our research progressed, we discovered that Muon was designed with the assumption that the model is a Transformer. For 3D weights, it is necessary to assume that one dimension is d_in, and the other dimensions are reshaped to d_out. However, unlike the 2D case, the 3D scenario does not always have a clear distinction between d_in and d_out. Therefore, out of caution, we only use the Adam optimizer for cases other than 2D.