@@ -2482,6 +2482,8 @@ This version of the operator has been available since version 1 of the 'com.micr
 <dd>Rotate using interleaved pattern. Default value is 0 (False).</dd>
 <dt><tt>scale</tt> : float</dt>
 <dd>Custom scale will be used if specified. Default value is 1/sqrt(head_size)</dd>
+<dt><tt>smooth_softmax</tt> : int</dt>
+<dd>Use a smooth factor in softmax.</dd>
 </dl>

 #### Inputs (7 - 9)
@@ -3022,6 +3024,8 @@ This version of the operator has been available since version 1 of the 'com.micr
 <dd>Number of top experts to select from expert pool</dd>
 <dt><tt>normalize_routing_weights</tt> : int</dt>
 <dd>Whether to normalize routing weights</dd>
+<dt><tt>use_sparse_mixer</tt> : int</dt>
+<dd>Whether to use sparse mixer</dd>
 </dl>

 #### Inputs (5 - 8)
@@ -4337,7 +4341,7 @@ This version of the operator has been available since version 1 of the 'com.micr

 ### <a name="com.microsoft.QMoE"></a><a name="com.microsoft.qmoe">**com.microsoft.QMoE**</a>

-Int4 MoE
+Quantized MoE

 #### Version

@@ -4348,10 +4352,14 @@ This version of the operator has been available since version 1 of the 'com.micr
 <dl>
 <dt><tt>activation_type</tt> : string</dt>
 <dd>Activation function to use. Choose from relu, gelu, silu and identity. Default is relu</dd>
+<dt><tt>expert_weight_bits</tt> : int</dt>
+<dd>Number of bits used in quantized weights. Default is 4 bits</dd>
 <dt><tt>k</tt> : int</dt>
 <dd>Number of top experts to select from expert pool</dd>
 <dt><tt>normalize_routing_weights</tt> : int</dt>
 <dd>Whether to normalize routing weights</dd>
+<dt><tt>use_sparse_mixer</tt> : int</dt>
+<dd>Whether to use sparse mixer</dd>
 </dl>

 #### Inputs (7 - 11)
@@ -4362,19 +4370,19 @@ This version of the operator has been available since version 1 of the 'com.micr
 <dt><tt>router_probs</tt> : T</dt>
 <dd>2D input tensor with shape (num_rows, num_experts)</dd>
 <dt><tt>fc1_experts_weights</tt> : T1</dt>
-<dd>3D input tensor with shape (num_experts, hidden_size, inter_size / 2)</dd>
+<dd>3D input tensor with shape (num_experts, hidden_size, inter_size) or (num_experts, hidden_size, inter_size / 2)</dd>
 <dt><tt>fc1_scales</tt> : T</dt>
 <dd>2D input tensor with shape (num_experts, inter_size)</dd>
 <dt><tt>fc1_experts_bias</tt> (optional) : T</dt>
 <dd>2D optional input tensor with shape (num_experts, inter_size)</dd>
 <dt><tt>fc2_experts_weights</tt> : T1</dt>
-<dd>3D input tensor with shape (num_experts, inter_size, hidden_size / 2)</dd>
+<dd>3D input tensor with shape (num_experts, inter_size, hidden_size) or (num_experts, inter_size, hidden_size / 2)</dd>
 <dt><tt>fc2_scales</tt> : T</dt>
 <dd>2D input tensor with shape (num_experts, hidden_size)</dd>
 <dt><tt>fc2_experts_bias</tt> (optional) : T</dt>
 <dd>2D optional input tensor with shape (num_experts, hidden_size)</dd>
 <dt><tt>fc3_experts_weights</tt> (optional) : T1</dt>
-<dd>3D optional input tensor with shape (num_experts, hidden_size, inter_size / 2)</dd>
+<dd>3D optional input tensor with shape (num_experts, hidden_size, inter_size) or (num_experts, hidden_size, inter_size / 2)</dd>
 <dt><tt>fc3_scales</tt> (optional) : T</dt>
 <dd>2D optional input tensor with shape (num_experts, inter_size)</dd>
 <dt><tt>fc3_experts_bias</tt> (optional) : T</dt>
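The QMoE hunks above pair the new `expert_weight_bits` attribute with two alternative weight shapes, one full-width and one with the last dimension halved. The diff does not say which shape goes with which bit width, so the sketch below assumes the natural reading: 8-bit weights occupy one byte per value, and 4-bit weights are packed two per byte. The helper names `packed_weight_shape` and `pack_int4_pairs` are hypothetical, and the low-nibble-first packing order is an assumption, not something the documentation confirms.

```python
def packed_weight_shape(num_experts, rows, cols, expert_weight_bits=4):
    """Return the assumed storage shape of a quantized expert weight tensor.

    Assumption: 8-bit weights keep the full last dimension, while 4-bit
    weights store two values per byte, halving the last dimension.
    """
    if expert_weight_bits == 8:
        return (num_experts, rows, cols)
    if expert_weight_bits == 4:
        if cols % 2 != 0:
            raise ValueError("last dimension must be even for 4-bit packing")
        return (num_experts, rows, cols // 2)
    raise ValueError("this sketch only covers 4-bit and 8-bit weights")


def pack_int4_pairs(values):
    """Pack unsigned 4-bit values two per byte.

    Assumption: the first value of each pair goes in the low nibble.
    """
    if len(values) % 2 != 0:
        raise ValueError("need an even number of 4-bit values")
    if any(v < 0 or v > 15 for v in values):
        raise ValueError("values must fit in 4 bits")
    return [lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2])]


if __name__ == "__main__":
    # fc1_experts_weights has logical shape (num_experts, hidden_size, inter_size).
    print(packed_weight_shape(8, 4096, 14336, expert_weight_bits=4))
    print(packed_weight_shape(8, 4096, 14336, expert_weight_bits=8))
    print(pack_int4_pairs([1, 2, 3, 4]))
```

Under these assumptions, the same helper covers `fc2_experts_weights` by swapping the roles of `hidden_size` and `inter_size`, which matches the two shapes listed for each weight input in the diff.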