As explained in the basic flow of primitive execution for dynamic shape in the overall flow for dynamic shape, several preprocessing steps are performed before arguments are set to the kernel and the selected impl is executed.
- `update_shape` - when the input shape changes, calculates the output shape and performs shape inference so that the shape is propagated to the next node.
- `update_impl` - depending on the changed shape, a `primitive_impl` is retrieved from the in-memory cache or a new impl is selected.
- `realloc_if_needed` - allocates new output memory if necessary.
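As a rough illustration, the way each step either does real work or takes a cheap path per execution can be sketched as follows (a simplified sketch with hypothetical names and flags, not the actual plugin code):

```cpp
#include <string>
#include <vector>

// Hypothetical sketch of the preprocessing pipeline described above. Each step
// reports whether it did real work for this execution; the names are
// simplified stand-ins for the real primitive_inst methods.
std::vector<std::string> preprocess(bool input_shape_changed, bool impl_cached,
                                    bool output_fits) {
    std::vector<std::string> log;
    // update_shape: only recomputes when the input shape actually changed.
    log.push_back(input_shape_changed ? "update_shape: infer new output shape"
                                      : "update_shape: skipped");
    // update_impl: reuse a cached impl for this shape or select a new one.
    log.push_back(impl_cached ? "update_impl: cache hit"
                              : "update_impl: select new impl");
    // realloc_if_needed: reuse the existing buffer when it is large enough.
    log.push_back(output_fits ? "realloc_if_needed: reuse buffer"
                              : "realloc_if_needed: allocate new buffer");
    return log;
}
```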
The following is a description of some of the representative preprocessing steps for dynamic shape execution.
To support dynamic shape in the GPU plugin, `cldnn::layout` uses `ov::PartialShape` to express shapes. While the existing `cldnn::tensor` does not support dynamic shapes and has limitations on rank, `ov::PartialShape` supports both static and dynamic dimensions and has no rank limitation. When creating a `cldnn::primitive` from an `ov::op`, the `ov::PartialShape` that the `ov::op` already holds is used directly.
Note: In the execution flow for the existing static shape in the GPU plugin, the shape of an `ov::op` may be transformed into a `cldnn::tensor` and used, so when creating a `cldnn::primitive` from an `ov::op`, this path is separated from the dynamic shape execution flow. When building a `cldnn::program`, if there is at least one dynamic node among the nodes, the `ov::intel_gpu::allow_new_shape_infer` property is set (link), and the execution of static shape and dynamic shape is separated through this property during `cldnn::primitive` creation. The two paths will be integrated in the future when the GPU plugin fully supports dynamic shape.
When the input shape of the model changes, each primitive checks whether its input shape has changed and updates it accordingly; the output shape is then calculated from the input shape and propagated to the next primitive during the shape inference stage.
The details of how shape inference is executed through `primitive_inst::update_shape()` when a primitive is executed in the GPU plugin for dynamic shape are as follows:
- In the basic flow that executes a primitive, there is a runtime optimization stage (i.e. `primitive_inst::do_runtime_in_place_concat` (link)) that runs before `update_shape()`. At that point, if `update_shape()` has already been executed by another primitive, `update_shape_done_by_other` is set to TRUE. Therefore, if `update_shape_done_by_other` is TRUE, `update_shape()` is skipped. (link)
- First, the output layouts of `kernel_impl_params` from the dependencies of `primitive_inst` are compared with the input layouts of `kernel_impl_params` of the current primitive. If they have changed, the changed shape is written to the input layouts of `kernel_impl_params`. (link)
- `_shape_changed` is set to TRUE if the input shape has changed. (link)
- If the current node is `shape_of` and the input shape has not changed, `_shape_changed` is reset to FALSE and `update_shape()` is skipped. (link)
- If the current node is in a `shape_of` subgraph, the dependent `shape_of` primitives are checked and `update_shape()` is skipped if the shape has not changed. (link)
- `update_shape()` is skipped if all of the following conditions hold: the input shape has not changed, the node does not generate a dynamic output (e.g. `NonZero`, `Unique`), and the output layouts of `kernel_impl_params` are already static. (link)
- In static shape execution, data for additional inputs that determine the output shape is set as attributes when creating a `cldnn::primitive`. In dynamic shape execution, if that data is stored in the output memory of a preceding node, execution waits until those dependent nodes complete. To determine which input nodes have memory dependencies, most `program_node`s define `get_shape_infer_dependencies()`. The dependency information (index and memory for each dependent input node) is collected from the current node and stored in a map, and the corresponding primitive events are added to an event list to await completion. Finally, the populated map is saved in `memory_deps` of `kernel_impl_params`. (link)
- There are two APIs for output shape calculation on `program_node`: `calc_output_layout()` for static shape execution and `calc_output_layouts()` for dynamic shape execution. In this step, `calc_output_layouts()` is called, which invokes the `shape_infer()` API of the `ov::op` with the updated input layouts from `kernel_impl_params`, the primitive's attributes, and `memory_deps`, and returns the output layouts as a vector. The newly calculated output layout is then written back to `output_layouts` in `kernel_impl_params`. (link)

```cpp
struct program_node {
    ...
public:
    layout calc_output_layout() const;
    std::vector<layout> calc_output_layouts() const;
};
```
- If there is a fused operation in `kernel_impl_params`, the output layout of its descriptor is also updated with the `ov::PartialShape` of the updated output layout. (link)
If a `primitive_impl` is created or updated through `update_impl()` and the node is a weightable node (e.g. convolution, deconvolution, fully connected), the weights should be reordered to the layout required by the kernel as needed. The following describes the processes performed in `update_weights()`:
- If the impl is nullptr or the current node is not a weightable node, `update_weights()` is skipped. (link)
- Reorder kernel params (i.e. `kernel_impl_params` for weights reorder) are created from the `WeightsReorderParams` of `primitive_inst`. (link)
- If weights reorder is not necessary but the weights were previously reordered, an incorrect memory buffer would be used, so the reordered weights cache is reset to the original weights memory layout. (link)
- If weights reorder is necessary, the weight layout of `kernel_impl_params` is updated to the output layout of the reorder kernel params. This is the expected layout. (link)
  - If the expected layout hits the reordered weights cache, the cached weights are reused.
  - If the expected layout is compatible with the original layout, the original weights memory is reinterpreted and added to the reordered weights cache without any reordering.
  - If the expected layout misses the reordered weights cache, a cached reorder impl is retrieved from the implementations cache using the reorder kernel params, or a new reorder impl is created through `WeightsReordersFactory` and its compiled kernel is set, and the impl is added to the implementations cache. Then it is checked whether a weights memory buffer in the reordered weights cache can be reused; if so it is reused, otherwise a new buffer is allocated, and the reordered weights cache is updated accordingly. Finally, `kernel_arguments_data()` is used to set the kernel arguments on the reorder impl and the kernel is executed.
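The three-way decision above can be sketched as follows (a hypothetical sketch: layouts are modeled as plain strings, the cache as a set, and exact string equality stands in for the real layout-compatibility check):

```cpp
#include <set>
#include <string>

// Hypothetical sketch of the decision made in update_weights(). The real code
// works on cldnn layouts, WeightsReorderParams, and the implementations cache.
enum class WeightsAction {
    ReuseCached,          // expected layout hits the reordered weights cache
    ReinterpretOriginal,  // compatible layout: reinterpret, no reorder kernel
    RunReorderKernel      // cache miss: get/create a reorder impl and run it
};

WeightsAction resolve_weights(const std::string& expected,
                              const std::string& original,
                              const std::set<std::string>& reordered_cache) {
    if (reordered_cache.count(expected))
        return WeightsAction::ReuseCached;
    if (expected == original)  // simplified stand-in for "compatible"
        return WeightsAction::ReinterpretOriginal;
    return WeightsAction::RunReorderKernel;
}
```

Only the last branch costs a kernel launch; the first two are the fast paths that make repeated executions with recurring shapes cheap.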
In static shape execution, output memory is allocated when the `primitive_inst` is created, but in dynamic shape execution, output memory is allocated just before kernel arguments are set and the kernel is executed. The following describes the processes performed in `realloc_if_needed()`:
- If the current node has a single user which is a `concat` whose `can_be_optimized()` is TRUE and whose `allocation_done_by_other` is FALSE (i.e. not yet allocated by another node), the `concat`'s `realloc_if_needed()` is executed and `allocation_done_by_other` is set to TRUE. The `concat`'s output memory is then used as the output memory of the current node, and `realloc_if_needed()` is skipped. (link)
- For better performance, if a fake aligned shape is used when executing the kernel (e.g. `fully_connected`), the input and output shapes of `kernel_impl_params` are updated accordingly. A more detailed explanation will be added as a separate section later (TBD). (link)
- If the node is an `input_layout`, `realloc_if_needed()` is skipped because it is assumed to always use external memory. (link)
- It is checked whether output memory is already allocated and the requested buffer size is no larger than the current buffer size; the result is stored in `can_reuse_buffer`. (link)
- If the current node is a `concat` and both `can_be_optimized()` and `allocation_done_by_other` are TRUE, `realloc_if_needed()` is skipped. (link)
- `ShapePredictor` predicts a preallocation shape from the current shape and data type, and the output layout shape of `kernel_impl_params` is updated accordingly. A more detailed explanation will be added as a separate section later (TBD). (link)
- If `can_reuse_buffer` is TRUE, the `reused` flag of the output memory is set to TRUE and the output memory is updated with the reinterpreted buffer. (link)
- If `can_reuse_buffer` is FALSE, the output memory is reallocated with `allocate_outputs()` and `max_output_layout_size` is updated. (link)
- Internal buffer layouts are obtained from the current `primitive_impl`. (link)
  - If previously allocated intermediate memory can be reused, the intermediate memory is updated with the reinterpreted buffer.
  - If it cannot be reused, a new buffer is allocated through `allocate_internal_buffer()` to update the existing intermediate memory or add a new one.
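The reuse-or-reallocate decision at the heart of these steps can be sketched as follows (a hypothetical sketch: `capacity` plays the role of `max_output_layout_size`, and `predicted_bytes` stands in for a `ShapePredictor`-style preallocation size):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch of the buffer-reuse decision in realloc_if_needed().
struct OutputBufferSketch {
    size_t capacity = 0;  // bytes currently allocated
    bool reused = false;
};

// Returns true when the existing allocation is reused (reinterpreted), false
// when a new, possibly preallocated-ahead buffer is "allocated".
bool realloc_if_needed_sketch(OutputBufferSketch& buf,
                              size_t required_bytes,
                              size_t predicted_bytes) {
    const bool can_reuse_buffer =
        buf.capacity != 0 && required_bytes <= buf.capacity;
    if (can_reuse_buffer) {
        buf.reused = true;  // reinterpret the existing buffer, no allocation
        return true;
    }
    // Allocate ahead of the current requirement so that moderately growing
    // shapes on later iterations hit the reuse path instead.
    buf.capacity = std::max(required_bytes, predicted_bytes);
    buf.reused = false;
    return false;
}
```

Preallocating beyond the immediately required size is what turns a sequence of slowly growing shapes into mostly reuse-path executions rather than a reallocation per inference.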