ggml : add optional CPU backend context, support reusing threads, async compute #721

As recently seen in llama.cpp (https://github.com/ggerganov/llama.cpp/pull/5226), the cost of starting the CPU backend's threads is not negligible. To address this, I propose adding a new CPU context object that owns the threads and reuses them between invocations. This CPU context would also behave as an asynchronous queue, so that multiple graph evaluations could be queued on it. This would make it possible to implement pipeline parallelism across the CPU and GPU backends (ref: https://github.com/ggerganov/llama.cpp/pull/4918#issuecomment-1915609705).

Possible API:

```C
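// create a CPU context that holds n_threads threads and reuses them between invocations
ggml_compute_context_t ggml_compute_context_init(int n_threads);
// queue a graph evaluation on the context without waiting for it to complete
void ggml_graph_compute_async(ggml_compute_context_t context, struct ggml_cgraph * graph);
// wait until all graph evaluations queued on the context have completed
void ggml_synchronize(ggml_compute_context_t context);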
```
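
To make the intended semantics concrete, here is a minimal sketch of a context that keeps its threads alive and accepts queued work. It is not ggml code: every `demo_*` name, the fixed queue capacity, and the use of pthreads are assumptions made for illustration, and a plain callback stands in for "evaluate one graph". Unlike a real CPU backend, which would split a single graph across all threads, this sketch hands whole jobs to workers, so with more than one thread queued jobs may overlap.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define DEMO_QUEUE_CAP 16 /* hypothetical fixed queue capacity */

typedef void (*demo_job_fn)(void * arg); /* stand-in for "evaluate one graph" */

typedef struct demo_context {
    pthread_t     * threads;
    int             n_threads;
    pthread_mutex_t mutex;
    pthread_cond_t  cond_work;  /* a job was queued, or shutdown was requested */
    pthread_cond_t  cond_space; /* a queue slot was freed */
    pthread_cond_t  cond_idle;  /* queue is empty and no job is running */
    struct { demo_job_fn fn; void * arg; } queue[DEMO_QUEUE_CAP];
    int             head, count; /* ring buffer state */
    int             running;     /* jobs currently executing */
    int             stop;
} demo_context;

/* Worker loop: threads are created once and sleep between jobs. */
static void * demo_worker(void * p) {
    demo_context * ctx = p;
    pthread_mutex_lock(&ctx->mutex);
    for (;;) {
        while (ctx->count == 0 && !ctx->stop) {
            pthread_cond_wait(&ctx->cond_work, &ctx->mutex);
        }
        if (ctx->count == 0) break; /* stop requested and queue drained */
        demo_job_fn fn  = ctx->queue[ctx->head].fn;
        void      * arg = ctx->queue[ctx->head].arg;
        ctx->head = (ctx->head + 1) % DEMO_QUEUE_CAP;
        ctx->count--;
        ctx->running++;
        pthread_cond_signal(&ctx->cond_space);
        pthread_mutex_unlock(&ctx->mutex);
        fn(arg); /* run the job without holding the lock */
        pthread_mutex_lock(&ctx->mutex);
        ctx->running--;
        if (ctx->count == 0 && ctx->running == 0) {
            pthread_cond_broadcast(&ctx->cond_idle);
        }
    }
    pthread_mutex_unlock(&ctx->mutex);
    return NULL;
}

demo_context * demo_context_init(int n_threads) {
    assert(n_threads > 0);
    demo_context * ctx = calloc(1, sizeof(*ctx));
    assert(ctx != NULL);
    ctx->threads   = calloc((size_t) n_threads, sizeof(pthread_t));
    assert(ctx->threads != NULL);
    ctx->n_threads = n_threads;
    pthread_mutex_init(&ctx->mutex, NULL);
    pthread_cond_init(&ctx->cond_work, NULL);
    pthread_cond_init(&ctx->cond_space, NULL);
    pthread_cond_init(&ctx->cond_idle, NULL);
    for (int i = 0; i < n_threads; i++) {
        int rc = pthread_create(&ctx->threads[i], NULL, demo_worker, ctx);
        assert(rc == 0);
        (void) rc;
    }
    return ctx;
}

/* Queue a job; blocks only while the queue is full. */
void demo_compute_async(demo_context * ctx, demo_job_fn fn, void * arg) {
    pthread_mutex_lock(&ctx->mutex);
    while (ctx->count == DEMO_QUEUE_CAP) {
        pthread_cond_wait(&ctx->cond_space, &ctx->mutex);
    }
    int tail = (ctx->head + ctx->count) % DEMO_QUEUE_CAP;
    ctx->queue[tail].fn  = fn;
    ctx->queue[tail].arg = arg;
    ctx->count++;
    pthread_cond_signal(&ctx->cond_work);
    pthread_mutex_unlock(&ctx->mutex);
}

/* Wait until every queued job has finished. */
void demo_synchronize(demo_context * ctx) {
    pthread_mutex_lock(&ctx->mutex);
    while (ctx->count > 0 || ctx->running > 0) {
        pthread_cond_wait(&ctx->cond_idle, &ctx->mutex);
    }
    pthread_mutex_unlock(&ctx->mutex);
}

/* Drain remaining jobs, join the threads, and release the context. */
void demo_context_free(demo_context * ctx) {
    pthread_mutex_lock(&ctx->mutex);
    ctx->stop = 1;
    pthread_cond_broadcast(&ctx->cond_work);
    pthread_mutex_unlock(&ctx->mutex);
    for (int i = 0; i < ctx->n_threads; i++) pthread_join(ctx->threads[i], NULL);
    pthread_mutex_destroy(&ctx->mutex);
    pthread_cond_destroy(&ctx->cond_work);
    pthread_cond_destroy(&ctx->cond_space);
    pthread_cond_destroy(&ctx->cond_idle);
    free(ctx->threads);
    free(ctx);
}

/* Example job: squares the int it is given. */
void demo_square_job(void * arg) { int * v = arg; *v = (*v) * (*v); }
```

A caller would create the context once, call `demo_compute_async` for each unit of work, and call `demo_synchronize` only when the results are needed. Thread creation is paid once, in `demo_context_init`, and the caller can do other work, such as driving a GPU backend, while the queue drains.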
