fix(stmt-html): Fix embedded Buffer processing performance issue. #8748
mcourteaux merged 4 commits into halide:main from
Conversation
742b880 to 58cf410
src/StmtToHTML.cpp
Outdated
```cpp
asm_stream << line << "\n";
if (line.length() > 500) {
    // Very long lines in the assembly are typically the _gpu_kernel_sources
    // or other buffers (such as model weights) as a raw ASCII block in the
```
I would say something more realistic like "static LUTs" here... model weights are so large, they should be ImageParams.
rtzam left a comment
Just tested locally on the big LLM test case. The speedup is from "practically non-terminating" to ~40 seconds for a 1 layer model. Huge improvements! Thanks for the change.
```cpp
    }
}

start = end + 1;
```
The fact that start can ever point past the end of the string_view made me uneasy. I can't wait for std::views::split to be available (C++23).
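For context, the manual splitting pattern under discussion looks roughly like this. This is a minimal illustrative sketch, not the actual StmtToHTML code; `split_lines` and the exact loop shape are made up for illustration:

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Illustrative sketch of manual line splitting over a string_view.
// After consuming the last line, `start` can become code.size() + 1;
// the loop condition is what keeps it from ever being dereferenced.
std::vector<std::string_view> split_lines(std::string_view code) {
    std::vector<std::string_view> lines;
    std::size_t start = 0;
    while (start <= code.size()) {
        std::size_t end = code.find('\n', start);
        if (end == std::string_view::npos) {
            end = code.size();
        }
        lines.push_back(code.substr(start, end - start));
        start = end + 1;  // may point one past the end; guarded by the loop condition
    }
    return lines;
}
```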
Here's my take on a C++23 version, for posterity.
```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <ranges>
#include <regex>
#include <string_view>

void generate(std::string_view code, std::map<uint64_t, std::regex> &markers,
              std::map<uint64_t, int> &lnos) {
    static constexpr std::string_view marker_prefix = "%\"";
    std::size_t lno = 1;
    for (auto &&chunk : std::views::split(code, '\n')) {
        std::string_view line(chunk.begin(), chunk.end());
        if (line.contains(marker_prefix)) {
            std::erase_if(markers, [&](const auto &kv) {
                const auto &[node, regex] = kv;
                if (std::regex_search(line.begin(), line.end(), regex)) {
                    lnos[node] = lno;
                    return true;
                }
                return false;
            });
        }
        lno++;
    }
}
```
Well the loop condition should put you at ease. But indeed, a well-tested standard function would be nice.
There was a problem hiding this comment.
> Well the loop condition should put you at ease.

It does 🙂
Do you know where the remaining 40 seconds are spent? More stuff to blame in StmtToHTML? I suspect loading the asm file can be slow.
So I've re-profiled and the remaining runtime is:

(profiler breakdown omitted)

"Duplicating the IR" refers to the Module clone, right? So nothing I want to do there. 20% of 40s is still 8s. Eight seconds for loading a text file... Let me try to optimize this either way. This is still a bit too ridiculous to my taste 😝
… HTML. Move-optimized replace_all: allow for no-copy execution when target string is not found.
I pushed another patch to make loading faster. This should pretty much eliminate the loading time. Curious to see what the result is if you feel like profiling it again. 😄
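For illustration, the classic way to eliminate per-line loading overhead is to slurp the whole file in one pass. This is a hedged sketch, not the actual patch; `load_file` is a hypothetical name:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical sketch: read the whole .s file in one bulk copy instead of
// appending it line by line, avoiding repeated reallocations of the result.
std::string load_file(const std::string &path) {
    std::ifstream f(path, std::ios::binary);
    std::ostringstream ss;
    ss << f.rdbuf();  // single bulk copy of the stream buffer
    return ss.str();
}
```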
I'm measuring a difference in the generation of HTML (~10% faster). Nice!!
I'm looking at LLVM's libcxx implementation of `basic_string::append`:

```cpp
basic_string<_CharT, _Traits, _Allocator>::append(_ForwardIterator __first, _ForwardIterator __last) {
  size_type __sz = size();
  size_type __cap = capacity();
  size_type __n = static_cast<size_type>(std::distance(__first, __last));
  if (__n) {
    if (__string_is_trivial_iterator<_ForwardIterator>::value && !__addr_in_range(*__first)) {
      if (__cap - __sz < __n)
        __grow_by_without_replace(__cap, __sz + __n - __cap, __sz, __sz, 0);
      __annotate_increase(__n);
      auto __end = __copy_non_overlapping_range(__first, __last, std::__to_address(__get_pointer() + __sz));
      traits_type::assign(*__end, value_type());
      __set_size(__sz + __n);
    } else {
      const basic_string __temp(__first, __last, __alloc());  //// <<< HERE!
      append(__temp.data(), __temp.size());
    }
  }
  return *this;
}
```

What a waste... It allocates the temporary `basic_string` just to append it.
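For what it's worth, one way to stay off that slow else-branch is to append through the `(pointer, size)` overload, which copies directly without constructing a temporary. A small sketch, not code from the PR; `append_line` is a made-up helper:

```cpp
#include <string>
#include <string_view>

// Appending a string_view via the (pointer, size) overload of
// std::string::append copies directly into `out`; no temporary
// basic_string is constructed along the way.
void append_line(std::string &out, std::string_view line) {
    out.append(line.data(), line.size());
    out.push_back('\n');
}
```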
Initial changes to address #8717.
This does not address the Module deep-copy behavior of copying Buffer contents; it merely addresses the string processing and avoids some copies.
Fixes #8752
Drive-by optimization: change the signature of replace_all() to take the subject string by value. This allows chaining replace_all() calls without making copies when the target string is not found in the subject string.
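A sketch of what the by-value chaining pattern looks like. The `replace_all` below is a stand-in with the same shape as the utility described here, not the exact Halide implementation, and `sanitize` is a made-up caller:

```cpp
#include <cstddef>
#include <string>

// Stand-in for the by-value replace_all(): when `find` does not occur,
// the subject just moves through unchanged, with no copy made.
std::string replace_all(std::string subject, const std::string &find,
                        const std::string &replace) {
    std::size_t pos = 0;
    while ((pos = subject.find(find, pos)) != std::string::npos) {
        subject.replace(pos, find.size(), replace);
        pos += replace.size();  // skip past the replacement to avoid re-matching
    }
    return subject;
}

// Chained replacements: each stage takes the previous result by value,
// so a stage whose target is absent simply moves the string along.
std::string sanitize(std::string s) {
    return replace_all(replace_all(std::move(s), "<", "&lt;"), ">", "&gt;");
}
```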