fix(stmt-html): Fix embedded Buffer processing performance issue. #8748
mcourteaux merged 4 commits into halide:main from
Conversation
742b880 to 58cf410
src/StmtToHTML.cpp
Outdated
```cpp
asm_stream << line << "\n";
if (line.length() > 500) {
    // Very long lines in the assembly are typically the _gpu_kernel_sources
    // or other buffers (such as model weights) as a raw ASCII block in the
```
I would say something more realistic like "static LUTs" here... model weights are so large, they should be ImageParams.
rtzam left a comment
Just tested locally on the big LLM test case. The speedup is from "practically non-terminating" to ~40 seconds for a 1 layer model. Huge improvements! Thanks for the change.
```cpp
    }
}

start = end + 1;
```
The fact that start can ever point past the end of the string_view made me uneasy. I can't wait for std::views::split to be available (C++23).
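For context, the manual splitting pattern under discussion looks roughly like this. This is a minimal illustrative sketch, not the actual StmtToHTML code; `split_lines` and the exact loop shape are made up for illustration:

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Illustrative sketch of manual line splitting over a string_view.
// After consuming the last line, `start` can become code.size() + 1;
// the loop condition is what keeps it from ever being dereferenced.
std::vector<std::string_view> split_lines(std::string_view code) {
    std::vector<std::string_view> lines;
    std::size_t start = 0;
    while (start <= code.size()) {
        std::size_t end = code.find('\n', start);
        if (end == std::string_view::npos) {
            end = code.size();
        }
        lines.push_back(code.substr(start, end - start));
        start = end + 1;  // may point one past the end; guarded by the loop condition
    }
    return lines;
}
```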
Here's my take on a C++23 version, for posterity.
```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <ranges>
#include <regex>
#include <string_view>

void generate(std::string_view code, std::map<uint64_t, std::regex> &markers,
              std::map<uint64_t, int> &lnos) {
    static constexpr std::string_view marker_prefix = "%\"";
    std::size_t lno = 1;
    for (auto &&chunk : std::views::split(code, '\n')) {
        std::string_view line(chunk.begin(), chunk.end());
        if (line.contains(marker_prefix)) {
            std::erase_if(markers, [&](const auto &kv) {
                const auto &[node, regex] = kv;
                if (std::regex_search(line.begin(), line.end(), regex)) {
                    lnos[node] = lno;
                    return true;
                }
                return false;
            });
        }
        lno++;
    }
}
```
Well the loop condition should put you at ease. But indeed, a well-tested standard function would be nice.
There was a problem hiding this comment.
> Well the loop condition should put you at ease.

It does 🙂
Do you know where the remaining 40 seconds are spent? More stuff to blame in StmtToHTML? I suspect loading the asm file can be slow.
So I've re-profiled and the remaining runtime is:

(profiler breakdown omitted)

"Duplicating the IR" refers to the Module clone, right? So nothing I want to do there. 20% of 40s is still 8s. Eight seconds for loading a text file... Let me try to optimize this either way. This is still a bit too ridiculous to my taste 😝
… HTML. Move-optimized replace_all: allow for no-copy execution when target string is not found.
I pushed another patch to make loading faster. This should pretty much eliminate the loading time. Curious to see what the result is if you feel like profiling it again. 😄
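For illustration, the classic way to eliminate per-line loading overhead is to slurp the whole file in one pass. This is a hedged sketch, not the actual patch; `load_file` is a hypothetical name:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical sketch: read the whole .s file in one bulk copy instead of
// appending it line by line, avoiding repeated reallocations of the result.
std::string load_file(const std::string &path) {
    std::ifstream f(path, std::ios::binary);
    std::ostringstream ss;
    ss << f.rdbuf();  // single bulk copy of the stream buffer
    return ss.str();
}
```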
I'm measuring a difference in the generation of HTML (~10% faster). Nice!!
I'm looking at LLVM's libcxx implementation of `basic_string::append`:

```cpp
basic_string<_CharT, _Traits, _Allocator>::append(_ForwardIterator __first, _ForwardIterator __last) {
  size_type __sz = size();
  size_type __cap = capacity();
  size_type __n = static_cast<size_type>(std::distance(__first, __last));
  if (__n) {
    if (__string_is_trivial_iterator<_ForwardIterator>::value && !__addr_in_range(*__first)) {
      if (__cap - __sz < __n)
        __grow_by_without_replace(__cap, __sz + __n - __cap, __sz, __sz, 0);
      __annotate_increase(__n);
      auto __end = __copy_non_overlapping_range(__first, __last, std::__to_address(__get_pointer() + __sz));
      traits_type::assign(*__end, value_type());
      __set_size(__sz + __n);
    } else {
      const basic_string __temp(__first, __last, __alloc());  //// <<< HERE!
      append(__temp.data(), __temp.size());
    }
  }
  return *this;
}
```

What a waste... It allocates the temporary `basic_string` just to append it.
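For what it's worth, one way to stay off that slow else-branch is to append through the `(pointer, size)` overload, which copies directly without constructing a temporary. A small sketch, not code from the PR; `append_line` is a made-up helper:

```cpp
#include <string>
#include <string_view>

// Appending a string_view via the (pointer, size) overload of
// std::string::append copies directly into `out`; no temporary
// basic_string is constructed along the way.
void append_line(std::string &out, std::string_view line) {
    out.append(line.data(), line.size());
    out.push_back('\n');
}
```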
Initial changes to address #8717.
This does not address the Module deep-copy behavior of copying Buffer contents; it merely addresses the string processing and avoids some copies.
Fixes #8752
Drive-by optimization: change the signature of replace_all() to take the subject string by value. This allows chaining replace_all() calls without making copies when the target string is not found in the subject string.
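A sketch of what the by-value chaining pattern looks like. The `replace_all` below is a stand-in with the same shape as the utility described here, not the exact Halide implementation, and `sanitize` is a made-up caller:

```cpp
#include <cstddef>
#include <string>

// Stand-in for the by-value replace_all(): when `find` does not occur,
// the subject just moves through unchanged, with no copy made.
std::string replace_all(std::string subject, const std::string &find,
                        const std::string &replace) {
    std::size_t pos = 0;
    while ((pos = subject.find(find, pos)) != std::string::npos) {
        subject.replace(pos, find.size(), replace);
        pos += replace.size();  // skip past the replacement to avoid re-matching
    }
    return subject;
}

// Chained replacements: each stage takes the previous result by value,
// so a stage whose target is absent simply moves the string along.
std::string sanitize(std::string s) {
    return replace_all(replace_all(std::move(s), "<", "&lt;"), ">", "&gt;");
}
```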