Conversation
markjdb
left a comment
There was a problem hiding this comment.
Generally looks ok to me aside from the inline comment.
If I were trying to upstream this, I think the approach of storing 32 bytes per map entry is probably not going to fly; a lot of effort was put into making those compact. On my desktop there are 151489 map entries currently allocated, so this amounts to ~5MB of extra memory overhead.
On LP64 systems we do have a 4-byte hole after the wired_count field; I wonder if that could be repurposed as an index into a table of strings. For kernel: names this would be trivial, but we'd need a scheme for userspace to register names somehow, certainly more complex than the current patch.
| if (!vm_map_lookup_entry(map, start, &entry)) | ||
| entry = vm_map_entry_succ(entry); | ||
| for (; entry->start < end; entry = vm_map_entry_succ(entry)) | ||
| strlcpy(entry->name, name, sizeof(entry->name)); |
There was a problem hiding this comment.
Is it intentional that you don't clip VM map entries here? That is, the first map entry might contain pages before start, and the last map entry might contain pages after end, so this operation potentially tags pages outside of the specified range.
I could see some argument for doing it that way: the name is just a label and not relevant to any functionality, so maybe we should avoid clipping just to avoid overhead from creating extra vm map entries. On the other hand, it means that you're potentially mislabeling memory.
I think I'd clip the entries to ensure that the name applies only to [start, end]. In that case, vm_map_mergeable_neighbors() needs to be updated too.
There was a problem hiding this comment.
I was caught between two different approaches, really: (1) Try not to cause perturbation of underlying data structures and (2) This is a real property. I changed my mind partway through, and in particular changed the equality test to compare the names, and then failed to update this to do clipping. I'll look at fixing this.
| vm_map_lock(map); | ||
| VM_MAP_RANGE_CHECK(map, start, end); | ||
| if (!vm_map_lookup_entry(map, start, &entry)) | ||
| entry = vm_map_entry_succ(entry); |
There was a problem hiding this comment.
Do we want to raise an error if there is a hole in the specified range?
There was a problem hiding this comment.
I'd tend to say yes because that's something that can never happen on CHERI and it's best to have CHERI and non-CHERI operate similarly rather than raising confusing errors.
There was a problem hiding this comment.
On further reflection, we should pre-check that we're not spanning reservations here which also takes care of there being no holes.
I think this is closely linked to the query I posed about whether and how we want to handle namespacing. Simply taking it down to 16 or even 12 is surely not going to substantively impact any of the current names, and we could allocate a flag of some sort for 'originated in kernel or userlevel' to eliminate the use of a 'kernel:' prefix. However, as you say, it's offering userspace flexibility in naming that creates more of a problem, and could easily lead to annoying policy questions if nothing else. And I think if one starts to squeeze the names to 8 or below, one loses at least readability and likely also information. But I wonder if 8 bytes makes this feel a lot more palatable, or still feels too heavyweight. Perhaps if all we can easily get is 4 bytes, we are forced down the path of a string table, which raises lots of annoying questions. Regardless of the details there: I think ideally we'd produce something upstreamable, since in fact I personally have wanted this information for several projects since first writing procstat, I've just never quite gotten around to looking at it :-). |
brooksdavis
left a comment
There was a problem hiding this comment.
A few thoughts:
What should happen when two adjacent parts of a reservation are unmapped but have different names? If they don't merge the reservation will never transition to quarantined under the current logic so the mapping with never be revoked.
Assuming the above is resolved, what should happen if two adjacent quarantined entries have different names? Not merging them remains correct so maybe it's fine, but I'd have to give the code a hard look. We probably at least want to allow merging of neighbors once we've picked an entry to revoke.
| vm_map_lock(map); | ||
| VM_MAP_RANGE_CHECK(map, start, end); | ||
| if (!vm_map_lookup_entry(map, start, &entry)) | ||
| entry = vm_map_entry_succ(entry); |
There was a problem hiding this comment.
I'd tend to say yes because that's something that can never happen on CHERI and it's best to have CHERI and non-CHERI operate similarly rather than raising confusing errors.
sys/kern/syscalls.master
Outdated
| size_t size | ||
| ); | ||
| } | ||
| 592 AUE_MSETNAME STD|CAPENABLED { |
There was a problem hiding this comment.
This syscall number can't be used, it'll conflict with upstream, which is already up to 599(!).
I see that the range 258-271 is reserved upstream and CheriBSD uses a few numbers in that range; perhaps we should take a number from there instead?
brooksdavis
left a comment
There was a problem hiding this comment.
vm_map_msetname still needs needs clipping and should be limited to a single reservation.
I'm not sure how the latest libc wrapper actually works.
sys/vm/vm_map.c
Outdated
| #if 0 | ||
| else if (new_entry->object.vm_object == NULL && | ||
| (new_entry->eflags & MAP_ENTRY_GUARD) != 0) | ||
| strlcpy(new_entry->name, "kernel:vm_guard", | ||
| sizeof(new_entry->name)); | ||
| else if (new_entry->object.vm_object == NULL && | ||
| (new_entry->eflags & MAP_ENTRY_STACK_GAP) != 0) | ||
| strlcpy(new_entry->name, "kernel:vm_stackgap", | ||
| sizeof(new_entry->name)); | ||
| else if (new_entry->object.vm_object == NULL && | ||
| (new_entry->eflags & MAP_ENTRY_GROWS_DOWN) != 0) | ||
| strlcpy(new_entry->name, "kernel:vm_stack", | ||
| sizeof(new_entry->name)); | ||
| #endif |
| vm_map_lock(map); | ||
| VM_MAP_RANGE_CHECK(map, start, end); | ||
| if (!vm_map_lookup_entry(map, start, &entry)) | ||
| entry = vm_map_entry_succ(entry); |
There was a problem hiding this comment.
On further reflection, we should pre-check that we're not spanning reservations here which also takes care of there being no holes.
| # Also use all the syscall .o files from libsys_pic (libsys is always NO_SSP): | ||
| _libsys_other_objects= fstat fstatat fstatfs syscall \ | ||
| cerror geteuid getegid sigfastblock munmap mprotect \ | ||
| cerror geteuid getegid sigfastblock munmap mprotect msetname \ |
There was a problem hiding this comment.
I think you might also need _msetname for the __sys_msetname definition?
| os_pages_unmap(result, size); | ||
| return true; | ||
| } | ||
| (void)msetname(result, size, "jemalloc:pages_commit_impl"); |
There was a problem hiding this comment.
This seems fine to start, but I wonder if we want to be capturing the "committed" state of pages.
sys/vm/vm_map.c
Outdated
| FEATURE(vm_msetname, "Kernel support for msetname(2)"); | ||
|
|
There was a problem hiding this comment.
I think you've moved on from using this and maybe should drop this commit.
Co-authored-by: Brooks Davis <brooks@one-eyed-alien.net>
Co-authored-by: Brooks Davis <brooks@one-eyed-alien.net>
Co-authored-by: Brooks Davis <brooks@one-eyed-alien.net>
Co-authored-by: Brooks Davis <brooks@one-eyed-alien.net>
Co-authored-by: Brooks Davis <brooks@one-eyed-alien.net>
Co-authored-by: Brooks Davis <brooks@one-eyed-alien.net>
This change adds a new field
nametostruct vm_map_entrythat contains an optional name for anonymous memory mappings that can be specified by kernel KPIs, such as a newvm_map_insert_name(9), and also from userspace with a new system callmsetname(9). It also adds calls tomsetname()in various key bits of userlevel such asrtld,libc, andlibthrto label various anonymous memory mappings such as those used for BSS, user-defined stack mappings, heap mappings, and so on.This prototype has a number of limitations including:
msetname(2)is the right API.vm_map_fixed(9)rather than add new_name()variants.msetname(2). Currently I'm using a subsystem:instance tuple, but perhaps there's something better.Typical output as below: