Skip to content

Specify base-10 for int() for known strings and bytes#1397

Closed
inqueue wants to merge 1 commit intoelastic:masterfrom
inqueue:1379-base-int
Closed

Specify base-10 for int() for known strings and bytes#1397
inqueue wants to merge 1 commit intoelastic:masterfrom
inqueue:1379-base-int

Conversation

@inqueue
Copy link
Copy Markdown
Member

@inqueue inqueue commented Dec 8, 2021

With this commit, we explicitly specify base-10 for string and bytes int() conversions. Specifying the base is known to boost performance for int() conversions as mentioned in #1379. This optimization can only be
applied when we know the input to be a string or bytes.

While I would have liked to have added the optimization to more code paths, there is a limitation when we cannot guarantee non-numeric inputs. Python will throw a TypeError when attempting to convert a non-string with an explicit base, therefore we only designate base-10 when we know for certain the input will be a non-numeric type. A full review of every invocation of int() in the code base revealed we cannot always make that guarantee.

Closes #1379

@inqueue inqueue added the enhancement Improves the status quo label Dec 8, 2021
@inqueue inqueue added this to the 2.3.1 milestone Dec 8, 2021
@inqueue inqueue self-assigned this Dec 8, 2021
Copy link
Copy Markdown
Contributor

@dliappis dliappis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for raising this PR and going through the code base for potential improvements based on #1379.

Looking at the files here, it seems to me that none of these uses are related to runners or operations in the hot code path that can potentially cause loaddriver bottlenecks; they are rather used in setup tasks, before the benchmark starts, like downloading the corpora, configuring/starting elasticsearch and parsing cli arguments.

Therefore, I don't think they add value. If we don't have any uses of int() in hot code paths, I am leaning towards closing this PR and the corresponding issue. WDYT?

Copy link
Copy Markdown
Member

@pquentin pquentin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit on the fence here. Explicit is better than implicit, but this is also more likely to raise questions than anything else, and the performance benefit is zero: I'd lean towards rejecting this.

By the way, Jason, do you have examples of cases where we don't always send strings to int()?

@dliappis
Copy link
Copy Markdown
Contributor

dliappis commented Dec 9, 2021

It occurred to me that we could at least document this best practice (among others in the future perhaps) in a section that is linked in all the warnings callouts we currently have for custom parameter sources, custom runners and custom schedulers. WDYT?

@michaelbaamonde
Copy link
Copy Markdown
Contributor

we could at least document this best practice (among others in the future perhaps) in a section that is linked in all the warnings callouts

Take all of my +1s @dliappis.

Expanding this a bit, I'd love a section in the docs that explains what we mean by the "hot path"/"performance-critical code-path", but also more generally a high-level overview of why this is so important in the first place. Reference material on client-side bottlenecks would be great: why they're so crucial to avoid, how they manifest themselves, what we do in Rally to avoid them, etc. You can piece some of this together with what's documented already, but a conceptual overview would help a lot.

@inqueue
Copy link
Copy Markdown
Member Author

inqueue commented Dec 10, 2021

at least document this best practice (among others in the future perhaps) in a section that is linked in all the warnings callouts

I like the idea of a dedicated section explaining the hot path which outlines best practices around it. While I appreciate the warning to be careful, it would be useful to know how and why.

By the way, Jason, do you have examples of cases where we don't always send strings to int()?

Input parameters could be a string from CLI arguments or read as a string or integer from JSON:

total_pages = sys.maxsize if params.get("pages") == "all" else int(mandatory(params, "pages", self))

Input is read from rally.ini (string), but the default value is an integer:

downsample_factor = int(self.config.opts("reporting", "metrics.request.downsample.factor", mandatory=False, default_value=1))

self.sample_queue_size = int(self.config.opts("reporting", "sample.queue.size", mandatory=False, default_value=1 << 20))

I agree we are not gaining any value from this change. For a few cases, there may be some refactoring we could do to avoid int() entirely and/or set the base explicitly when we know we have the string (CLI args, rally.ini). My vote is to leave things how they are for now. WDYT?

@pquentin
Copy link
Copy Markdown
Member

I like the idea of a dedicated section explaining the hot path which outlines best practices around it. While I appreciate the warning to be careful, it would be useful to know how and why.

Would you like to work on this?

Input parameters could be a string from CLI arguments or read as a string or integer from JSON. [...] My vote is to leave things how they are for now. WDYT?

Thanks for capturing the issue. The reason I asked is that mypy won't be happy about it eventually, so we might as well fix if by making sure a given value is always a string or always an int. But leavings things as they are currently is also fine, and in any case we all agree that we can close this pull request.

@inqueue
Copy link
Copy Markdown
Member Author

inqueue commented Dec 15, 2021

Would you like to work on this?

Sure, I will work on the docs section about the hot path. Closing this PR.

@inqueue inqueue closed this Dec 15, 2021
@inqueue inqueue added :Docs Changes to the documentation and removed :Docs Changes to the documentation labels Jan 18, 2022
@inqueue inqueue removed this from the 2.3.1 milestone Jan 20, 2022
@pquentin
Copy link
Copy Markdown
Member

I opened a new issue to discuss the docs improvement: #1421

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Improves the status quo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve performance by explicitly specifying the base in int() conversions

4 participants