Retry incomplete HTTP downloads #1504

@pquentin

We have an internal track that downloads many files individually with esrally.track.loader.Downloader.download, which eventually calls esrally.utils.net.download, which in turn uses urllib3 to fetch the data. This download function checks that the number of bytes received matches the Content-Length header. If it does not, it simply fails:

esrally.exceptions.DataError: Download of [~/.rally/benchmarks/data/solutions/logs/system-syslog-logs/document-50.json.bz2] is corrupt. Downloaded [1866599] bytes but [26703576] bytes are expected. Please retry.     
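
For reference, the size check described above amounts to something like the following sketch. The function name verify_download_size and the local DataError class are hypothetical stand-ins for illustration; the actual code in esrally.utils.net may be structured differently.

```python
import os


class DataError(Exception):
    """Stand-in for esrally.exceptions.DataError (hypothetical, for illustration)."""


def verify_download_size(local_path, expected_size):
    """Compare the size of the file on disk with the expected byte count.

    expected_size is the value of the Content-Length header, or None if the
    server did not send one (in which case nothing can be checked).
    """
    actual_size = os.path.getsize(local_path)
    if expected_size is not None and actual_size != expected_size:
        raise DataError(
            f"Download of [{local_path}] is corrupt. Downloaded [{actual_size}] bytes "
            f"but [{expected_size}] bytes are expected. Please retry."
        )
    return actual_size
```

A check like this only detects the truncation after the fact; it does not tell the HTTP layer that anything went wrong.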

In that instance, the data was downloaded from https://rally-tracks.elastic.co/observability/logging/system/infra-stats/system.syslog/raw/document-50.json.bz2. This is a proxy maintained by Elastic, and apparently sometimes it serves us incomplete results.

But why isn't urllib3 covering this for us? https://blog.petrzemek.net/2018/04/22/on-incomplete-http-reads-and-the-requests-library-in-python/ has all the details. The 3.0 branch of requests has died since then, but thankfully we use urllib3 directly, and this post taught me that there is an undocumented flag in urllib3 that covers our use case: urllib3/urllib3#949. (It will become the default in urllib3 v2.)

So it appears that simply setting enforce_content_length to True and removing the custom checks will fix our issue.
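
A minimal sketch of what that change could look like, assuming a streaming download through a urllib3 PoolManager. The helper names stream_to_file and download, the chunk size, and the cleanup of the partial file are assumptions for illustration, not Rally's actual code. With the flag set, a body shorter than Content-Length should surface as urllib3.exceptions.ProtocolError while the body is being read, so no separate size check is needed.

```python
import os

import urllib3
from urllib3.exceptions import ProtocolError

CHUNK_SIZE = 64 * 1024  # arbitrary chunk size chosen for this sketch


def stream_to_file(response, out_file, chunk_size=CHUNK_SIZE):
    """Copy a urllib3 response body to out_file and return the bytes written."""
    written = 0
    for chunk in response.stream(chunk_size):
        out_file.write(chunk)
        written += len(chunk)
    return written


def download(url, local_path, http=None):
    """Download url to local_path, letting urllib3 enforce Content-Length."""
    http = http or urllib3.PoolManager()
    response = http.request(
        "GET",
        url,
        preload_content=False,        # stream the body instead of buffering it
        enforce_content_length=True,  # complain if the body is shorter than announced
    )
    try:
        with open(local_path, "wb") as out_file:
            return stream_to_file(response, out_file)
    except ProtocolError:
        # Assumption: a truncated file is useless, so remove it before re-raising.
        if os.path.exists(local_path):
            os.remove(local_path)
        raise
    finally:
        response.release_conn()
```

A caller that wants to retry could catch ProtocolError around download and call it again; whether and how often to retry is a separate decision.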

Metadata

Labels

:Usability (Makes Rally easier to use)
bug (Something's wrong)
good first issue (Small, contained changes that are good for newcomers)
help wanted (We'd be happy about a community contribution)
