Retry incomplete HTTP downloads

We have an internal track that downloads many files individually with `esrally.track.loader.Downloader.download`, which eventually calls `esrally.utils.net.download`, which in turn uses urllib3 to download the data. This `download` function checks that we received as many bytes as the Content-Length header specified. If not, it simply fails:

```
esrally.exceptions.DataError: Download of [~/.rally/benchmarks/data/solutions/logs/system-syslog-logs/document-50.json.bz2] is corrupt. Downloaded [1866599] bytes but [26703576] bytes are expected. Please retry.
```

In that instance, the data was downloaded from https://rally-tracks.elastic.co/observability/logging/system/infra-stats/system.syslog/raw/document-50.json.bz2. This is a proxy maintained by Elastic, and apparently it sometimes serves us incomplete responses.

But why isn't urllib3 covering this for us? https://blog.petrzemek.net/2018/04/22/on-incomplete-http-reads-and-the-requests-library-in-python/ has all the details. The 3.0 branch of requests has died since then, but thankfully we use urllib3 directly, and this post taught me that there is an undocumented flag in urllib3 that covers our use case: https://github.com/urllib3/urllib3/pull/949. (It will become the default in urllib3 v2.)

So it appears that simply setting `enforce_content_length` and removing the custom checks will fix our issue.
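To make the idea concrete, here is a minimal sketch of what this could look like. It is not Rally's actual code: the function names `download` and `download_with_retries` and the parameters `chunk_size`, `attempts` and `backoff` are made up for illustration. It assumes that `enforce_content_length=True` is forwarded from `PoolManager.request` to the response object, and that a short body then surfaces as a `urllib3.exceptions.HTTPError` subclass while the body is being read.

```python
import time

import urllib3


def download(url: str, local_path: str, chunk_size: int = 64 * 1024) -> int:
    """Stream url to local_path and return the number of bytes written.

    Hypothetical helper, not the real esrally.utils.net.download.
    """
    http = urllib3.PoolManager()
    response = http.request(
        "GET",
        url,
        preload_content=False,        # stream the body instead of buffering it
        enforce_content_length=True,  # fail if the body is shorter than Content-Length
    )
    written = 0
    try:
        # Opening with "wb" truncates any partial file left by an earlier attempt.
        with open(local_path, "wb") as out:
            for chunk in response.stream(chunk_size):
                out.write(chunk)
                written += len(chunk)
    finally:
        response.release_conn()
    return written


def download_with_retries(url: str, local_path: str,
                          attempts: int = 3, backoff: float = 1.0) -> int:
    """Call download() up to `attempts` times (hypothetical retry policy)."""
    for attempt in range(1, attempts + 1):
        try:
            return download(url, local_path)
        except urllib3.exceptions.HTTPError:
            # HTTPError is urllib3's base exception, so this also retries
            # on other transport errors, not only on incomplete reads.
            if attempt == attempts:
                raise
            time.sleep(backoff * attempt)
```

With this in place, the manual comparison of downloaded bytes against Content-Length would no longer be needed, since urllib3 raises during the read instead of returning a truncated body.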

Retry incomplete HTTP downloads #1504