[Data] [3/N] StandardScaler Preprocessor with arrow format #59906

Merged
alexeykudinkin merged 28 commits into ray-project:master from xinyuangui2:2_more_preprocessors
Jan 16, 2026
@xinyuangui2 xinyuangui2 commented Jan 6, 2026

Adopt the Arrow format for the StandardScaler preprocessor instead of pandas.

Before (master with pandas):

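| Preprocessor | Scenario | Batch Size | Total Rows | Total Time (s) | Throughput (rows/sec) | Avg Latency (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Min (ms) | Max (ms) |
|--------------|----------|------------|------------|----------------|----------------------|------------------|----------|----------|----------|----------|----------|
| StandardScaler | Arrow | 1 | 100 | 0.556 | 180 | 5.56 | 5.49 | 5.97 | 6.47 | 5.35 | 6.73 |
| StandardScaler | Arrow | 5 | 500 | 0.547 | 914 | 5.47 | 5.42 | 5.72 | 6.18 | 5.30 | 7.13 |
| StandardScaler | Arrow | 10 | 1,000 | 0.549 | 1,821 | 5.49 | 5.42 | 5.60 | 7.36 | 5.33 | 7.73 |
| StandardScaler | Arrow | 20 | 2,000 | 0.546 | 3,663 | 5.46 | 5.44 | 5.65 | 5.76 | 5.33 | 6.14 |
| StandardScaler | Arrow | 50 | 5,000 | 0.555 | 9,016 | 5.55 | 5.41 | 5.78 | 6.69 | 5.32 | 13.40 |
| StandardScaler | Arrow | 100 | 10,000 | 0.540 | 18,507 | 5.40 | 5.39 | 5.53 | 5.64 | 5.29 | 6.09 |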

After:

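| Preprocessor | Scenario | Batch Size | Total Rows | Total Time (s) | Throughput (rows/sec) | Avg Latency (ms) | P50 (ms) | P95 (ms) | P99 (ms) | Min (ms) | Max (ms) |
|--------------|----------|------------|------------|----------------|----------------------|------------------|----------|----------|----------|----------|----------|
| StandardScaler | Arrow | 1 | 100 | 0.054 | 1,840 | 0.54 | 0.53 | 0.58 | 0.62 | 0.52 | 0.75 |
| StandardScaler | Arrow | 5 | 500 | 0.055 | 9,041 | 0.55 | 0.54 | 0.63 | 0.78 | 0.53 | 0.79 |
| StandardScaler | Arrow | 10 | 1,000 | 0.056 | 18,013 | 0.56 | 0.54 | 0.61 | 0.67 | 0.53 | 0.75 |
| StandardScaler | Arrow | 20 | 2,000 | 0.055 | 36,653 | 0.55 | 0.53 | 0.58 | 0.62 | 0.53 | 0.77 |
| StandardScaler | Arrow | 50 | 5,000 | 0.055 | 90,670 | 0.55 | 0.54 | 0.59 | 0.61 | 0.52 | 0.91 |
| StandardScaler | Arrow | 100 | 10,000 | 0.055 | 181,200 | 0.55 | 0.54 | 0.59 | 0.62 | 0.52 | 0.78 |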
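
To make the comparison concrete, here is a small sketch that derives the per-batch-size speedup from the average latencies reported in the two tables. The dictionary and function names are illustrative only and are not part of the PR or the benchmark script; the numbers are copied from the tables as posted.

```python
# Average latency (ms) per batch size, copied from the "Before" and
# "After" tables. Names below are hypothetical, for illustration only.
before_ms = {1: 5.56, 5: 5.47, 10: 5.49, 20: 5.46, 50: 5.55, 100: 5.40}
after_ms = {1: 0.54, 5: 0.55, 10: 0.56, 20: 0.55, 50: 0.55, 100: 0.55}


def latency_speedup(before, after):
    """Return before/after latency ratio for each batch size."""
    return {batch_size: before[batch_size] / after[batch_size] for batch_size in before}


speedups = latency_speedup(before_ms, after_ms)
for batch_size, ratio in speedups.items():
    print(f"batch_size={batch_size:>3}: {ratio:.1f}x lower average latency")
```

On these figures the ratio is roughly 10x at every batch size. Latency is also nearly flat as the batch size grows from 1 to 100 in both tables, which suggests per-call overhead dominates at these sizes and explains why throughput scales almost linearly with batch size.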

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a performant, Arrow-native transformation path for the OneHotEncoder preprocessor, which is a great enhancement. The implementation correctly uses PyArrow and vectorized NumPy operations for efficiency. The new tests are comprehensive, especially the parameterized tests that ensure consistency between the pandas and Arrow implementations. I have one suggestion to improve a test case to make it more robust and accurate.

@xinyuangui2 xinyuangui2 changed the title from "2 more preprocessors" to "[Data] [3/N] StandardScaler Preprocessor with arrow format" Jan 8, 2026
@xinyuangui2 xinyuangui2 marked this pull request as ready for review January 8, 2026 07:41
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner January 8, 2026 07:41
@ray-gardener ray-gardener bot added the "data" label (Ray Data-related issues) Jan 8, 2026
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 added the "go" label (add ONLY when ready to merge, run all tests) Jan 8, 2026
xinyuangui2 and others added 2 commits January 14, 2026 13:29
Comment on lines +137 to +138
# Read all input columns first to avoid reading modified data when
# output_columns[i] == columns[j] for i < j

Comment is confusing -- there's no actual reading, you just get a reference to pa.ChunkedArray
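
For readers following the thread, the hazard the code comment describes can be sketched without Arrow. The example below uses a plain dict of lists as a stand-in for a table, and the function names are hypothetical, not taken from the PR. It shows why the input columns need to be captured before any output is written when an output name coincides with a later input name, and also why, as the reviewer notes, that capture is just holding references rather than reading data.

```python
def transform_in_order(table, columns, output_columns, fn):
    """Naive variant: look up each input just before writing its output."""
    out = dict(table)
    for src, dst in zip(columns, output_columns):
        # If an earlier dst equals this src, we pick up the overwritten column.
        out[dst] = [fn(value) for value in out[src]]
    return out


def transform_capture_first(table, columns, output_columns, fn):
    """Capture references to every input column before writing any output."""
    out = dict(table)
    # No data is copied here; these are references to the original lists.
    inputs = [out[src] for src in columns]
    for column, dst in zip(inputs, output_columns):
        # A new list is bound to dst; the original column object is untouched.
        out[dst] = [fn(value) for value in column]
    return out


def double(value):
    return value * 2


table = {"a": [1.0, 2.0], "b": [10.0, 20.0]}
columns = ["a", "b"]
output_columns = ["b", "c"]  # output_columns[0] == columns[1]

naive = transform_in_order(table, columns, output_columns, double)
safe = transform_capture_first(table, columns, output_columns, double)
print(naive["c"])  # derived from the already-overwritten "b"
print(safe["c"])   # derived from the original "b"
```

The capture-first variant works because outputs are written by rebinding a name to a new column rather than mutating the old one, which is also how immutable Arrow columns behave.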

@alexeykudinkin alexeykudinkin merged commit 2844be3 into ray-project:master Jan 16, 2026
6 checks passed
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jan 18, 2026
…ct#59906)

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
jeffery4011 pushed a commit to jeffery4011/ray that referenced this pull request Jan 20, 2026
…ct#59906)

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: jeffery4011 <jefferyshen1015@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
…ct#59906)

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…ct#59906)

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…ct#59906)

Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests)

4 participants