build: encode non-ASCII Latin1 characters as one byte in JS2C by joyeecheung · Pull Request #51605 · nodejs/node

@nodejs-github-bot added c++

Issues and PRs that require attention from people who are familiar with C++.

lib / src

Issues and PRs related to general changes in the lib or src directory.

needs-ci

PRs that need a full CI run.

labels

Jan 30, 2024

@joyeecheung joyeecheung changed the title tools: encode non-ASCII Latin1 characters as one byte in JS2C build: encode non-ASCII Latin1 characters as one byte in JS2C

Jan 30, 2024

@joyeecheung

Previously we had two encodings for JS files:

1. If a file contains only ASCII characters, encode it as a one-byte
  string (interpreted as uint8_t array during loading).
2. If a file contains any characters with code point above 127,
  encode it as a two-byte string (interpreted as uint16_t array
  during loading).

This was done because V8 only supports Latin-1 and UTF16 encoding
as underlying representation for strings. To store the JS code
as external strings to save encoding cost and memory overhead
we need to follow the representations supported by V8.
Notice that there is a gap in the Latin1 range (128-255) that we
encoded as two-byte, which was an undocumented TODO for a long
time. That was fine previously because then files that contained
code points beyond the 0-127 range contained code points >255.
Now we have undici which contains code points in the range 0-255
(minus a replaceable code point >255). So this patch adds handling
for the 128-255 range to reduce the size overhead caused by encoding
them as two-byte. This could reduce the size of the binary by
~500KB and helps future files with this kind of code points.

Drive-by: replace `’` with `'` in undici.js to make it a Latin-1
only string. That could be removed if undici updates itself to
replace this character in the comment.

Ethan-Arrowood

targos pushed a commit that referenced this pull request

Feb 19, 2024
Previously we had two encodings for JS files:

1. If a file contains only ASCII characters, encode it as a one-byte
  string (interpreted as uint8_t array during loading).
2. If a file contains any characters with code point above 127,
  encode it as a two-byte string (interpreted as uint16_t array
  during loading).

This was done because V8 only supports Latin-1 and UTF16 encoding
as underlying representation for strings. To store the JS code
as external strings to save encoding cost and memory overhead
we need to follow the representations supported by V8.
Notice that there is a gap in the Latin1 range (128-255) that we
encoded as two-byte, which was an undocumented TODO for a long
time. That was fine previously because then files that contained
code points beyond the 0-127 range contained code points >255.
Now we have undici which contains code points in the range 0-255
(minus a replaceable code point >255). So this patch adds handling
for the 128-255 range to reduce the size overhead caused by encoding
them as two-byte. This could reduce the size of the binary by
~500KB and helps future files with this kind of code points.

Drive-by: replace `’` with `'` in undici.js to make it a Latin-1
only string. That could be removed if undici updates itself to
replace this character in the comment.

PR-URL: #51605
Reviewed-By: Daniel Lemire <daniel@lemire.me>
Reviewed-By: Ethan Arrowood <ethan@arrowood.dev>

zcbenz added a commit to zcbenz/node that referenced this pull request

Feb 21, 2024

zcbenz added a commit to zcbenz/node that referenced this pull request

Feb 21, 2024

zcbenz added a commit to zcbenz/node that referenced this pull request

Feb 21, 2024

zcbenz added a commit to zcbenz/node that referenced this pull request

Feb 23, 2024

zcbenz added a commit that referenced this pull request

Feb 23, 2024
This is a follow-up to #51605.

PR-URL: #51818
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz.nizipli@sentry.io>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>

marco-ippolito pushed a commit that referenced this pull request

Feb 26, 2024
This is a follow-up to #51605.

PR-URL: #51818
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz.nizipli@sentry.io>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>

marco-ippolito pushed a commit that referenced this pull request

Feb 26, 2024
This is a follow-up to #51605.

PR-URL: #51818
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz.nizipli@sentry.io>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>

marco-ippolito pushed a commit that referenced this pull request

Feb 27, 2024
This is a follow-up to #51605.

PR-URL: #51818
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz.nizipli@sentry.io>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>

rdw-msft pushed a commit to rdw-msft/node that referenced this pull request

Mar 20, 2024
Previously we had two encodings for JS files:

1. If a file contains only ASCII characters, encode it as a one-byte
  string (interpreted as uint8_t array during loading).
2. If a file contains any characters with code point above 127,
  encode it as a two-byte string (interpreted as uint16_t array
  during loading).

This was done because V8 only supports Latin-1 and UTF16 encoding
as underlying representation for strings. To store the JS code
as external strings to save encoding cost and memory overhead
we need to follow the representations supported by V8.
Notice that there is a gap in the Latin1 range (128-255) that we
encoded as two-byte, which was an undocumented TODO for a long
time. That was fine previously because then files that contained
code points beyond the 0-127 range contained code points >255.
Now we have undici which contains code points in the range 0-255
(minus a replaceable code point >255). So this patch adds handling
for the 128-255 range to reduce the size overhead caused by encoding
them as two-byte. This could reduce the size of the binary by
~500KB and helps future files with this kind of code points.

Drive-by: replace `’` with `'` in undici.js to make it a Latin-1
only string. That could be removed if undici updates itself to
replace this character in the comment.

PR-URL: nodejs#51605
Reviewed-By: Daniel Lemire <daniel@lemire.me>
Reviewed-By: Ethan Arrowood <ethan@arrowood.dev>

richardlau pushed a commit that referenced this pull request

Mar 25, 2024
Previously we had two encodings for JS files:

1. If a file contains only ASCII characters, encode it as a one-byte
  string (interpreted as uint8_t array during loading).
2. If a file contains any characters with code point above 127,
  encode it as a two-byte string (interpreted as uint16_t array
  during loading).

This was done because V8 only supports Latin-1 and UTF16 encoding
as underlying representation for strings. To store the JS code
as external strings to save encoding cost and memory overhead
we need to follow the representations supported by V8.
Notice that there is a gap in the Latin1 range (128-255) that we
encoded as two-byte, which was an undocumented TODO for a long
time. That was fine previously because then files that contained
code points beyond the 0-127 range contained code points >255.
Now we have undici which contains code points in the range 0-255
(minus a replaceable code point >255). So this patch adds handling
for the 128-255 range to reduce the size overhead caused by encoding
them as two-byte. This could reduce the size of the binary by
~500KB and helps future files with this kind of code points.

Drive-by: replace `’` with `'` in undici.js to make it a Latin-1
only string. That could be removed if undici updates itself to
replace this character in the comment.

PR-URL: #51605
Reviewed-By: Daniel Lemire <daniel@lemire.me>
Reviewed-By: Ethan Arrowood <ethan@arrowood.dev>

richardlau pushed a commit that referenced this pull request

Mar 25, 2024
This is a follow-up to #51605.

PR-URL: #51818
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz.nizipli@sentry.io>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>

richardlau pushed a commit that referenced this pull request

Mar 25, 2024
Previously we had two encodings for JS files:

1. If a file contains only ASCII characters, encode it as a one-byte
  string (interpreted as uint8_t array during loading).
2. If a file contains any characters with code point above 127,
  encode it as a two-byte string (interpreted as uint16_t array
  during loading).

This was done because V8 only supports Latin-1 and UTF16 encoding
as underlying representation for strings. To store the JS code
as external strings to save encoding cost and memory overhead
we need to follow the representations supported by V8.
Notice that there is a gap in the Latin1 range (128-255) that we
encoded as two-byte, which was an undocumented TODO for a long
time. That was fine previously because then files that contained
code points beyond the 0-127 range contained code points >255.
Now we have undici which contains code points in the range 0-255
(minus a replaceable code point >255). So this patch adds handling
for the 128-255 range to reduce the size overhead caused by encoding
them as two-byte. This could reduce the size of the binary by
~500KB and helps future files with this kind of code points.

Drive-by: replace `’` with `'` in undici.js to make it a Latin-1
only string. That could be removed if undici updates itself to
replace this character in the comment.

PR-URL: #51605
Reviewed-By: Daniel Lemire <daniel@lemire.me>
Reviewed-By: Ethan Arrowood <ethan@arrowood.dev>

richardlau pushed a commit that referenced this pull request

Mar 25, 2024
This is a follow-up to #51605.

PR-URL: #51818
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz.nizipli@sentry.io>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>

rdw-msft pushed a commit to rdw-msft/node that referenced this pull request

Mar 26, 2024
This is a follow-up to nodejs#51605.

PR-URL: nodejs#51818
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Joyee Cheung <joyeec9h3@gmail.com>
Reviewed-By: Yagiz Nizipli <yagiz.nizipli@sentry.io>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>

codebytere added a commit to electron/electron that referenced this pull request

Apr 11, 2024

codebytere added a commit to electron/electron that referenced this pull request

Apr 16, 2024

jkleinsc pushed a commit to electron/electron that referenced this pull request

Apr 17, 2024
* chore: bump node in DEPS to v20.12.0

* chore: update build_add_gn_build_files.patch

* chore: update patches

* chore: bump node in DEPS to v20.12.1

* chore: update patches

* build: encode non-ASCII Latin1 characters as one byte in JS2C

nodejs/node#51605

* crypto: use EVP_MD_fetch and cache EVP_MD for hashes

nodejs/node#51034

* chore: update filenames.json

* chore: bump node in DEPS to v20.12.2

* chore: update patches

* src: support configurable snapshot

nodejs/node#50453

* test: remove test-domain-error-types flaky designation

nodejs/node#51717

* src: avoid draining platform tasks at FreeEnvironment

nodejs/node#51290

* chore: fix accidentally deleted v8 dep

* lib: define FormData and fetch etc. in the built-in snapshot

nodejs/node#51598

* chore: rebase on main

* chore: remove stray log

---------

Co-authored-by: electron-roller[bot] <84116207+electron-roller[bot]@users.noreply.github.com>
Co-authored-by: Cheng <zcbenz@gmail.com>
Co-authored-by: Shelley Vohr <shelley.vohr@gmail.com>
Co-authored-by: PatchUp <73610968+patchup[bot]@users.noreply.github.com>

codebytere added a commit to electron/electron that referenced this pull request

May 31, 2024

codebytere added a commit to electron/electron that referenced this pull request

Jun 1, 2024
* chore: bump node in DEPS to v20.13.1

* chore: bump node in DEPS to v20.14.0

* chore: update build_add_gn_build_files.patch

* chore: update patches

* chore: update patches

* build: encode non-ASCII Latin1 characters as one byte in JS2C

nodejs/node#51605

* crypto: use EVP_MD_fetch and cache EVP_MD for hashes

nodejs/node#51034

* chore: update filenames.json

* chore: update patches

* src: support configurable snapshot

nodejs/node#50453

* test: remove test-domain-error-types flaky designation

nodejs/node#51717

* src: avoid draining platform tasks at FreeEnvironment

nodejs/node#51290

* chore: fix accidentally deleted v8 dep

* lib: define FormData and fetch etc. in the built-in snapshot

nodejs/node#51598

* chore: remove stray log

* crypto: enable NODE_EXTRA_CA_CERTS with BoringSSL

nodejs/node#52217

* test: skip test for dynamically linked OpenSSL

nodejs/node#52542

* lib, url: add a `windows` option to path parsing

nodejs/node#52509

* src: use dedicated routine to compile function for builtin CJS loader

nodejs/node#52016

* test: mark test as flaky

nodejs/node#52671

* build,tools: add test-ubsan ci

nodejs/node#46297

* src: preload function for Environment

nodejs/node#51539

* deps: update c-ares to 1.28.1

nodejs/node#52285

* chore: fixup

* events: extract addAbortListener for safe internal use

nodejs/node#52081

* module: print location of unsettled top-level await in entry points

nodejs/node#51999

* fs: add stacktrace to fs/promises

nodejs/node#49849

* chore: fixup indices

---------

Co-authored-by: electron-roller[bot] <84116207+electron-roller[bot]@users.noreply.github.com>
Co-authored-by: Cheng <zcbenz@gmail.com>
Co-authored-by: Shelley Vohr <shelley.vohr@gmail.com>
Co-authored-by: PatchUp <73610968+patchup[bot]@users.noreply.github.com>