feat: add CSV conversion command to ensrainbow CLI #1136
base: main
Conversation
🦋 Changeset detected. Latest commit: b02b7f1. The changes in this PR will be included in the next version bump. This PR includes changesets to release 14 packages.
- Introduced `convert-csv` command for converting CSV files to .ensrainbow format.
- Added support for single- and two-column CSV formats.
- Implemented error handling for invalid CSV data.
- Created tests for various CSV scenarios, including special characters and invalid formats.
- Updated package dependencies to include `csv-simple-parser` for CSV parsing.
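The single- vs two-column handling described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the real implementation uses a CSV parser (`csv-simple-parser`), and the type and function names here are stand-ins.

```typescript
// Hypothetical shape of the one- vs two-column CSV row handling.
interface RainbowRecord {
  label: string;
  providedHash?: string; // present only for two-column rows
}

function parseCsvRow(row: string[]): RainbowRecord {
  if (row.length === 1) {
    // Single column: only the label; its labelhash must be computed later.
    return { label: row[0] };
  }
  if (row.length === 2) {
    // Two columns: a pre-computed labelhash followed by the label.
    return { providedHash: row[0], label: row[1] };
  }
  throw new Error(`Invalid CSV row: expected 1 or 2 columns, got ${row.length}`);
}
```

Any other column count is treated as invalid CSV data, matching the error-handling bullet above.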
- Added new command-line options for CSV conversion: `--silent`, `--disable-dedup`, `--cache-size`, `--use-bloom-filter`, and `--bloom-filter-size`.
- Implemented a deduplication database using ClassicLevel with optional Bloom filter for faster processing.
- Updated the conversion process to support deduplication and improved memory management.
- Enhanced logging for large file processing and added tests for new deduplication features.
- Added a function to estimate memory usage of Maps for better tracking.
- Reduced default cache size in DeduplicationDB from 10000 to 1000.
- Enhanced backpressure handling during CSV writing to prevent memory overflow.
- Updated logging to include output backpressure events and improved performance for large files.
- Streamlined the CSV processing to operate in a completely sequential manner.
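The Map memory estimation mentioned above could look something like the sketch below. This is an illustrative approximation, not the PR's actual function: the overhead constant is an assumption, and real V8 memory accounting is more involved.

```typescript
// Rough sketch of estimating a Map's memory footprint. Assumes string
// keys/values at ~2 bytes per UTF-16 code unit, plus an assumed fixed
// per-entry overhead for the Map's internal bookkeeping.
const PER_ENTRY_OVERHEAD_BYTES = 64; // illustrative constant

function estimateMapMemoryBytes(map: Map<string, string>): number {
  let bytes = 0;
  for (const [key, value] of map) {
    bytes += (key.length + value.length) * 2 + PER_ENTRY_OVERHEAD_BYTES;
  }
  return bytes;
}
```

An estimate like this is cheap to compute on a progress interval and lets the converter log approximate cache pressure without walking the heap.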
Force-pushed from 6f4803c to 721a50d.
- Removed unused command-line options for deduplication and Bloom filter from the CLI interface.
- Updated default progress interval from 10000 to 50000 records for improved performance.
- Enhanced logging for file processing and memory management during CSV conversion.
- Cleaned up code for better readability and maintainability.
lightwalker-eth left a comment:
@djstrong Great work here 👍 Reviewed and shared some suggestions. Appreciate your advice 👍
"ensrainbow": patch
---

feat: add CSV conversion command to ensrainbow CLI
Suggested change:
- feat: add CSV conversion command to ensrainbow CLI
+ feat: add CSV conversion command to ensrainbow CLI to convert rainbow tables from CSV format to ensrainbow format
```bash title="Convert legacy SQL data"
pnpm run convert --input-file path/to/ens_names.sql.gz --output-file subgraph-0.ensrainbow
```
- **SQL Conversion**: Convert legacy ENS Subgraph data (`ens_names.sql.gz`) using `pnpm run convert`
Would it be a problem to flip these? In other words, the default conversion case moving forward will be converting from CSV files. Converting from SQL files is what should be the special case.
Therefore it seems to me that convert should convert a CSV and convert-sql should convert a legacy SQL file such as the legacy ENS Subgraph data.
What do you think?
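The naming the reviewer proposes could be dispatched as sketched below. The command names follow the suggestion in the comment; the dispatch function and `Converter` type are hypothetical, not the CLI's actual structure.

```typescript
// Sketch of the proposed command naming: `convert` handles CSV (the default
// case going forward), `convert-sql` handles legacy SQL dumps.
type Converter = (inputFile: string, outputFile: string) => void;

function resolveConverter(
  command: string,
  converters: { csv: Converter; sql: Converter },
): Converter {
  switch (command) {
    case "convert":
      return converters.csv; // default: CSV is the common case
    case "convert-sql":
      return converters.sql; // special case: legacy ENS Subgraph SQL dumps
    default:
      throw new Error(`Unknown command: ${command}`);
  }
}
```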
```bash title="Convert legacy SQL data"
pnpm run convert --input-file path/to/ens_names.sql.gz --output-file subgraph-0.ensrainbow
```
- **SQL Conversion**: Convert legacy ENS Subgraph data (`ens_names.sql.gz`) using `pnpm run convert`
For converting legacy ENS Subgraph data, suggest including a link to https://github.com/graphprotocol/ens-rainbow
@@ -0,0 +1,675 @@
---
title: Creating ENSRainbow Files
description: Complete guide to creating .ensrainbow files from SQL dumps and CSV data.
Suggested change:
- description: Complete guide to creating .ensrainbow files from SQL dumps and CSV data.
+ description: Complete guide to creating .ensrainbow files.
sidebar:
  label: Creating Files
  order: 3
keywords: [ensrainbow, file creation, conversion, sql, csv]
Suggested change:
- keywords: [ensrainbow, file creation, conversion, sql, csv]
+ keywords: [ensrainbow, file creation, conversion, csv]
Goal: The .sql file conversion case is a really niche use case. Happy for us to include docs about it, however we should limit our promotion of the .sql conversion case. It exists only for legacy purposes. In general we should only talk about the .csv conversion case.
}
const maybeLabelHash = providedHash.startsWith("0x") ? providedHash : `0x${providedHash}`;
try {
  const labelHash = labelHashToBytes(maybeLabelHash as LabelHash);
Hmm, where did we validate that the provided labelhash is actually the correct labelhash for the label? Or was there a performance issue if we did that?
At the very minimum I would want us to add a check here that the bytes we are given are the expected length of a valid labelhash. For example, "0x1234" is not enough bytes and should be invalid. Or is that already handled inside labelHashToBytes?
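The minimal length check the reviewer asks for could look like this. A labelhash is a keccak-256 output, i.e. exactly 32 bytes (64 hex characters after `0x`); whether `labelHashToBytes` already enforces this is the open question above, and the function name here is hypothetical.

```typescript
// Sketch: a well-formed labelhash is "0x" followed by exactly 64 hex chars.
// This checks shape only; it does NOT verify the hash matches the label.
function isWellFormedLabelHash(hash: string): boolean {
  return /^0x[0-9a-fA-F]{64}$/.test(hash);
}
```

A cheap shape check like this would reject inputs such as `"0x1234"` without paying the cost of recomputing keccak-256 per row.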
  label: label,
};
} else {
  // Two columns: validate and use provided hash
I see here it says "validate" but I don't see full validation?
outputStream: NodeJS.WritableStream,
lineNumber: number,
existingDb: ENSRainbowDB | null,
dedupDb: DeduplicationDB,
What if we removed DeduplicationDb and instead used existingDb for this purpose?
In other words, what if we changed the convert command so that it was always a combined "convert + ingest"?
In other words:
- If we ingest from a .ensrainbow file then it doesn't need to output a new .ensrainbow file
- If we ingest from a .csv file or a .sql file then it might initialize / update the `existingDb` while also producing the incremental .ensrainbow file as output.
If we did this then it seems we could ingest as we convert and then remove the need for this dedupDb?
Appreciate your advice if this is a good idea or if it would add a bunch of scope. Thanks
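The combined "convert + ingest" flow being floated could be sketched as below. Everything here is hypothetical: `existingDb` is modeled as an in-memory Map for illustration, while the real store would be a LevelDB instance and the real records carry labelhashes, not bare labels.

```typescript
// Sketch: dedup against the existing database instead of a separate dedupDb.
function convertRecord(
  label: string,
  existingDb: Map<string, true>,
  emit: (label: string) => void,
): boolean {
  if (existingDb.has(label)) {
    return false; // already ingested: skip, no dedupDb needed
  }
  existingDb.set(label, true); // ingest as we convert...
  emit(label); // ...while also emitting the incremental output record
  return true;
}
```

The trade-off discussed below is that the existing database becomes mutable during conversion, so a mid-run failure can leave it in a partially updated state.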
The drawback of the proposed approach is that the existing database would change during the process, or could even be left in an unhealthy state if an error occurs mid-process.
In DeduplicationDB we save only labels, so I hope the change would not increase memory consumption.
This is probably a good change for user experience, but it may take some time to implement and test.
The decision is to not combine the commands in this PR.
 */
export async function convertCsvCommand(options: ConvertCsvCommandOptions): Promise<void> {
  // Validate that existingDbPath is provided when labelSetVersion > 0
  if (options.labelSetVersion > 0 && !options.existingDbPath) {
Please see my other feedback suggesting that we remove the labelSetVersion option from this command as it can be determined dynamically through the existingDbPath
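Determining the version dynamically, as suggested, could be sketched like this. The `ExistingDbInfo` type and function name are hypothetical stand-ins for whatever the PR's actual database interface exposes.

```typescript
// Sketch: derive the next label set version from the existing database
// instead of taking a user-supplied --label-set-version option.
interface ExistingDbInfo {
  highestLabelSetVersion: number;
}

function resolveLabelSetVersion(existing: ExistingDbInfo | null): number {
  // No existing database: this is the first label set, version 0.
  if (existing === null) return 0;
  // Otherwise the new file extends the existing sets by one version.
  return existing.highestLabelSetVersion + 1;
}
```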
import { ENSRainbowDB } from "@/lib/database";
import { convertCsvCommand } from "./convert-csv-command";
Super work on these tests 🚀
`convert-csv` command for converting CSV files to .ensrainbow format. `fast-csv` for CSV parsing.