Skip to content
This repository was archived by the owner on Feb 18, 2025. It is now read-only.
This repository was archived by the owner on Feb 18, 2025. It is now read-only.

AAN dataset crashing when loading .tsv file #53

@exnx

Description

@exnx

Did anyone else have issues loading the AAN dataset into memory? In particular when I load the .tsv file into memory, it crashes :/ I used several different instances on Google Cloud, with varying amount of memory, up to 170G, 24 cpus, but it still crashed. I feel like I am missing something. Here's my snippet of code that crashes the instance every time.

from datasets import DatasetDict, Value, load_dataset
...

        dataset = load_dataset(
            "csv",
            data_files={
                "train": str(self.data_dir / "new_aan_pairs.train.tsv"),  # 8G file
                "val": str(self.data_dir / "new_aan_pairs.eval.tsv"),
                "test": str(self.data_dir / "new_aan_pairs.test.tsv"),
            },
            delimiter="\t",
            column_names=["label", "input1_id", "input2_id", "text1", "text2"],
            keep_in_memory=True,

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions