Duplicates in Output Files

Dear @SKruthoff and @ysherstyuk, 

Thanks a lot for your work on this!
@Tilmon noticed the following in the emission profile output files:

> Duplicates: running dplyr::distinct() on the datasets emission_profile_company.csv, emission_profile_product.csv, and emission_profile_upstream_at_company_level.csv shows that all three contain duplicate rows. Only these three were tested. All datasets should be tested for duplicates, and duplicates should be avoided. For example, the companies_id "adolf-wurth-gmbh-co-kg_00000004971238-001" has every row twice in emission_profile_product.csv.

Could you double-check whether a quality check is included that would prevent this? Do you know where the duplicates come from? Is this an issue in the code on GitHub, or is something happening on Databricks that introduces them? If it is due to the code on GitHub, we would need to investigate where they originate.
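
To make the request concrete, here is a minimal sketch of the kind of check I have in mind. It counts fully duplicated rows in every CSV in a folder. The helper name `count_duplicate_rows()` and the folder name `output` are assumptions for illustration and are not taken from the existing code; it uses base R `duplicated()` rather than dplyr::distinct(), so it needs no extra packages.

```r
# Hypothetical helper (not part of the existing code base):
# count rows that exactly repeat an earlier row in one output file.
count_duplicate_rows <- function(path) {
  data <- utils::read.csv(path, stringsAsFactors = FALSE)
  # duplicated() flags each row that is identical to an earlier row
  sum(duplicated(data))
}

# Assumed location of the output files; adjust to the real directory.
output_dir <- "output"
files <- list.files(output_dir, pattern = "\\.csv$", full.names = TRUE)

# One count per file, named by file path
duplicates <- vapply(files, count_duplicate_rows, integer(1))

# Keep only the files that contain at least one duplicated row
duplicates[duplicates > 0]
```

A check like this could run as a test after the output files are written. It would not tell us whether the duplicates come from the GitHub code or from Databricks, but running it on the output of both would narrow that down.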

Best
Anne

Duplicates in Output Files #115
