Adding codebleu as code similarity metric (#116)

Closed
matthieumeeus wants to merge 1 commit into facebookresearch:main from matthieumeeus:export-D96365690

Conversation

@matthieumeeus matthieumeeus commented Mar 23, 2026

Summary:

Adding the CodeBLEU similarity metric from https://arxiv.org/pdf/2009.10297 to PrivacyGuard, completing the code memorization measurement pipeline. We borrow substantial functionality from the CodeBLEU package available on PyPI (version 0.6.0, https://pypi.org/project/codebleu/0.6.0/) and adapt it to our framework.

This diff introduces:

  • CodeBleuAttack: A new BaseAttack that prepares target and generated code for CodeBLEU similarity analysis. Parses code into tokens, ASTs (via tree-sitter), and normalized data flow graphs (DFGs). Supports multiple languages (Python, Java, JavaScript, Go, Ruby, Rust, C, C++, C#, PHP) with per-language keyword weighting. Caches parsers and keywords for efficiency.

  • CodeBleuNode: A new BaseAnalysisNode that computes CodeBLEU similarity between code pairs produced by CodeBleuAttack. Implements the composite metric from Ren et al. 2020 as a weighted sum of four components: (i) ngram_match: Standard BLEU score measuring n-gram overlap; (ii) weighted_ngram_match: BLEU with reduced weight (0.2) for non-keyword tokens; (iii) syntax_match: Fraction of target AST subtrees found in generated AST; (iv) dataflow_match: Fraction of target DFG edges found in generated code.

  • CodeBleuNodeOutput: A BaseAnalysisOutput dataclass with fields for num_samples, per_sample_code_bleu, avg_code_bleu, and optional avg_code_bleu_by_language.

  • CodeBleuAnalysisInput: A BaseAnalysisInput class that validates required columns (tokens, ASTs, DFGs) produced by the attack.
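To illustrate how the four component scores could combine into a single value, here is a minimal sketch in plain Python. It is not the code in this diff: the function names, the equal default weights of 0.25, and the choice to return 0.0 for an empty target are assumptions made for illustration. The `fraction_found` helper is a simplified set-based stand-in for the "fraction of target items found in generated" idea behind syntax_match and dataflow_match.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple


def combine_code_bleu(
    ngram_match: float,
    weighted_ngram_match: float,
    syntax_match: float,
    dataflow_match: float,
    weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Weighted sum of the four components (equal weights are an assumption)."""
    components = (ngram_match, weighted_ngram_match, syntax_match, dataflow_match)
    return sum(w * c for w, c in zip(weights, components))


def fraction_found(
    target_items: Sequence[Hashable], generated_items: Iterable[Hashable]
) -> float:
    """Fraction of target items (e.g. DFG edges) also present in the generated items.

    Hypothetical helper: returns 0.0 when the target is empty, which is an
    assumption and may differ from how the real node handles that case.
    """
    if not target_items:
        return 0.0
    generated = set(generated_items)
    hits = sum(1 for item in target_items if item in generated)
    return hits / len(target_items)


def average_by_language(
    scores: Sequence[float], languages: Sequence[str]
) -> Dict[str, float]:
    """Group per-sample scores by language and average each group."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for score, language in zip(scores, languages):
        buckets[language].append(score)
    return {lang: sum(vals) / len(vals) for lang, vals in buckets.items()}


if __name__ == "__main__":
    # Toy DFG edges, written as (source, target) pairs.
    target_edges = [("a", "b"), ("b", "c")]
    generated_edges = [("a", "b")]
    dataflow = fraction_found(target_edges, generated_edges)
    print(combine_code_bleu(1.0, 0.5, 0.5, dataflow))
    print(average_by_language([0.5, 1.0, 0.25], ["python", "python", "java"]))
```

The grouping helper mirrors the optional avg_code_bleu_by_language field described above: per-sample scores are bucketed by language and averaged within each bucket.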

Differential Revision: D96365690

meta-codesync Bot commented Mar 23, 2026

@matthieumeeus has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96365690.

@meta-cla meta-cla Bot added the CLA Signed label Mar 23, 2026
@meta-codesync meta-codesync Bot changed the title Adding codebleu as code similarity metric Adding codebleu as code similarity metric (#116) Mar 24, 2026
matthieumeeus pushed a commit to matthieumeeus/PrivacyGuard that referenced this pull request Mar 24, 2026
matthieumeeus pushed a commit to matthieumeeus/PrivacyGuard that referenced this pull request Mar 24, 2026
matthieumeeus pushed a commit to matthieumeeus/PrivacyGuard that referenced this pull request Mar 24, 2026
Summary:
Pull Request resolved: facebookresearch#116

matthieumeeus pushed a commit to matthieumeeus/PrivacyGuard that referenced this pull request Mar 25, 2026
@matthieumeeus matthieumeeus force-pushed the export-D96365690 branch 2 times, most recently from b22705a to da8b706 Compare March 25, 2026 15:25
matthieumeeus pushed a commit to matthieumeeus/PrivacyGuard that referenced this pull request Mar 25, 2026
matthieumeeus pushed a commit to matthieumeeus/PrivacyGuard that referenced this pull request Mar 25, 2026
meta-codesync Bot commented Mar 26, 2026

This pull request has been merged in e4fdc0e.

Labels

CLA Signed, fb-exported, Merged, meta-exported
