Adding codebleu as code similarity metric (#116) by matthieumeeus · Pull Request #116 · facebookresearch/PrivacyGuard

matthieumeeus · 2026-03-23T20:56:01Z

Summary:

Adding the CodeBLEU similarity metric (https://arxiv.org/pdf/2009.10297) to PrivacyGuard, completing the code memorization measurement pipeline. We borrow substantial functionality from the CodeBLEU package available on PyPI (version 0.6.0, https://pypi.org/project/codebleu/0.6.0/) and adapt it to our framework.

This diff introduces:

- `CodeBleuAttack`: A new BaseAttack that prepares target and generated code for CodeBLEU similarity analysis. Parses code into tokens, ASTs (via tree-sitter), and normalized data flow graphs (DFGs). Supports multiple languages (Python, Java, JavaScript, Go, Ruby, Rust, C, C++, C#, PHP) with per-language keyword weighting. Caches parsers and keywords for efficiency.
- `CodeBleuNode`: A new BaseAnalysisNode that computes CodeBLEU similarity between code pairs produced by `CodeBleuAttack`. Implements the composite metric from Ren et al. 2020 as a weighted sum of four components:
  - (i) `ngram_match`: standard BLEU score measuring n-gram overlap;
  - (ii) `weighted_ngram_match`: BLEU with reduced weight (0.2) for non-keyword tokens;
  - (iii) `syntax_match`: fraction of target AST subtrees found in the generated AST;
  - (iv) `dataflow_match`: fraction of target DFG edges found in the generated code.
- `CodeBleuNodeOutput`: A BaseAnalysisOutput dataclass with fields for `num_samples`, `per_sample_code_bleu`, `avg_code_bleu`, and optional `avg_code_bleu_by_language`.
- `CodeBleuAnalysisInput`: A BaseAnalysisInput class that validates required columns (tokens, ASTs, DFGs) produced by the attack.
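
As a rough illustration of how the four components might combine into one score and then be aggregated, here is a minimal sketch. The names `combine_code_bleu` and `summarize`, the equal 0.25 weights, and the plain-dict output are assumptions made for illustration; they are not the API added in this PR, and the component scores are taken as given rather than computed from tokens, ASTs, or DFGs.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical default: equal weight per component (an assumption, not from the PR).
DEFAULT_WEIGHTS = (0.25, 0.25, 0.25, 0.25)


def combine_code_bleu(ngram_match, weighted_ngram_match, syntax_match,
                      dataflow_match, weights=DEFAULT_WEIGHTS):
    """Return the weighted sum of the four component scores (each in [0, 1])."""
    components = (ngram_match, weighted_ngram_match, syntax_match, dataflow_match)
    if len(weights) != len(components):
        raise ValueError("expected one weight per component")
    return sum(w * c for w, c in zip(weights, components))


def summarize(scores, languages=None):
    """Aggregate per-sample scores into an overall mean and optional per-language means."""
    summary = {
        "num_samples": len(scores),
        "per_sample_code_bleu": list(scores),
        "avg_code_bleu": mean(scores) if scores else 0.0,
    }
    if languages is not None:
        # Group scores by language label before averaging each group.
        by_language = defaultdict(list)
        for score, language in zip(scores, languages):
            by_language[language].append(score)
        summary["avg_code_bleu_by_language"] = {
            language: mean(values) for language, values in by_language.items()
        }
    return summary


if __name__ == "__main__":
    scores = [
        combine_code_bleu(1.0, 1.0, 1.0, 1.0),  # identical code pair
        combine_code_bleu(0.5, 0.5, 0.0, 0.0),  # partial token overlap only
    ]
    print(summarize(scores, languages=["python", "java"]))
```

The weights are passed as a parameter so that a different weighting of the syntactic and data-flow components can be tried without changing the combination logic.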

Differential Revision: D96365690

meta-codesync · 2026-03-23T20:56:07Z

@matthieumeeus has exported this pull request. If you are a Meta employee, you can view the originating Diff in D96365690.

meta-codesync · 2026-03-26T14:13:46Z

This pull request has been merged in e4fdc0e.

meta-cla Bot added the CLA Signed label Mar 23, 2026

meta-codesync Bot added the fb-exported and meta-exported labels Mar 23, 2026

meta-codesync Bot changed the title ~~Adding codebleu as code similarity metric~~ Adding codebleu as code similarity metric (#116) Mar 24, 2026

matthieumeeus force-pushed the export-D96365690 branch from e936f58 to 9d9e33f Compare March 24, 2026 14:11

matthieumeeus force-pushed the export-D96365690 branch from 9d9e33f to f2fbd38 Compare March 24, 2026 19:29

matthieumeeus force-pushed the export-D96365690 branch from f2fbd38 to 5721647 Compare March 24, 2026 22:14

matthieumeeus force-pushed the export-D96365690 branch 2 times, most recently from b22705a to da8b706 Compare March 25, 2026 15:25

matthieumeeus force-pushed the export-D96365690 branch from da8b706 to 98b9bdb Compare March 25, 2026 15:58

matthieumeeus force-pushed the export-D96365690 branch from 98b9bdb to a2df9e7 Compare March 25, 2026 16:11

meta-codesync Bot closed this in e4fdc0e Mar 26, 2026

facebook-github-tools Bot added the Merged label Mar 26, 2026

matthieumeeus wants to merge 1 commit into facebookresearch:main from matthieumeeus:export-D96365690
