Implement intelligent NLP-based resume parsing with entity extraction and skill recognition #80

@anshul23102

Description

The current resume parser relies on basic regex patterns and misses about 40% of skills, experience entries, and education details. NLP-based extraction with skill ontology mapping would identify competencies more accurately, enabling proper job matching and skill gap analysis.

Current Impact: 60% extraction accuracy, many skills missed, resulting in poor job matches.

Expected Business Value: 70% improvement in resume understanding, 85% higher match quality, and coverage expanded from the current 3 skill categories to 15+.

Steps to Reproduce

  1. Upload resume with diverse skills (e.g., 'TensorFlow', 'full-stack development', 'AWS DevOps')
  2. Check extracted skills
  3. Observe: many skills missed, typos not recognized, related skills not grouped

Environment Information

  • Python 3.8+
  • NLTK/spaCy available
  • Node.js for UI (if applicable)
  • Test data: 50 sample resumes

Expected Behavior

  • Extracts 90%+ of skills from diverse resumes
  • Recognizes skill variations (e.g., 'react' and 'reactjs' as same)
  • Groups related skills (Python, Java -> Languages)
  • Confidence scores for each extraction
  • Handles typos and abbreviations
  • Supports 50+ skill categories
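
To illustrate the variant recognition and grouping described above, here is a minimal sketch using only the standard library. The `ONTOLOGY` entries, the category names, and the `normalize` and `group_skills` helpers are all hypothetical; they are not part of the existing parser.

```python
import re
from collections import defaultdict

# Hypothetical mini-ontology: canonical skill -> (category, known variants).
ONTOLOGY = {
    "React": ("Frameworks", ["react", "reactjs", "react.js"]),
    "Python": ("Languages", ["python"]),
    "Java": ("Languages", ["java"]),
    "Full-Stack Development": ("Roles", ["full-stack", "fullstack", "full stack"]),
}


def normalize(term: str) -> str:
    """Lowercase and drop spaces, dots, hyphens, and underscores so variants collide."""
    return re.sub(r"[\s.\-_]+", "", term.lower())


# Reverse index from normalized variant to canonical skill name.
VARIANT_INDEX = {
    normalize(variant): canonical
    for canonical, (_, variants) in ONTOLOGY.items()
    for variant in variants
}


def group_skills(raw_terms):
    """Map raw terms to canonical skills and group them by category."""
    grouped = defaultdict(set)
    for term in raw_terms:
        canonical = VARIANT_INDEX.get(normalize(term))
        if canonical is not None:
            category = ONTOLOGY[canonical][0]
            grouped[category].add(canonical)
    return {category: sorted(names) for category, names in grouped.items()}


print(group_skills(["react", "ReactJS", "Python", "Java", "full stack"]))
```

Stripping separators before lookup lets 'react', 'reactjs', and 'React.Js' resolve to one entry without listing every casing. Terms that are absent from the index are silently dropped in this sketch; a real implementation would likely route them to fuzzy matching.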

Actual Behavior

  • Only 60% extraction accuracy
  • Regex-only approach misses many skills
  • No synonym/variant recognition
  • No skill grouping or categorization
  • No confidence metrics

Screenshots or Recordings

Not applicable - parsing logic missing

Additional Context

Affected Users: Job seekers with diverse backgrounds; tech skills not properly recognized.

Root Cause: Regex-based extraction too simplistic for skill diversity.

Proposed Solution: Use NLP entity recognition (spaCy) plus skill ontology database matching.
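
As a rough sketch of the ontology matching half of this proposal, the snippet below scans text for the longest phrase that appears in a skill table. The regex tokenizer is only a stand-in for spaCy tokenization or NER, and the `SKILLS` table, `MAX_NGRAM`, `tokenize`, and `link_skills` are hypothetical names chosen for illustration.

```python
import re

# Hypothetical ontology lookup: lowercase surface form -> canonical skill.
SKILLS = {
    "tensorflow": "TensorFlow",
    "aws": "AWS",
    "devops": "DevOps",
    "full-stack development": "Full-Stack Development",
}
MAX_NGRAM = 3  # longest phrase length (in tokens) to try


def tokenize(text):
    """Simple stand-in for an NLP tokenizer: keeps letters, digits, '+', '#', '.', '-'."""
    return re.findall(r"[A-Za-z0-9+#.\-]+", text)


def link_skills(text):
    """Return canonical skills found in text, preferring the longest match at each position."""
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_NGRAM, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower().strip(".")
            if phrase in SKILLS:
                if SKILLS[phrase] not in found:
                    found.append(SKILLS[phrase])
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts here; move on one token
    return found


sample = "Built TensorFlow models, did full-stack development, and ran AWS DevOps pipelines."
print(link_skills(sample))
```

Trying longer phrases first keeps 'full-stack development' from being split into unrelated single-token matches.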

Implementation Steps:

  1. Build skill ontology (skills.json) with 500+ entries, variants, categories
  2. Integrate spaCy NER for entity recognition
  3. Implement skill entity linking to ontology
  4. Add fuzzy matching for typos (Levenshtein distance)
  5. Implement skill grouping logic
  6. Add confidence scoring mechanism
  7. Create REST endpoint: POST /parse-resume returns JSON
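
Steps 4, 6, and 7 could be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive design: the `CANONICAL` list, the confidence formula (one minus edit distance divided by the longer string's length), the `max_distance` default, and the response field names are all hypothetical. Only the 0.8 review threshold comes from the test cases in this issue.

```python
# Hypothetical canonical skill names; the real list would come from skills.json.
CANONICAL = ["Python", "React", "Docker", "JavaScript"]


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def match_skill(term, max_distance=2):
    """Return the closest canonical skill within max_distance edits, or None."""
    best, best_distance = None, max_distance + 1
    for name in CANONICAL:
        distance = levenshtein(term.lower(), name.lower())
        if distance < best_distance:
            best, best_distance = name, distance
    if best is None:
        return None
    # Assumed scoring: fewer edits relative to word length means higher confidence.
    confidence = 1 - best_distance / max(len(term), len(best))
    return {
        "skill": best,
        "confidence": round(confidence, 2),
        "needs_review": confidence <= 0.8,
    }


def build_response(terms):
    """Assemble a JSON-serializable payload such as POST /parse-resume might return."""
    matches = [(term, match_skill(term)) for term in terms]
    return {
        "skills": [m for _, m in matches if m is not None],
        "unmatched": [term for term, m in matches if m is None],
    }


print(build_response(["Pyton", "Dokcer", "Kubernetes"]))
```

A fixed edit-distance cap can over-match very short terms, so a production version would probably scale `max_distance` with term length.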

Test Cases:

  • Resume 1 (tech): extracts Python, JavaScript, AWS, Docker (expect all 4)
  • Resume 2 (typos): 'Pyton', 'React.Js' (expect recognized as Python, React)
  • Resume 3 (synonyms): 'full-stack', 'fullstack', 'full stack' (expect grouped as same)
  • Resume 4 (edge): 25 skills mentioned (expect 90%+ accuracy)
  • Confidence: accepted skills score above 0.8; lower-confidence extractions are flagged for review
  • Performance: parse resume <2 seconds
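
The accuracy and performance targets above could be checked with helpers along these lines. This sketch assumes that "accuracy" means recall (the share of expected skills that were extracted), which the issue does not state explicitly; `extraction_recall` and `timed` are hypothetical names.

```python
import time


def extraction_recall(expected, extracted):
    """Fraction of expected skills present in the extracted list (case-insensitive)."""
    expected_set = {skill.lower() for skill in expected}
    if not expected_set:
        return 1.0
    hits = expected_set & {skill.lower() for skill in extracted}
    return len(hits) / len(expected_set)


def timed(parse_fn, text):
    """Run parse_fn(text) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = parse_fn(text)
    return result, time.perf_counter() - start


# Resume 1 style check: all four expected skills should be found.
expected = ["Python", "JavaScript", "AWS", "Docker"]
print(extraction_recall(expected, ["python", "aws", "docker"]))

# Performance check against a placeholder parser (str.split stands in for the real one).
_, elapsed = timed(str.split, "Python JavaScript AWS Docker")
print(elapsed < 2.0)
```

For Resume 4, extracting 23 of 25 listed skills would give a recall of 0.92, which meets the 90%+ target.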

Severity: High - critical for feature accuracy
Expected Points: 500-600 GSSoC points

Suggested Labels

enhancement, nlp, parsing, skill-extraction, ml, resume-analysis, GSSoC26
