
Smart resume parser practical tutorial: build an automated talent screening system based on spaCy + Flask


1. Project background and technical selection

In the human resources field, HR teams that need to process hundreds of resumes every day face real challenges: manual screening is inefficient, key information is easily missed, and comparing candidates across documents is hard. This tutorial builds an end-to-end intelligent resume analysis system that automatically extracts candidates' core information with NLP and presents it through a web service.

Technology stack analysis

Component         Functional positioning                   Alternatives
PDFPlumber        PDF text extraction                      PyPDF2, camelot
spaCy             Entity recognition and NLP processing    NLTK, Transformers
Flask             Web service framework                    FastAPI, Django
Native HTML/JS    Front-end display (optional)             React, Angular

2. System architecture design

graph TD
    A[User uploads PDF resume] --> B{Flask backend}
    B --> C[PDF parsing module]
    C --> D[Text preprocessing]
    D --> E[Entity recognition model]
    E --> F[Key information extraction]
    F --> G[Database storage]
    G --> H[Front-end display]
    style B fill:#4CAF50,color:white
    style E fill:#2196F3,color:white

3. Core module implementation in detail

3.1 PDF parsing layer (PDFPlumber)

# pdf_parser.py
import pdfplumber
import re

def extract_text(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text() or ""
            text += page_text + "\n"
    return clean_text(text)

def clean_text(raw_text):
    # Remove control characters and collapse extra whitespace
    text = re.sub(r'[\x00-\x1F]+', ' ', raw_text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

Advanced processing tips

  1. Scanned PDFs: integrate Tesseract OCR;
  2. Table data extraction: use the extract_tables() method;
  3. Layout analysis: use the chars property to obtain per-character text coordinates (a sketch of tips 2 and 3 follows).
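
Below is a minimal sketch of tips 2 and 3, under the same assumption as the parser above (a pdf_path variable pointing to a readable PDF): extract_tables() returns each table as a list of rows, and the chars list exposes per-character coordinates.

# pdf_tables.py - illustrative sketch for tips 2 and 3
import pdfplumber

def extract_tables_and_coords(pdf_path):
    tables, char_coords = [], []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Tip 2: each table comes back as a list of rows (lists of cell strings)
            tables.extend(page.extract_tables())
            # Tip 3: each char dict carries its text and bounding-box coordinates
            for ch in page.chars:
                char_coords.append((ch["text"], ch["x0"], ch["top"]))
    return tables, char_coords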

3.2 NLP processing layer (spaCy)

3.2.1 Custom entity recognition model training

  1. Prepare labeled data (JSON format example; start/end are character offsets into text):
[
  {
    "text": "Zhang San graduated from Peking University's Computer Science and Technology in 2018",
    "entities": [
      {"start": 0, "end": 9, "label": "NAME"},
      {"start": 25, "end": 42, "label": "EDU_ORG"},
      {"start": 45, "end": 76, "label": "MAJOR"},
      {"start": 80, "end": 84, "label": "GRAD_YEAR"}
    ]
  }
]

2. Training process code:

# train_ner.py (spaCy 2.x training API)
import spacy
from spacy.util import minibatch, compounding

def train_model(train_data, output_dir, n_iter=20):
    nlp = spacy.load("zh_core_web_sm")  # Chinese model
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe("ner")

    # Add labels from the annotations
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # Training configuration: train only the NER component
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for i in range(n_iter):
            losses = {}
            batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,
                    annotations,
                    drop=0.5,
                    sgd=optimizer,
                    losses=losses
                )
            print(f"Losses at iteration {i}: {losses}")

    nlp.to_disk(output_dir)
    print("Model saved!")

3.2.2 Keyword matching algorithm

# keyword_matcher.py
from spacy.matcher import Matcher

def create_matcher(nlp):
    matcher = Matcher(nlp.vocab)

    # Skill keyword patterns
    skill_patterns = [
        [{"ENT_TYPE": "SKILL"}, {"OP": "+", "ENT_TYPE": "SKILL"}],
        [{"ENT_TYPE": "SKILL"}]
    ]

    # Education background patterns
    edu_patterns = [
        [{"ENT_TYPE": "EDU_ORG"}, {"ENT_TYPE": "MAJOR"}],
        [{"ENT_TYPE": "GRAD_YEAR"}]
    ]

    matcher.add("SKILL_MATCH", None, *skill_patterns)  # spaCy 2.x add() signature
    matcher.add("EDU_MATCH", None, *edu_patterns)
    return matcher
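
The Flask layer in the next section also calls extract_skills and extract_education, which the article does not list. A minimal sketch of how they might consume the matcher output, assuming the SKILL_MATCH and EDU_MATCH keys registered above; they could sit next to create_matcher in keyword_matcher.py so the web layer can import them:

# keyword_matcher.py (continued) - illustrative helpers for the web layer
def extract_skills(doc, matches):
    skills = set()
    for match_id, start, end in matches:
        if doc.vocab.strings[match_id] == "SKILL_MATCH":
            skills.add(doc[start:end].text)
    return sorted(skills)

def extract_education(doc, matches):
    education = []
    for match_id, start, end in matches:
        if doc.vocab.strings[match_id] == "EDU_MATCH":
            education.append(doc[start:end].text)
    return education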

3.3 Web Service Layer (Flask)

# app.py
from flask import Flask, request, jsonify
import spacy
import tempfile

from pdf_parser import extract_text
# create_matcher comes from section 3.2.2; extract_skills / extract_education
# are the helpers sketched there as well
from keyword_matcher import create_matcher, extract_skills, extract_education

app = Flask(__name__)

# Load the trained model and build the matcher once at startup
nlp = spacy.load("trained_model")
matcher = create_matcher(nlp)

@app.route('/parse', methods=['POST'])
def parse_resume():
    if 'file' not in request.files:
        return jsonify({"error": "No file uploaded"}), 400

    file = request.files['file']
    if file.filename.rsplit('.', 1)[-1].lower() != 'pdf':
        return jsonify({"error": "Only PDF files allowed"}), 400

    # Save the upload to a temporary file
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=True) as tmp:
        file.save(tmp.name)

        # Parse the PDF
        text = extract_text(tmp.name)

        # NLP processing
        doc = nlp(text)
        matches = matcher(doc)

        # Result extraction
        results = {
            "name": get_name(doc.ents),
            "skills": extract_skills(doc, matches),
            "education": extract_education(doc, matches)
        }

    return jsonify(results)

def get_name(entities):
    for ent in entities:
        if ent.label_ == "NAME":
            return ent.text
    return "not recognized"

if __name__ == '__main__':
    app.run(debug=True)
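
Once the service is running, the endpoint can be exercised from the command line; the file name sample_resume.pdf is just a placeholder:

curl -X POST -F "file=@sample_resume.pdf" http://127.0.0.1:5000/parse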

4. System optimization and expansion

4.1 Performance optimization strategy

  1. Asynchronous processing: use Celery to offload time-consuming parsing tasks (see the sketch after this list);
  2. Caching: cache frequently requested parsing results in Redis;
  3. Model optimization: swap in spacy-transformers models when accuracy matters more than latency.
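
A minimal sketch of item 1, assuming a local Redis broker and Celery installed separately (neither is in the dependency list of section 5.1); the module and task names are illustrative:

# tasks.py - offload slow parsing work to a Celery worker
from celery import Celery
from pdf_parser import extract_text

celery_app = Celery("resume_parser",
                    broker="redis://localhost:6379/0",
                    backend="redis://localhost:6379/1")

@celery_app.task
def parse_resume_async(pdf_path):
    # Heavy PDF parsing runs in the worker; the web request only enqueues the job
    return extract_text(pdf_path)

The Flask view would then call parse_resume_async.delay(path) and return a task id instead of blocking on the parse.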

4.2 Function expansion direction

  1. Multilingual support: integrate multilingual models;
  2. Duplicate-resume detection: use the SimHash algorithm to flag near-duplicates (a sketch follows this list);
  3. Smart recommendation: match extracted skills against job requirements.
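
For item 2, a minimal SimHash sketch over the cleaned resume text; the 64-bit fingerprint size and the Hamming-distance threshold are illustrative choices, not values from the article:

# simhash_dedup.py - near-duplicate detection for resumes
import hashlib

def simhash(text, bits=64):
    # Each word votes on every bit of the fingerprint via its hash
    votes = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def is_duplicate(text_a, text_b, threshold=3):
    # Near-duplicate resumes differ in only a few fingerprint bits
    distance = bin(simhash(text_a) ^ simhash(text_b)).count("1")
    return distance <= threshold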

5. Complete code deployment guide

5.1 Environment preparation

# Create a virtual environment
 python -m venv venv
 source venv/bin/activate
 
 # Install dependencies
 pip install flask spacy pdfplumber
 python -m spacy download zh_core_web_sm

5.2 Operation process

  1. Prepare labeled data (at least 50 examples);
  2. Train the model: python train_ner.py output_model;
  3. Start the service: python app.py;
  4. Front-end call example:
<input type="file" accept=".pdf">
 <div >/div>
 
 <script>
 ('resumeUpload').addEventListener('change', function(e) {
   const file = [0];
   const formData = new FormData();
   ('file', file);
 
   fetch('/parse', {
     method: 'POST',
     body: formData
   })
   .then(response => ())
   .then(data => {
     const resultsDiv = ('results');
      = `
       <h3>Cannotate information:</h3>
       <p>Name: ${}</p>
       <p>Skill: ${(', ')}</p>
       <p>Educational background: ${}</p>
     `;
   });
 });
 </script>

6. Frequently Asked Questions

6.1 PDF parsing failed

  1. Check whether the file is a scanned copy (scans need OCR; a sketch follows the snippet below);
  2. Try a different extraction strategy:
# Use layout analysis
import pdfplumber

with pdfplumber.open(pdf_path) as pdf:
    page = pdf.pages[0]
    text = page.extract_text(layout=True)
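
For scanned PDFs (item 1), a hedged sketch of an OCR fallback; pytesseract plus a local Tesseract installation with the chi_sim language pack are assumed and are not part of the article's dependency list:

# ocr_fallback.py - OCR pages whose text layer is empty
import pdfplumber
import pytesseract  # requires a local Tesseract installation

def extract_text_with_ocr(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if not page_text:
                # Render the page to an image and OCR it
                image = page.to_image(resolution=300).original
                page_text = pytesseract.image_to_string(image, lang="chi_sim")
            text += (page_text or "") + "\n"
    return text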

6.2 Insufficient entity recognition accuracy

  1. Increase the amount of labeled data (at least 500 examples recommended);
  2. Use active learning to prioritize which samples to label;
  3. Try transfer learning:
# Fine-tune from a pretrained transformer pipeline
nlp = spacy.load("zh_core_web_trf")

7. Conclusion and Outlook

This tutorial builds a complete pipeline from PDF parsing to a web service. In a production environment, factors such as distributed processing, continuous model training, and security auditing should also be considered. As large language models mature, LLMs could be integrated to support more complex information inference, such as deriving a candidate capability map from project experience.

Through the practice of this project, developers can master:

  1. The entire process of NLP engineering;
  2. PDF parsing best practices;
  3. Web service API design;
  4. Model training and tuning methods.

It is recommended to start with simple scenarios, gradually iterate and optimize, and finally build an intelligent resume analysis system that meets business needs.