1. Project background and technical selection
In human resources, teams that screen hundreds of resumes a day face real challenges: manual screening is slow, key information is easily missed, and comparing candidates across documents is hard. This tutorial builds an end-to-end intelligent resume analysis system that automatically extracts candidates' core information with NLP and exposes the results through a web service for visual display.
Technology stack analysis
Component | Role | Alternatives |
---|---|---|
PDFPlumber | PDF text extraction | PyPDF2, camelot |
spaCy | Entity recognition and NLP processing | NLTK, Transformers |
Flask | Web service framework | FastAPI, Django |
Plain HTML/JavaScript | Front-end display (optional) | React, Angular |
2. System architecture design
The system is a three-layer pipeline: a PDF parsing layer (PDFPlumber) extracts and cleans raw text, an NLP processing layer (spaCy) recognizes entities and matches keywords, and a web service layer (Flask) exposes the pipeline as an HTTP API, optionally fronted by a simple upload page.
3. Detailed explanation of the implementation of core modules
3.1 PDF parsing layer (PDFPlumber)
# pdf_parser.py
import re

import pdfplumber

def extract_text(pdf_path):
    """Extract raw text from every page of a PDF."""
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:  # extract_text() returns None for pages with no text
                text += page_text + "\n"
    return clean_text(text)

def clean_text(raw_text):
    """Remove control characters and collapse extra whitespace."""
    text = re.sub(r'[\x00-\x1F]+', ' ', raw_text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text
Advanced processing tips:
- Scanned PDFs: integrate Tesseract OCR (see section 6.1);
- Table data extraction: use the extract_tables() method;
- Layout analysis: get text coordinates from the chars objects (both sketched below).
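A minimal sketch of the last two tips, using pdfplumber's extract_tables() and chars APIs (the file name is a placeholder):
# table_layout_demo.py
import pdfplumber

with pdfplumber.open("resume.pdf") as pdf:
    page = pdf.pages[0]
    # Tables come back as lists of rows, each row a list of cell strings
    for table in page.extract_tables():
        for row in table:
            print(row)
    # Each entry in page.chars is a dict holding the character text and its
    # coordinates (x0, x1, top, bottom), useful for layout analysis
    for char in page.chars[:10]:
        print(char["text"], char["x0"], char["top"])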
3.2 NLP processing layer (spaCy)
3.2.1 Custom entity recognition model training
1. Prepare labeled data (JSON format example):
[
    [
        "Zhang San graduated from Peking University's Computer Science and Technology in 2018",
        {
            "entities": [
                [0, 9, "NAME"],
                [25, 42, "EDU_ORG"],
                [45, 76, "MAJOR"],
                [80, 84, "GRAD_YEAR"]
            ]
        }
    ]
]
2. Training process code:
# train_ner.py
import spacy
from spacy.util import minibatch, compounding

def train_model(train_data, output_dir, n_iter=20):
    """Fine-tune the NER component (spaCy v2 training API)."""
    nlp = spacy.load("zh_core_web_sm")  # Chinese model
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe("ner")
    # Register the entity labels found in the training data
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])
    # Train only the NER component
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for i in range(n_iter):
            losses = {}
            batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,
                    annotations,
                    drop=0.5,
                    sgd=optimizer,
                    losses=losses
                )
            print(f"Losses at iteration {i}: {losses}")
    nlp.to_disk(output_dir)
    print("Model saved!")
3.2.2 Keyword matching algorithm
# keyword_matcher.py
from spacy.matcher import Matcher

def create_matcher(nlp):
    matcher = Matcher(nlp.vocab)
    # Skill keyword patterns: a run of consecutive SKILL tokens, or a single one
    skill_patterns = [
        [{"ENT_TYPE": "SKILL"}, {"ENT_TYPE": "SKILL", "OP": "+"}],
        [{"ENT_TYPE": "SKILL"}]
    ]
    # Educational background patterns
    edu_patterns = [
        [{"ENT_TYPE": "EDU_ORG"}, {"ENT_TYPE": "MAJOR"}],
        [{"ENT_TYPE": "GRAD_YEAR"}]
    ]
    matcher.add("SKILL_MATCH", None, *skill_patterns)  # spaCy v2 signature
    matcher.add("EDU_MATCH", None, *edu_patterns)
    return matcher
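A quick way to inspect the matcher output; each match is a (match_id, start, end) triple over the doc's tokens. Note the SKILL patterns only fire if the NER model was trained with a SKILL label (the sample text is a placeholder):
import spacy

from keyword_matcher import create_matcher

nlp = spacy.load("trained_model")  # the model trained in 3.2.1
matcher = create_matcher(nlp)

doc = nlp("Zhang San graduated from Peking University's Computer Science and Technology in 2018")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)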
3.3 Web Service Layer (Flask)
# app.py
import os
import tempfile

import spacy
from flask import Flask, request, jsonify

from pdf_parser import extract_text
from keyword_matcher import create_matcher

app = Flask(__name__)

# Load the trained model and matcher once at startup
nlp = spacy.load("trained_model")
matcher = create_matcher(nlp)

@app.route('/parse', methods=['POST'])
def parse_resume():
    if 'file' not in request.files:
        return jsonify({"error": "No file uploaded"}), 400
    file = request.files['file']
    if file.filename.rsplit('.', 1)[-1].lower() != 'pdf':
        return jsonify({"error": "Only PDF files allowed"}), 400
    # Save to a temporary file (delete=False so it can be reopened on Windows)
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        file.save(tmp.name)
        tmp_path = tmp.name
    try:
        # Parse PDF
        text = extract_text(tmp_path)
    finally:
        os.unlink(tmp_path)
    # NLP processing
    doc = nlp(text)
    matches = matcher(doc)
    # Result extraction (extract_skills / extract_education: see the sketch
    # after this listing)
    results = {
        "name": get_name(doc.ents),
        "skills": extract_skills(doc, matches),
        "education": extract_education(doc, matches)
    }
    return jsonify(results)

def get_name(entities):
    for ent in entities:
        if ent.label_ == "NAME":
            return ent.text
    return "not recognized"

if __name__ == '__main__':
    app.run(debug=True)
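The route above calls extract_skills and extract_education, which the tutorial does not define; one possible implementation, reading the matched spans back out of the Matcher output:
# Hypothetical helpers for the /parse route; each match from the Matcher is a
# (match_id, start, end) triple over the doc's tokens
def extract_skills(doc, matches):
    skills = set()
    for match_id, start, end in matches:
        if doc.vocab.strings[match_id] == "SKILL_MATCH":
            skills.add(doc[start:end].text)
    return sorted(skills)

def extract_education(doc, matches):
    return [doc[start:end].text
            for match_id, start, end in matches
            if doc.vocab.strings[match_id] == "EDU_MATCH"]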
4. System optimization and expansion
4.1 Performance optimization strategy
- Asynchronous processing: use Celery to offload time-consuming parsing tasks;
- Caching: cache parsing results for frequently seen files in Redis (see the sketch below);
- Model optimization: swap in a transformer-backed pipeline via spacy-transformers where accuracy matters, or keep the smaller model where latency matters.
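A minimal sketch of the caching idea, assuming a local Redis instance and the redis-py client; results are keyed by a hash of the PDF bytes, and parse_fn stands in for the full parsing pipeline:
# cache_demo.py
import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379, db=0)

def parse_with_cache(pdf_bytes, parse_fn, ttl=3600):
    # Key results by the SHA-256 of the file contents
    key = "resume:" + hashlib.sha256(pdf_bytes).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    result = parse_fn(pdf_bytes)  # the expensive PDF + NLP pipeline
    cache.setex(key, ttl, json.dumps(result))
    return result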
4.2 Function expansion direction
- Multilingual support: integrate multilingual spaCy models;
- Duplicate detection: implement the SimHash algorithm to flag near-identical resumes (see the sketch below);
- Smart recommendation: match extracted skills against job requirements.
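A minimal 64-bit SimHash sketch for the duplicate-detection direction; whitespace tokenization is a simplification (a Chinese tokenizer would be needed in practice):
# simhash_demo.py
import hashlib

def simhash(text, bits=64):
    # Weighted bit-voting over the hashes of all tokens
    weights = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if weights[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Two resumes are likely near-duplicates when the distance is small (e.g. <= 3)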
5. Complete code deployment guide
5.1 Environment preparation
# Create a virtual environment
python -m venv venv
source venv/bin/activate
# Installation dependencies
pip install flask spacy pdfplumber
python -m spacy download zh_core_web_sm
5.2 Operation process
1. Prepare labeled data (at least 50 examples);
2. Train the model: python run_training.py (the driver shown in section 3.2.1);
3. Start the service: python app.py;
4. Front-end call example:
<input type="file" accept=".pdf">
<div >/div>
<script>
('resumeUpload').addEventListener('change', function(e) {
const file = [0];
const formData = new FormData();
('file', file);
fetch('/parse', {
method: 'POST',
body: formData
})
.then(response => ())
.then(data => {
const resultsDiv = ('results');
= `
<h3>Cannotate information:</h3>
<p>Name: ${}</p>
<p>Skill: ${(', ')}</p>
<p>Educational background: ${}</p>
`;
});
});
</script>
6. Frequently Asked Questions
6.1 PDF parsing failed
- Check whether the file is a scanned copy (requires OCR, see the sketch below);
- Try pdfplumber's layout-aware extraction before switching engines:

# Use layout analysis
import pdfplumber

with pdfplumber.open(pdf_path) as pdf:
    page = pdf.pages[0]
    text = page.extract_text(layout=True)
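For scanned copies, one possible OCR fallback, assuming Tesseract and the pytesseract package are installed (plus a Chinese language pack such as chi_sim):
# ocr_fallback.py
import pdfplumber
import pytesseract

def extract_text_ocr(pdf_path, lang="chi_sim"):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Rasterize the page and run Tesseract on the resulting image
            image = page.to_image(resolution=300).original
            text += pytesseract.image_to_string(image, lang=lang) + "\n"
    return text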
6.2 Insufficient entity recognition accuracy
- Increase the amount of labeled data (at least 500 examples recommended);
- Use active learning to prioritize which examples to label;
- Try transfer learning:

# Fine-tune from a pre-trained transformer model
nlp = spacy.load("zh_core_web_trf")
7. Conclusion and Outlook
This tutorial builds a complete pipeline from PDF parsing to web services. In a real production environment, concerns such as distributed processing, continuous model training, and security auditing should also be addressed. As large language models mature, LLMs can be integrated to support more complex inference, such as deriving a candidate capability map from project experience.
Through this project, developers can master:
- the end-to-end NLP engineering workflow;
- PDF parsing best practices;
- web service API design;
- model training and tuning methods.
It is recommended to start with simple scenarios, gradually iterate and optimize, and finally build an intelligent resume analysis system that meets business needs.