A multi-threaded ReAct agent system designed to extract responsibilities, requirements, and skills from job advertisements using deep-learning agent principles and dynamic tool calling. Extractions are compared to a skill taxonomy (ESCO).
Extracting responsibilities, requirements, and skills from job advertisements is complicated by the following challenges:
- Parsing Complexity:
- Job ads vary widely in quality and structure.
- No single parsing rule can reliably extract or organise the information from all ads.
- Job “Theme” Quality:
- The richness and quality of a job ad depend on its domain.
- For example, Trade & Services and Self-Employment ads can possess significantly less context when compared with other themes.
- Company “Lingo”:
- A company can use their own cultural lingo in the job ad.
- Skills, Responsibilities, and Requirements:
- Job ads may not contain all of the fields required for mapping, depending on the theme quality.
- Regional & Cultural Variability:
- Job ad terminology and expectations differ across regions and cultures.
- Domain and jargon drift:
- Skill definitions evolve rapidly across domains.
- Example: Context Engineering is emerging with entirely new skill sets.
- Skill blurring:
- A skill can represent both hard and soft skills, depending on the context.
- Example: “Communication” may refer to a technical process (hard) or interpersonal ability (soft).
- Overlapping skills:
- Similar skills may have overlapping meanings.
- Example: “Charting” vs “Visualising” may describe the same capability in different contexts.
- Description columns define either an occupation or a skill.
- Descriptions don’t always align with actual requirements or responsibilities, which can shift with context.
- Example: The occupation Data Analyst can describe the responsibility to “Create visual reports to communicate insights” ➡️ What someone does
- Meanwhile, the skill Data Visualisation can be described as ”Designing visuals using tools” ➡️ What someone must know
- Occupation Blurring: The same occupation may be interpreted differently by companies and taxonomies, leading to contextual mismatches.
- ReAct agents fully exhaust a task before progressing to the next, ensuring thorough and coherent reasoning.
- Deep Learning ReAct agents implement the following principles:
- 🗃️ Planning: For example, creating a TODO list to plan and restate the objective (and to record why the agent decided to "tick off a task").
- 📝 Offload context: Capturing notes to assist the agent in accomplishing a task.
- 🤝 Task delegation: Can delegate tasks to sub-agents that specialise in the task at hand.
- 💬 Use Careful & Extensive Prompt Engineering: Building prompts to describe, constrain, and clarify subsequent processes to the Agent as a “system prompt” (see example here).
Due to their capacity to reason and act iteratively, powered by a strong LLM foundation, ReAct agents excel in interpreting nuanced, context-rich data. In job advertisement analysis, where skills, responsibilities, and requirements are often vague or context-dependent, ReAct agents overcome this by:
- Actively reasoning through the context of the job ad and the implied meaning.
- Decomposing ambiguous statements/job-ad sections into clear elements.
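The reason-act-observe cycle described above can be sketched as a plain loop. This is a minimal illustration, not the project's actual implementation: the function names (`react_loop`, `choose_tool`) and the string-based context are all hypothetical stand-ins for the LangGraph machinery the project actually uses.

```python
def react_loop(job_ad: str, tools: dict, choose_tool, max_steps: int = 10):
    """Minimal ReAct sketch: the agent alternates between reasoning
    (choosing a tool) and acting (calling it), feeding each observation
    back into the context until it decides it is done."""
    context = [job_ad]
    for _ in range(max_steps):
        tool_name, tool_input = choose_tool(context)      # "reason" step
        if tool_name is None:                             # agent signals completion
            break
        observation = tools[tool_name](tool_input)        # "act" step
        context.append(f"{tool_name} -> {observation}")   # "observe" step
    return context
```

Because each observation is appended before the next reasoning step, a task is fully exhausted (its result is visible in context) before the agent moves on.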
| # | Name | Purpose |
|---|---|---|
| 1 | update_content | A tool to extract raw context from a job advertisement. |
| 2 | write_todo | A tool to write concise TASKS to inform and track your progress. |
| 3 | read_todos | A tool to read the TODOs to remind the ReAct agent of the plan. |
| 4 | extract_soft_skills | A tool to extract soft skill entities from a job advertisement. |
| 5 | extract_hard_skills | A tool to extract hard skill entities from a job advertisement. |
| 6 | check_for_bothskills | A tool to validate skills previously categorised as 'hard' or 'soft' and resolve those that overlap both categories. |
| 7 | extract_responsibilities | A tool to extract responsibilities from a job advertisement. |
| 8 | extract_requirements | A tool to extract requirements from a job advertisement. |
| 9 | evaluate_correctness | A tool to perform G-Eval for Correctness. |
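To make the planning tools (nos. 2-3) concrete, here is a hedged, stand-alone sketch of `write_todo` and `read_todos` operating on a shared state dict. The real project wires state through LangGraph's injected-state mechanism; the signatures and state layout below are illustrative assumptions.

```python
def write_todo(state: dict, task: str, done: bool = False) -> str:
    """Record a concise task in shared agent state (illustrative layout)."""
    state.setdefault("todos", []).append({"task": task, "done": done})
    return f"Recorded task: {task}"

def read_todos(state: dict) -> str:
    """Replay the TODO list so the agent can re-anchor on its plan."""
    todos = state.get("todos", [])
    if not todos:
        return "No TODOs recorded yet."
    return "\n".join(f"[{'x' if t['done'] else ' '}] {t['task']}" for t in todos)
```

Keeping the plan outside the prompt window like this is what lets the agent "offload context" and still recall its objective many steps later.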
- Entity extractions with LangExtract
- A Google tool optimised for long-document entity extraction.
- Extracts hard skills, soft skills, years of experience, and contact persons with high recall by using chunking, parallel processing, and multi-pass strategies.
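The chunking, parallel-processing, and multi-pass strategy mentioned above can be illustrated with a toy implementation. This is not LangExtract's API, just a sketch of the recall-oriented idea: split the document into overlapping chunks, extract from each in parallel, and union results across passes.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(text: str, size: int = 400, overlap: int = 50):
    """Split a long job ad into overlapping character chunks so that
    entities straddling a boundary are still seen whole in one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def multi_pass_extract(text: str, extractor, passes: int = 2) -> set:
    """Run the extractor over every chunk in parallel, several times,
    and union the results -- extra passes trade cost for recall."""
    found = set()
    with ThreadPoolExecutor() as pool:
        for _ in range(passes):
            for entities in pool.map(extractor, chunk(text)):
                found.update(entities)
    return found
```

Overlapping chunks plus a union over passes is why this style of extraction favours recall: an entity only needs to be caught once, in any chunk, on any pass.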
- Context offloading with `InjectedToolCallId` and `InjectedState`.
- Prompt-based extraction tools.
- Custom G-Eval functionality for Correctness using `Jinja2`.
- LLM hyperparameter optimisation: optimising `recursion_limit` & `remaining_steps`.
- Semantic evaluation: evaluating and comparing results between ReAct extractions and the skill taxonomy.
  - Embedding generation: `TechWolf/JobBERT-v2` sentence-transformer model.
  - Comparison methodology: cosine similarity.
  - Leveraging the `Datasets.map()` utility.
- Deep agents principles: 🗃️ Planning, 📝 Offload context, 💬 Careful & Extensive Prompt Engineering
- Taxonomy Skill Concatenation:
- Purpose: Combine all possible labels for occupations and skills — including both preferred labels and alternative labels.
- Reason: Ensures that the evaluation considers all available labels in the taxonomy, capturing variations and synonyms in skill or occupation naming.
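The concatenation step can be sketched as follows, assuming ESCO-style rows with one preferred label and a newline-separated alternative-label field (the column names `preferredLabel` and `altLabels` are assumptions about the CSV layout, not verified against the project's code).

```python
def collect_labels(rows: list[dict]) -> list[str]:
    """Flatten preferred + alternative labels into one candidate list,
    so taxonomy matching sees every synonym for a skill or occupation."""
    labels = []
    for row in rows:
        labels.append(row["preferredLabel"])
        alt = row.get("altLabels") or ""
        labels.extend(a.strip() for a in alt.split("\n") if a.strip())
    return labels
```

Matching against the full synonym list is what lets "charting" and "visualising" both resolve to the same taxonomy skill.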
- ReAct – Threshold Alpha on “Correctness”:
- Correctness (GEval measure): Evaluates how well the agent’s output aligns with the expected results.
- In Context: Measures whether skills extracted from job ads are accurate and contextually correct.
- High Alignment Range: 0.8–1.0 (as per GEval definition).
- Filtering: Only extracted results with Correctness ≥ 0.8 are considered for evaluation to ensure the taxonomy mapping reflects only valid extractions.
- Cosine Similarity:
- Computes the similarity between extracted results and taxonomy embeddings.
- Maps each extracted skill or label to the K closest items in the embedding space.
- Ensures a semantic alignment between extracted results and the taxonomy, even if exact wording differs.
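The K-nearest mapping above reduces to cosine similarity over embedding vectors. The project encodes text with the `TechWolf/JobBERT-v2` sentence-transformer; the sketch below skips the model and uses toy vectors to show only the similarity and top-K logic (function names are illustrative).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, taxonomy: dict, k: int = 3):
    """Map one extracted-skill embedding to its K nearest taxonomy labels."""
    ranked = sorted(taxonomy.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [label for label, _ in ranked[:k]]
```

Because similarity is computed in embedding space, an extracted phrase maps to the right taxonomy label even when the exact wording differs.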
.
├── LICENSE
├── notebooks
│ ├── ads.ipynb
│ ├── data
│ │ ├── ads_preprocessed.csv
│ │ ├── ads-50k.json
│ │ ├── occupations_en.csv
│ │ ├── sample_results.csv
│ │ └── skills_en.csv
│ ├── output_images
│ │ ├── geval_class.png
│ │ ├── geval_correctness.png
│ │ ├── langextract_json.png
│ │ └── output.png
│ ├── ReAct.ipynb
│ ├── semantical_eval.ipynb
│ ├── softskills
│ │ ├── softskills.html
│ │ └── softskills.jsonl
│ └── utils
│ ├── __init__.py
│ ├── __pycache__
│ │ ├── __init__.cpython-312.pyc
│ │ └── print_utils.cpython-312.pyc
│ └── print_utils.py
├── pyproject.toml
├── README.md
├── src
│ ├── agent
│ │ ├── __init__.py
│ │ ├── evaluation.py
│ │ ├── preprocess_utils.py
│ │ ├── prompts.py
│ │ ├── req_and_res.py
│ │ ├── skill_utils.py
│ │ ├── state.py
│ │ ├── studio
│ │ │ ├── langgraph.json
│ │ │ ├── react.py
│ │ │ └── requirements.txt
│ │ └── todo_utils.py
│ └── data
│ ├── hard_skills.json
│ └── soft_skills.json
├── .gitignore
└── uv.lock