LinkedIn Redit

❓ Project Description

A multi-threaded ReAct agent system designed to extract responsibilities, requirements, and skills from job advertisements using deep-learning agent principles and dynamic tool calling. Extractions are compared to a skill taxonomy (ESCO).

💡 Motivation

Extracting responsibilities, requirements, and skills from job advertisements can be fairly complicated due to the following challenges:

Job ads:

  1. Parsing Complexity:
    • Job ads vary widely in quality and structure.
    • No single parsing rule can reliably extract or organise the information from all ads.
  2. Job “Theme” Quality:
    • The richness and quality of a job ad depend on its domain.
    • For example, Trade & Services and Self-Employment ads often contain significantly less context than ads from other themes.
  3. Company “Lingo”:
    • A company may use its own cultural lingo in its job ads.
  4. Skills, Responsibilities, and Requirements:
    • Depending on the theme quality, a job ad may not contain all of the fields required for mapping to the taxonomy.
  5. Regional & Cultural Variation:
    • Job ad terminology and expectations can differ across regions and cultures.

Skills:

  1. Domain and jargon drift:
    • Skill definitions evolve rapidly across domains.
    • Example: Context Engineering is emerging with entirely new skill sets.
  2. Skill blurring:
    • A skill can represent both hard and soft skills, depending on the context.
    • Example: “Communication” may refer to a technical process (hard) or interpersonal ability (soft).
  3. Overlapping skills:
    • Similar skills may have overlapping meanings.
    • Example: “Charting” vs “Visualising” may describe the same capability in different contexts.

Skill Taxonomy:

  • Description columns define either an occupation or a skill.
  • Descriptions don’t always align with actual requirements or responsibilities, which can shift with context.
    • Example: The occupation Data Analyst can describe the responsibility to “Create visual reports to communicate insights” ➡️ What someone does
    • Meanwhile, the skill Data Visualisation can be described as “Designing visuals using tools” ➡️ What someone must know
  • Occupation Blurring: The same occupation may be interpreted differently by companies and taxonomies, leading to contextual mismatches.

🤖 ReAct Agent: Why?

  1. ReAct agents fully exhaust a task before progressing to the next, ensuring thorough and coherent reasoning.
  2. Deep Learning ReAct agents implement the following principles:
    • 🗃️ Planning: For example, creating a TODO list to plan and restate the objective, and to record the reason why the agent decided to tick off a task.
    • 📝 Offload context: Capturing notes to assist the agent in accomplishing a task.
    • 🤝 Task delegation: Can delegate tasks to sub-agents that specialise in the task at hand.
    • 💬 Careful & Extensive Prompt Engineering: Building prompts that describe, constrain, and clarify subsequent processes to the agent as a “system prompt”.

Due to their capacity to reason and act iteratively, powered by a strong LLM foundation, ReAct agents excel at interpreting nuanced, context-rich data. In job advertisement analysis, where skills, responsibilities, and requirements are often vague or context-dependent, ReAct agents address these challenges by:

  • Actively reasoning through the context of the job ad and the implied meaning.
  • Decomposing ambiguous statements/job-ad sections into clear elements.

🧰 Tools included

| # | Name | Purpose |
|---|------|---------|
| 1 | update_content | A tool to extract raw context from a job advertisement. |
| 2 | write_todo | A tool to write concise TASKS to inform and track the agent's progress. |
| 3 | read_todos | A tool to read the TODOs and remind the ReAct agent of the plan. |
| 4 | extract_soft_skills | A tool to extract soft-skill entities from a job advertisement. |
| 5 | extract_hard_skills | A tool to extract hard-skill entities from a job advertisement. |
| 6 | check_for_bothskills | A tool to validate and resolve skills that could be categorised as both 'hard' and 'soft'. |
| 7 | extract_responsibilities | A tool to extract responsibilities from a job advertisement. |
| 8 | extract_requirements | A tool to extract requirements from a job advertisement. |
| 9 | evaluate_correctness | A tool to perform G-Eval for Correctness. |
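
Below is a minimal sketch, not the repository's actual code, of how a tool such as write_todo can offload context into the agent state via InjectedToolCallId / InjectedState, and how the tools above could be assembled into a ReAct agent behind a system prompt. The tool body, system prompt, custom state schema, and model id are illustrative assumptions; parameter names follow recent langgraph releases.

```python
from typing import Annotated

from langchain_core.messages import ToolMessage
from langchain_core.tools import InjectedToolCallId, tool
from langgraph.prebuilt import InjectedState, create_react_agent
from langgraph.prebuilt.chat_agent_executor import AgentState
from langgraph.types import Command


class ExtractionState(AgentState):
    """Agent state extended with a scratchpad for offloaded context (assumed key name)."""
    todos: list[str]


@tool
def write_todo(
    todos: list[str],
    tool_call_id: Annotated[str, InjectedToolCallId],
    state: Annotated[dict, InjectedState],
) -> Command:
    """Write concise TASKS to inform and track progress."""
    # Offload the plan into graph state instead of the message history.
    updated = state.get("todos", []) + todos
    return Command(update={
        "todos": updated,
        "messages": [ToolMessage(f"Recorded {len(todos)} TODO(s).", tool_call_id=tool_call_id)],
    })


SYSTEM_PROMPT = (
    "You are a job-ad extraction agent. Plan your work with write_todo, "
    "re-read the plan with read_todos, and fully exhaust each task before moving on."
)

# The remaining tools (read_todos, extract_hard_skills, ...) would be defined the same way.
agent = create_react_agent(
    model="openai:gpt-4o-mini",      # assumed model id
    tools=[write_todo],
    prompt=SYSTEM_PROMPT,
    state_schema=ExtractionState,
)

# recursion_limit bounds the reason/act loop; remaining_steps is tracked in the agent state.
result = agent.invoke(
    {"messages": [{"role": "user", "content": "Extract skills from this job ad: ..."}]},
    config={"recursion_limit": 50},
)
```

The extraction tools could follow the same Command pattern, writing their results into dedicated state keys so the message history stays compact.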

🔑 Key Concepts Implemented

  1. Entity extraction with LangExtract
    • A Google tool optimised for long-document entity extraction.
    • Extracts hard skills, soft skills, years of experience, and contact persons with high recall by using chunking, parallel processing, and multi-pass strategies (a usage sketch follows this list).
  2. Context offloading with InjectedToolCallId and InjectedState (see the write_todo sketch above).
  3. Prompt-based extraction tools
  4. Custom G-Eval functionality for Correctness using Jinja2 (a template sketch follows this list).
  5. LLM hyperparameter optimisation: Optimising recursion_limit & remaining_steps
  6. Semantic evaluation - Evaluating and comparing ReAct extractions against the skill taxonomy.
    • Embedding generation: TechWolf/JobBERT-v2 sentence-transformer model.
    • Comparison methodology: Cosine Similarity
    • Leveraging Datasets.map() utility
  7. Deep-agent principles: 🗃️ Planning, 📝 Offload context, 💬 Careful & Extensive Prompt Engineering
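
As a usage sketch for item 1: the snippet below loosely follows LangExtract's documented quick-start, but the sample ad, few-shot example, extraction classes, model id, and pass/worker counts are assumptions rather than the repository's actual configuration (LangExtract expects a Gemini API key, e.g. via the LANGEXTRACT_API_KEY environment variable).

```python
import langextract as lx

job_ad_text = (
    "We are hiring a Data Analyst with 3+ years of experience, strong Python and "
    "Tableau skills, and excellent stakeholder communication. Contact: Jane Doe."
)

# A few-shot example steers the extraction classes (hard/soft skills, experience, contacts).
examples = [
    lx.data.ExampleData(
        text="Senior engineer, 5 years of experience, expert in SQL, great teamwork. Contact John Smith.",
        extractions=[
            lx.data.Extraction(extraction_class="hard_skill", extraction_text="SQL"),
            lx.data.Extraction(extraction_class="soft_skill", extraction_text="teamwork"),
            lx.data.Extraction(extraction_class="years_of_experience", extraction_text="5 years"),
            lx.data.Extraction(extraction_class="contact_person", extraction_text="John Smith"),
        ],
    )
]

result = lx.extract(
    text_or_documents=job_ad_text,
    prompt_description="Extract hard skills, soft skills, years of experience, and contact persons.",
    examples=examples,
    model_id="gemini-2.5-flash",  # assumed model
    extraction_passes=2,          # multi-pass improves recall on long ads
    max_workers=8,                # parallel chunk processing
)

for extraction in result.extractions:
    print(extraction.extraction_class, "->", extraction.extraction_text)
```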
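
And as the template sketch for item 4: a hypothetical Jinja2-rendered Correctness prompt in the spirit of G-Eval. The wording and evaluation steps below are illustrative; the repository's real prompts presumably live in src/agent/prompts.py.

```python
from jinja2 import Template

# Hypothetical G-Eval-style Correctness prompt; steps and wording are illustrative only.
CORRECTNESS_TEMPLATE = Template(
    "You are grading the correctness of skills extracted from a job advertisement.\n\n"
    "Evaluation steps:\n"
    "{% for step in steps %}{{ loop.index }}. {{ step }}\n{% endfor %}\n"
    "Job advertisement:\n{{ job_ad }}\n\n"
    "Extracted skills: {{ extracted | join(', ') }}\n\n"
    "Return a single correctness score between 0 and 1."
)

prompt = CORRECTNESS_TEMPLATE.render(
    steps=[
        "Check that every extracted skill appears explicitly or implicitly in the ad.",
        "Penalise skills that are invented or taken out of context.",
        "Reward complete coverage of the skills the ad actually mentions.",
    ],
    job_ad="We need a data analyst who builds dashboards and presents findings to stakeholders.",
    extracted=["dashboarding", "stakeholder communication"],
)
print(prompt)  # passed to the judge LLM that produces the Correctness score
```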

↔️ Semantic Evaluation: Approach

  • Taxonomy Skill Concatenation:
    • Purpose: Combine all possible labels for occupations and skills — including both preferred labels and alternative labels.
    • Reason: Ensures that the evaluation considers all available labels in the taxonomy, capturing variations and synonyms in skill or occupation naming.
  • ReAct – Threshold Alpha on “Correctness”:
    • Correctness (G-Eval measure): Evaluates how well the agent’s output aligns with the expected results.
    • In Context: Measures whether skills extracted from job ads are accurate and contextually correct.
      • High Alignment Range: 0.8–1.0 (as per the G-Eval definition).
  • Filtering: Only extracted results with Correctness ≥ 0.8 are considered for evaluation to ensure the taxonomy mapping reflects only valid extractions.
  • Cosine Similarity:
    • Computes the similarity between extracted results and taxonomy embeddings.
    • Maps each extracted skill or label to the K closest items in the embedding space.
    • Ensures a semantic alignment between extracted results and the taxonomy, even if exact wording differs.
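
A rough sketch of this pipeline with sentence-transformers is shown below; the sample extractions, taxonomy labels, alpha threshold, and K are illustrative assumptions.

```python
import torch
from sentence_transformers import SentenceTransformer, util

# Illustrative inputs: ReAct extractions with their G-Eval Correctness scores,
# and concatenated ESCO labels (preferred + alternative).
extractions = [
    {"skill": "data visualisation", "correctness": 0.92},
    {"skill": "stakeholder communication", "correctness": 0.74},  # dropped by the alpha filter
]
taxonomy_labels = ["data visualisation", "create visual reports", "communicate with stakeholders"]

ALPHA, K = 0.8, 2
kept = [e["skill"] for e in extractions if e["correctness"] >= ALPHA]

model = SentenceTransformer("TechWolf/JobBERT-v2")
skill_emb = model.encode(kept, convert_to_tensor=True, normalize_embeddings=True)
label_emb = model.encode(taxonomy_labels, convert_to_tensor=True, normalize_embeddings=True)
# In the notebooks, the taxonomy embedding step is batched with datasets.Dataset.map().

# Cosine similarity between every kept extraction and every taxonomy label,
# then keep the K closest labels per extraction.
scores = util.cos_sim(skill_emb, label_emb)
top = torch.topk(scores, k=min(K, len(taxonomy_labels)), dim=1)
for skill, idxs, sims in zip(kept, top.indices, top.values):
    matches = [(taxonomy_labels[i], round(float(s), 3)) for i, s in zip(idxs, sims)]
    print(skill, "->", matches)
```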

𐄷 Project Notebooks

| # | Name | Technique |
|---|------|-----------|
| 1 | ads.ipynb | Data preprocessing and traditional ML techniques |
| 2 | ReAct.ipynb | Schemas, tools, prompts, and agent definition |
| 3 | semantical_eval.ipynb | Semantic evaluation: skill taxonomy <> ReAct extractions |

🪾 Project Structure

.
├── LICENSE
├── notebooks
│   ├── ads.ipynb
│   ├── data
│   │   ├── ads_preprocessed.csv
│   │   ├── ads-50k.json
│   │   ├── occupations_en.csv
│   │   ├── sample_results.csv
│   │   └── skills_en.csv
│   ├── output_images
│   │   ├── geval_class.png
│   │   ├── geval_correctness.png
│   │   ├── langextract_json.png
│   │   └── output.png
│   ├── ReAct.ipynb
│   ├── semantical_eval.ipynb
│   ├── softskills
│   │   ├── softskills.html
│   │   └── softskills.jsonl
│   └── utils
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-312.pyc
│       │   └── print_utils.cpython-312.pyc
│       └── print_utils.py
├── pyproject.toml
├── README.md
├── src
│   ├── agent
│   │   ├── __init__.py
│   │   ├── evaluation.py
│   │   ├── preprocess_utils.py
│   │   ├── prompts.py
│   │   ├── req_and_res.py
│   │   ├── skill_utils.py
│   │   ├── state.py
│   │   ├── studio
│   │   │   ├── langgraph.json
│   │   │   ├── react.py
│   │   │   └── requirements.txt
│   │   └── todo_utils.py
│   └── data
│       ├── hard_skills.json
│       └── soft_skills.json
├── .gitignore
└── uv.lock
