Skip to content

MykleR/minishell

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

94 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Note

The aim of the minishell 42 project is to create a lightweight command-line interpreter that reproduces the essential features of bash. What sets this implementation apart is its robust parsing system, completely decoupled from execution, built on LALR(1) grammar principles, producing a clean and efficient Abstract Syntax Tree (AST) for command execution. This project demonstrates advanced parsing techniques commonly studied in compiler construction theory and provides a basis for understanding how some modern shells interpret and execute commands.

✨ Features

  • 🧩 Tokenizer: Flexible and scalable lexical analyzer (lexer) that performs lexical analysis to convert raw input into a stream of classified tokens
  • 🔎 Grammar Parser: Deterministic bottom-up parser using Look-Ahead LR(1) techniques over a context-free grammar (CFG)
  • 🔃 AST Generation: Efficient Abstract Syntax Tree construction via syntax-directed translation—each grammar production rule triggers a corresponding semantic action that builds an AST node
  • 🔗 Efficient Builtins: Implementation of essential shell builtins (cd, echo, exit, etc.)
  • 🧹 Resource Caching: Cached file descriptors and memory allocations with automatic cleanup on program exit
  • Hashmap-powered Environment: Fast O(1) environment variables access
  • 📏 42 School Compliant: Follows 42 School norm and coding standards

🚀 Getting Started

Prerequisites

  • Clang compiler
  • GNU Make
  • readline library

Installation

# Clone the repository
git clone --recurse-submodules https://github.com/MykleR/minishell.git

# Enter the directory and compile project
cd minishell; make

# Run the shell
./minishell

Important

Don't forget --recurse-submodules otherwise dependencies will not be cloned

🔍 Technical Overview

What are LR Parsers?

An LR parser is a powerful tool used by interpreters and compilers to analyze the structure of code or commands. "LR" stands for "Left-to-right" reading of the input, building up the parse tree using "Rightmost derivation" in reverse. This type of parser is a bottom-up parser: it starts with the raw input (like shell commands), gradually groups terminal and non-terminal symbols to form higher-level structures through a process called a reduction, and ultimately recognizes valid syntax by deriving the start symbol.

The Grammar

The shell language is specified by a context-free grammar (CFG)—a formal system in the Chomsky hierarchy (Type-2) that defines all syntactically valid inputs through a set of production rules. Each production maps a single non-terminal symbol (left-hand side) to a sequence of terminals and/or non-terminals (right-hand side). The grammar serves as the foundation for both the parser and AST construction: each reduction of a production rule triggers a semantic action that creates the corresponding AST node—a technique known as Syntax-Directed Definition (SDD).

  • Left side: Non-terminal symbols (productions), used to represent syntactic categories—in our case, AST node types.
  • Right side: The body of the production—a sequence of terminals (tokens) and/or other non-terminals that must be matched for the rule to apply.
program -> list  
list -> list AND list  
list -> list OR list  
list -> list PIPE list  
list -> LPAREN list RPAREN  
list -> command  
redirection -> REDIR_IN arg  
redirection -> REDIR_OUT arg  
redirection -> REDIR_APP arg  
command -> arg  
command -> redirection  
command -> command arg  
command -> command redirection  
arg -> ARG

Note

In formal terms, this grammar is defined by the 4-tuple G = (V, Σ, R, S) where V is the set of non-terminal symbols {program, list, command, redirection, arg}, Σ is the set of terminal symbols {AND, OR, PIPE, LPAREN, RPAREN, REDIR_IN, REDIR_OUT, REDIR_APP, ARG, EOF}, R is the set of 14 production rules listed above, and S = program is the start symbol. The EOF terminal acts as the end-of-input marker used by the parser to signal successful acceptance. Because every production has a single non-terminal on the left-hand side, this grammar is context-free (Type-2 in the Chomsky hierarchy).

Action and Goto Tables

LR parsers rely on two precomputed tabular data structures that collectively encode the parser's finite automaton—specifically, the states and transitions of the deterministic pushdown automaton (DPDA) that recognizes the grammar's language:

  • Action Table: Indexed by the current parser state and the next lookahead terminal, this table dictates the parser's next move. The possible actions are:
    • Shift: Consume the current input token and push it onto the parser stack, transitioning to a new state. This corresponds to advancing the input cursor and accumulating symbols before a reduction can be applied.
    • Reduce: Pop a sequence of symbols from the stack that matches the right-hand side of a production rule, and replace them with the production's left-hand side non-terminal. This is the core mechanism of bottom-up parsing: partial results are combined into higher-level syntactic structures. Each reduction also invokes a semantic action to construct the corresponding AST node.
    • Accept: The input has been fully consumed and reduced to the start symbol. Parsing terminates successfully—the input conforms to the grammar.
    • Error: No valid action exists for the current state and lookahead. The input is syntactically invalid with respect to the grammar.
  • Goto Table: Indexed by a parser state and a non-terminal symbol, this table determines which state to transition to after a reduction replaces the right-hand side symbols with a non-terminal on the stack. It effectively encodes the non-terminal transitions of the underlying LR automaton.

Note

The parser uses a stack to keep track of symbols and parser states. As it shifts tokens and reduces groups of symbols, the stack helps the parser remember where it is and what structures have been recognized so far. Together, the Action and Goto tables fully characterize the LALR(1) automaton—they are the central data structures generated by parser construction algorithms such as those described in the Dragon Book (Compilers: Principles, Techniques, and Tools).

How It's Used in minishell

In our minishell project, the Action and Goto tables are precomputed and embedded directly into the parsing engine as hardcoded static lookup tables. The resulting parser shares the same table-driven LALR(1) architecture that tools like Yacc and Bison produce. When the user enters a command, the parser drives the LALR(1) automaton by consulting these tables at each step to determine whether to shift, reduce, accept, or signal a syntax error. Each reduce action invokes a production function (semantic action) that constructs the appropriate AST node, effectively implementing a Syntax-Directed Translation Scheme (SDTS). This table-driven, deterministic approach allows minishell to parse complex shell command syntax in linear time O(n) with respect to the number of input tokens.

Tip

You can generate tables, visualize parse trees and try other grammars with this online tool LALR(1) Parser Generator.

🔄 Processing Pipeline

flowchart LR
    A[Input] -->|Lexer|B(TOKENS)
    B -->|"special tokens" |G(HEREDOC)
    G -->|Parser |C{AST}
    C -->|Execution |C
Loading
  1. Input Capture:
    • GNU Readline for input with history support
  2. Tokenization (Lexical Analysis):
    • The lexer performs lexical analysis: the input string is scanned and decomposed into a stream of tokens (the smallest meaningful units, or lexemes) such as redirections >>, pipes |, or words
    • Each token is classified into a terminal symbol category based on its role in the shell language (e.g., T_PIPE, T_ARG, T_REDIR_OUT)
    • You will find an exhaustive list of all the token types in headers/lexer.h
    • Token enum values are significant: they serve as indices into the Action table, mapping each terminal symbol to its column in the parser's lookup structure
  3. Heredoc Processing:
    • Handles heredocs and converts them to input redirections < pointing to a temporary file
    • Forks the process to allow interactive readline input for the heredoc body
  4. AST Construction (Syntax-Directed Translation):
    • The LALR(1) parser consumes the token stream and drives a deterministic bottom-up parse
    • As grammar production rules are recognized (i.e., reduced), the associated semantic actions are invoked to construct the corresponding AST nodes—this is a Syntax-Directed Translation Scheme (SDTS) where the translation is driven by the syntactic structure itself
    • Nodes are connected to form a tree structure representing the command hierarchy: binary expression nodes for operators (PIPE, AND, OR), command nodes for simple commands, and redirection nodes for I/O redirections
    • The resulting AST is an Intermediate Representation (IR) that captures command relationships, execution order, and operator precedence
  5. Tree Evaluation (AST Execution):
    • The execution engine recursively traverses the AST, dispatching each node to its corresponding handler based on its type (command, redirection, pipe, logical operator, subshell)
    • For logical operators (AND, OR), evaluation follows short-circuit semantics: the right operand is evaluated only if the left operand's exit code requires it, mirroring standard shell behavior
    • For pipes, both sides are forked and executed concurrently, with the left child's stdout connected to the right child's stdin
    • Execution results (exit codes) propagate up the tree—analogous to synthesized attributes in attribute grammar theory—enabling parent nodes to determine control flow based on the success or failure of their children

📚 Further Reading

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •