SARSA (State-Action-Reward-State-Action) in Reinforcement Learning

Last Updated : 26 Feb, 2025
SARSA (State-Action-Reward-State-Action) is an on-policy learning algorithm for estimating action values in reinforcement learning. It helps an agent learn a good policy from experience: the agent improves its policy while continuously interacting with the environment.

The name SARSA comes from the five elements used in every update:

  1. State (S): The current state of the environment.
  2. Action (A): The action taken by the agent in a given state.
  3. Reward (R): The immediate reward received after taking action A in state S.
  4. Next State (S’): The state resulting from the action A in state S.
  5. Next Action (A’): The action chosen in the next state S’ according to the current policy.

The SARSA algorithm updates the Q-value for a given state-action pair based on the reward and the next state-action pair, leading to an updated policy.

SARSA is on-policy, meaning it learns the value of the policy the agent actually follows (including its exploratory actions), rather than evaluating a different policy from the one used to generate behavior.

SARSA Update Rule

The core idea of SARSA is to update the Q-value for each state-action pair based on actual experience (i.e., what the agent does while following its policy). The Q-value is updated using the following temporal-difference update rule, derived from the Bellman equation for the action-value function:

[Tex]Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right][/Tex]

Where:

  • [Tex]Q(s_t, a_t)[/Tex] is the current Q-value for the state-action pair at time step t.
  • [Tex]α[/Tex] is the learning rate (a value between 0 and 1), which determines how much the Q-values are updated.
  • [Tex]r_{t+1}[/Tex]​ is the reward received after taking action [Tex]a_t[/Tex]​ in state [Tex]s_t[/Tex]​.
  • [Tex]γ[/Tex] is the discount factor (between 0 and 1), which determines the importance of future rewards.
  • [Tex]Q(s_{t+1}, a_{t+1})[/Tex] is the Q-value for the next state-action pair.

In this update rule:

  • The agent updates its Q-value using both the immediate reward and the discounted Q-value of the next state-action pair.
  • The difference between this target (reward plus discounted next Q-value) and the current estimate, known as the TD error, is used to correct the current Q-value. The short numerical example below walks through one such update.
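
To make the arithmetic concrete, here is a minimal sketch of a single update with assumed values for the learning rate, discount factor, reward, and current Q-values (all hypothetical, chosen only for illustration):

Python
# One SARSA update with assumed (hypothetical) values
alpha = 0.1      # learning rate (assumed)
gamma = 0.9      # discount factor (assumed)

q_current = 2.0  # Q(s_t, a_t) before the update (hypothetical)
q_next = 3.0     # Q(s_{t+1}, a_{t+1}) under the current policy (hypothetical)
reward = -1.0    # immediate reward r_{t+1} (hypothetical)

td_target = reward + gamma * q_next       # -1 + 0.9 * 3.0 = 1.7
td_error = td_target - q_current          # 1.7 - 2.0 = -0.3
q_updated = q_current + alpha * td_error  # 2.0 + 0.1 * (-0.3) ≈ 1.97

print(q_updated)  # approximately 1.97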

SARSA Algorithm

Here’s a high-level view of the SARSA algorithm:

1. Initialize:

  • Initialize Q-values arbitrarily for each state-action pair.
  • Choose an initial state [Tex]s_0[/Tex]​.

2. Loop for each episode:

  • Set the initial state [Tex]s_t[/Tex] and choose an action [Tex]a_t[/Tex] based on a policy (e.g., [Tex]\varepsilon[/Tex]-greedy).
  • Repeat for each step of the episode:
    • Take action [Tex]a_t[/Tex], observe reward [Tex]r_{t+1}[/Tex], and transition to state [Tex]s_{t+1}[/Tex].
    • Choose the next action [Tex]a_{t+1}[/Tex] from state [Tex]s_{t+1}[/Tex] based on the same policy.
    • Update the Q-value for the state-action pair [Tex](s_t, a_t)[/Tex] using the SARSA update rule.
    • Set [Tex]s_t = s_{t+1}[/Tex] and [Tex]a_t = a_{t+1}[/Tex].
  • Continue until the episode ends (when the agent reaches a terminal state or after a fixed number of steps).

SARSA Implementation in Grid World

Step 1: Define the Environment (GridWorld)

The environment is a grid world where the agent can move up, down, left, or right. The environment includes:

  • A start position.
  • A goal position.
  • Obstacles that the agent should avoid.
  • A reward structure (e.g., negative reward for hitting obstacles, positive reward for reaching the goal).

The GridWorld class handles the environment dynamics, including resetting the state and taking actions.

Python
class GridWorld:
    def __init__(self, width, height, start, goal, obstacles):
        self.width = width
        self.height = height
        self.start = start
        self.goal = goal
        self.obstacles = obstacles
        self.state = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        x, y = self.state
        if action == 0:  # Up
            x = max(x - 1, 0)
        elif action == 1:  # Down
            x = min(x + 1, self.height - 1)
        elif action == 2:  # Left
            y = max(y - 1, 0)
        elif action == 3:  # Right
            y = min(y + 1, self.width - 1)

        next_state = (x, y)

        if next_state in self.obstacles:
            reward = -10
            done = True
        elif next_state == self.goal:
            reward = 10
            done = True
        else:
            reward = -1
            done = False

        self.state = next_state
        return next_state, reward, done
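
Before training, it can help to sanity-check the environment dynamics. The snippet below is a minimal sketch using an assumed 3×3 grid with a single obstacle (values chosen only for illustration); it resets the world and takes two actions to confirm the transition and reward logic:

Python
# Quick sanity check of GridWorld on a small, assumed 3x3 layout
env = GridWorld(width=3, height=3, start=(0, 0), goal=(2, 2), obstacles=[(1, 1)])

state = env.reset()                      # (0, 0)
next_state, reward, done = env.step(3)   # Right -> (0, 1): reward -1, episode continues
print(next_state, reward, done)
next_state, reward, done = env.step(1)   # Down -> (1, 1): hits the obstacle, reward -10, episode ends
print(next_state, reward, done)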


Step 2: Define the SARSA Algorithm

The SARSA algorithm iterates over episodes, updating the Q-values based on the agent’s experience at every step.

Python
import numpy as np


def sarsa(env, episodes, alpha, gamma, epsilon):
    # Initialize the Q-table with zeros: one value per (row, column, action)
    Q = np.zeros((env.height, env.width, 4))

    for episode in range(episodes):
        state = env.reset()
        action = epsilon_greedy_policy(Q, state, epsilon)
        done = False

        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy_policy(Q, next_state, epsilon)

            # SARSA update rule: Q(s, a) += alpha * (r + gamma * Q(s', a') - Q(s, a))
            td_target = reward + gamma * Q[next_state[0], next_state[1], next_action]
            td_error = td_target - Q[state[0], state[1], action]
            Q[state[0], state[1], action] += alpha * td_error

            state = next_state
            action = next_action

    return Q

Step 3: Define the Epsilon-Greedy Policy

The epsilon-greedy policy balances exploration and exploitation:

  • With probability ϵ, the agent chooses a random action (exploration).
  • With probability 1−ϵ, the agent chooses the action with the highest Q-value for the current state (exploitation).
Python
import random
import numpy as np

def epsilon_greedy_policy(Q, state, epsilon):
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, 3)  # Explore: pick a random action
    else:
        return np.argmax(Q[state[0], state[1]])  # Exploit: pick the greedy action
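
As a quick illustration with a hypothetical one-state Q-table, setting epsilon to 0 always returns the greedy action, while setting it to 1 (almost) always explores:

Python
import numpy as np

# Hypothetical Q-table for a 1x1 grid: four action values for the single state (0, 0)
Q_demo = np.zeros((1, 1, 4))
Q_demo[0, 0] = [0.1, 0.5, -0.2, 0.3]

print(epsilon_greedy_policy(Q_demo, (0, 0), epsilon=0.0))  # 1 (the greedy action)
print(epsilon_greedy_policy(Q_demo, (0, 0), epsilon=1.0))  # a random action in {0, 1, 2, 3}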

Step 4: Set Up the Environment and Run SARSA

This step involves:

  • Defining the grid world parameters (width, height, start, goal, obstacles).
  • Setting the SARSA hyperparameters (episodes, learning rate, discount factor, exploration rate).
  • Running the SARSA algorithm and printing the learned Q-values.
Python
if __name__ == "__main__":
    # Define the grid world environment
    width = 5
    height = 5
    start = (0, 0)
    goal = (4, 4)
    obstacles = [(2, 2), (3, 2)]
    env = GridWorld(width, height, start, goal, obstacles)

    # SARSA parameters
    episodes = 1000
    alpha = 0.1  # Learning rate
    gamma = 0.99  # Discount factor
    epsilon = 0.1  # Exploration rate

    # Run SARSA
    Q = sarsa(env, episodes, alpha, gamma, epsilon)

    # Print the learned Q-values
    print("Learned Q-values:")
    print(Q)

Output:

The script prints the learned Q-table: a 5 × 5 × 4 array containing one Q-value for each state-action pair.

After running the SARSA algorithm, the Q-values represent the expected cumulative reward for each state-action pair. The agent uses these Q-values to make decisions in the environment. Higher Q-values indicate better actions for a given state.
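
One common way to inspect the learned behavior is to follow the greedy policy implied by the Q-table from the start state to the goal. The helper below is a hypothetical sketch (not part of the code above) that does exactly that, with the step cap of 50 being an assumed safety limit:

Python
import numpy as np

def greedy_rollout(env, Q, max_steps=50):
    """Follow the greedy (argmax) action in each state and return the visited states."""
    state = env.reset()
    path = [state]
    for _ in range(max_steps):
        action = np.argmax(Q[state[0], state[1]])  # purely greedy, no exploration
        state, reward, done = env.step(action)
        path.append(state)
        if done:
            break
    return path

# Using env and Q from the script above
print("Greedy path:", greedy_rollout(env, Q))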

Exploration Strategies in SARSA

SARSA typically uses an exploration-exploitation strategy to choose actions. The most common strategy is ε-greedy, where:

  • Exploitation: With probability [Tex]1 - \epsilon[/Tex], the agent chooses the action with the highest Q-value.
  • Exploration: With probability [Tex]\epsilon[/Tex], the agent chooses a random action to explore new possibilities.

Over time, [Tex]\epsilon[/Tex] is often decayed (reduced) to shift from exploration to exploitation as the agent gains more experience in the environment.
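
A simple way to implement this is multiplicative decay with a lower bound. The schedule below is only a sketch with assumed values (starting epsilon 1.0, decay rate 0.995, floor 0.01); SARSA itself does not prescribe a particular schedule:

Python
# Sketch of a decaying-epsilon schedule (all constants are assumed values)
epsilon = 1.0          # start fully exploratory
epsilon_min = 0.01     # never stop exploring entirely
epsilon_decay = 0.995  # multiplicative decay applied after each episode

for episode in range(1000):
    # ... run one SARSA episode using the current epsilon ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)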

Strengths of SARSA

  1. Stability with Exploration: Because SARSA evaluates the exploring policy it actually follows, it tends to learn safer, more stable behavior in environments where the agent must balance exploration and exploitation (for example, when exploratory moves can hit obstacles).
  2. On-Policy Learning: Since SARSA learns from the policy the agent is actually following, it is better suited for situations where the agent must learn from its exploration.

Weaknesses of SARSA

  1. Slower Convergence: Because SARSA is on-policy, it might take longer to converge compared to off-policy methods like Q-learning, especially when there is a lot of exploration (the code comparison after this list shows the one-line difference between the two updates).
  2. Sensitive to the Exploration Strategy: The agent’s performance depends on how exploration is managed (e.g., how [Tex]\epsilon[/Tex] is decayed in the ε-greedy strategy).
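
The contrast with Q-learning comes down to the bootstrap target used in the update: SARSA uses the Q-value of the action the policy actually selects next, while Q-learning uses the maximum Q-value over all actions in the next state. Inside the training loop from Step 2, the difference is a single line:

Python
# SARSA (on-policy): bootstrap from the action the agent will actually take
sarsa_target = reward + gamma * Q[next_state[0], next_state[1], next_action]

# Q-learning (off-policy): bootstrap from the best action, regardless of what is taken
q_learning_target = reward + gamma * np.max(Q[next_state[0], next_state[1]])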

Understanding SARSA is essential for building RL agents that can effectively learn from their interactions with the environment, especially when an on-policy approach is required.


