In this paper, we investigate how distributed reinforcement learning-based resource assignment algorithms can be used to improve the performance of a cognitive radio system. In most of today's wireless systems, including cognitive radio systems under development, decision making depends purely on instantaneous measurements. Two system architectures are investigated in this paper. A point-to-point architecture is examined first in an open spectrum scenario. Then, distributed reinforcement learning-based algorithms are developed by modifying the traditional reinforcement learning model so that it can be applied to a fully distributed cognitive radio system.
Keywords: cognitive radio, resource assignment, spectrum sensing, point-to-point architecture, distributed reinforcement learning
INTRODUCTION
The assignment of spectrum to transmissions and to users is a fundamental issue of wireless communications.
Numerous channel assignment methods have been proposed for sharing the limited physical resource. The traditional
licensed spectrum allocation strategies employed by radio regulatory bodies are very restrictive and extremely inflexible,
resulting in highly underutilized spectrum. A fully dynamic spectrum access technique called Cognitive Radio,
first introduced in [1, 2], has been considered a potential way to improve this inefficient spectrum
utilization: the existing spectrum can be used more efficiently through opportunistic access to the licensed
bands without interfering with the existing users. The definition of cognitive radio suggested by ITU-R [3] is: ‘a radio
system employing a technology, which makes it possible to obtain knowledge of its operational environment, policies
and internal state, to dynamically adjust its parameters and protocols according to the knowledge obtained and to learn
from the results obtained’. The fundamental objective of cognitive radio is to enable an efficient utilization of the
wireless spectrum through a highly reliable approach. Although a cognitive radio may be able to analyze the physical
environment before it sets up a communication link, the best system performance is unlikely to be achieved by either a
random spectrum sensing strategy or a fixed spectrum sensing policy.
Reinforcement learning (RL), a sub-area of machine learning, provides a mathematical way to evaluate the success of actions [4, 5]. Its emphasis on individual learning through direct interaction with the environment makes it well suited to distributed cognitive radio scenarios. There are two main reasons to consider reinforcement learning the most suitable learning approach for cognitive radio systems. First, reinforcement learning is an individual learning approach in which the agent learns only from local observations. Second, reinforcement learning works on a trial-and-error basis, so no model of the environment is required. This also matches cognitive radio systems, which constantly interact with an 'unknown' radio environment on a trial-and-error basis.
This paper introduces the reinforcement learning-based distributed spectrum sharing (RL-DSS) scheme, which enables efficient usage of spectrum by exploiting users' past experience. In the proposed spectrum sharing scheme, a reward value is assigned to a used resource based on the reward function. Cognitive radio users select spectrum resources according to the weight values assigned to the spectral resources: resources with higher weights are given higher priority. Furthermore, we investigate and compare the system performance of different sets of reward values, which effectively act as the weighting factors in the reward function. We will show that different weighting factor values have a significant impact on the system performance, and that an inappropriate weighting factor setting may cause specific problems.
The remainder of this paper is organized as follows. The cognitive radio-based reinforcement learning model is presented in Section II. The reinforcement learning-based distributed spectrum sharing algorithm is described in Section III. Section IV presents the key measurements for evaluating the system, Section V presents the simulation results to validate the analysis, and Section VI concludes the paper.
SYSTEM MODEL |
The reinforcement learning model developed for the cognitive radio scenario is illustrated in Figure 1. The wireless
spectrum is effectively the environment in which cognitive radio (CR) is the learning agent. The way we implement
reinforcement learning in the CR scenario is slightly different from the original reinforcement learning model, owing to a few built-in features of cognitive radio. In the original reinforcement learning system, the value of the current state s under a policy π, denoted Vπ(s), is the basis for choosing the action A(s). An optimal policy is supposed to maximize Vπ(s) at each trial. Vπ(s) is formally defined as [4]:
$$V^{\pi}(s) = E_{\pi}\{ R_t \mid s_t = s \} \qquad (1)$$
$$= E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\Big|\; s_t = s \Big\} \qquad (2)$$
$$= \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{\pi}(s') \big] \qquad (3)$$
In equation (3), the first part, $R^{a}_{ss'}$, is effectively the reward collected in state s, while the other part of the equation, $\gamma V^{\pi}(s')$, is the expected feedback of its successor states s'. It can be clearly seen from equations (1) to (3) that, in order to obtain the optimal policy π*, information about s' is vital: both the number of potential successor states and the estimated value of each successor state s' are essential.
Our strategy is to develop a policy π that maps memory (weight values) to action π : W → A instead of the original
approach, which maps the state of the environment to action π: S → A [12]. The agents in our strategy are fully distributed, so decisions are made only according to local measurements; it is unlikely for a CR to obtain information at the network level. Cognitive radio is able to sense the target spectrum before activation and it is not
supposed to transmit data until unoccupied spectrum has been found. Choosing the most successful spectrum by
reinforcement learning combined with spectrum sensing is the suggested method. A few amendments have been made
to the learning model. The reinforcement learning model we use consists of [4]: a set of memories W, where W is the set of weights of the performed actions stored in the knowledge base; a set of actions A; and a set of numerical rewards R.
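The following minimal Python sketch illustrates such a knowledge base, assuming one weight per spectrum resource; the class name, the zero initial weight and the random tie-breaking rule are illustrative assumptions rather than details taken from the paper.

```python
import random


class KnowledgeBase:
    """Set of memories W: one weight per spectrum resource (= per action in A)."""

    def __init__(self, num_channels, initial_weight=0.0):
        # Actions are channel indices; each carries a weight built up from past rewards.
        self.weights = {ch: initial_weight for ch in range(num_channels)}

    def best_action(self, candidates=None):
        """Policy pi: W -> A. Return the highest-weight channel among `candidates`
        (e.g. the channels found unoccupied by spectrum sensing); ties are broken
        at random so that unexplored channels still get tried."""
        channels = list(candidates) if candidates is not None else list(self.weights)
        top = max(self.weights[ch] for ch in channels)
        return random.choice([ch for ch in channels if self.weights[ch] == top])
```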
A CR will access the communication resource according to the memory of reinforcement learning. The success level
of a particular action, which is whether the target spectrum is suitable for the considered communication request, is
assessed by the learning engine. Based on the assessment, a reward is assigned in order to reinforce the weight of the
performed action in the knowledge base. Since the actions are all strongly connected to the target resources, the weight
is practically a number attached to a used resource, and this number reflects the success level of the
resource. Our goal is to develop an optimal policy mapping weight to action π : W → A that can maximize the value
of the current memory Vπ*(w). Given a set of available weights of used resources and a policy π, the selection of a
specific action is denoted as a = π(w). Then the optimal value function under the optimal policy
π* can be defined as:
$$V^{\pi^{*}}(w) = \sum_{w'} P^{\pi^{*}}_{ww'} \big[ R^{\pi^{*}}_{ww'} + \gamma V^{\pi^{*}}(w') \big] \qquad (4)$$
where w is the weight of the used resources of an agent at time t, w' is the expected value of the weights after the agent takes an action, and $P^{\pi^{*}}_{ww'}$ is the probability of moving from w to w' after taking the action selected by π*. The optimal policy can then be specified as:
$$\pi^{*}(w) = \arg\max_{a} \sum_{w'} P^{a}_{ww'} \big[ R^{a}_{ww'} + \gamma V^{\pi^{*}}(w') \big] \qquad (5)$$
At each communication request the agent chooses a resource which can maximize V*(w) according to its current
memory. Based on the result, the learning engine updates the knowledge base with a reward r. The inner loop within the cognitive radio in Figure 1 proceeds constantly to update the knowledge base, and the complexity of the communication system is thereby reduced.
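For illustration, the inner loop just described could be realised roughly as follows, reusing the KnowledgeBase sketch above; `sense_free_channels` and `transmission_succeeded` are hypothetical hooks standing in for spectrum sensing and link-quality feedback, the reward values are placeholders, and the weight update itself (`update_weight`) is sketched after equation (6) below.

```python
def handle_request(kb, sense_free_channels, transmission_succeeded,
                   reward_success=1.0, reward_failure=-1.0):
    """One communication request of a single CR user (the inner loop of Figure 1)."""
    free = sense_free_channels()           # CR only transmits on unoccupied spectrum
    if not free:
        return None                        # request is blocked: no free resource found
    channel = kb.best_action(free)         # choose the resource maximising V*(w)
    reward = reward_success if transmission_succeeded(channel) else reward_failure
    update_weight(kb, channel, reward)     # learning engine reinforces the memory
    return channel
```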
A key element of reinforcement learning is the value function [8]. A CR user updates its knowledge based on the
feedback of the value function. In other words, the CR user adjusts its operation according to the function. The
following linear function is used as the objective function to update the spectrum sharing strategy in this paper [6, 7]:
(6)
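The explicit form of equation (6) is not reproduced here, so the following sketch assumes a common linear update rule in which each reward pulls the stored weight of the used resource towards the latest reward with learning rate lam; it is an illustrative stand-in for the paper's objective function, not its exact formula.

```python
def update_weight(kb, channel, reward, lam=0.1):
    """Assumed linear update in the spirit of equation (6): W <- (1 - lam) * W + lam * r."""
    kb.weights[channel] = (1.0 - lam) * kb.weights[channel] + lam * reward
```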
|
|
DISTRIBUTED REINFORCEMENT LEARNING - CR SPECTRUM SHARING SCHEME |
|
|
|
|
|
PERFORMANCE EVALUATION |
In this paper we evaluate a few performance parameters related to system capacity. Signal-to-Interference-plus-Noise Ratio (SINR) is used to evaluate link quality, i.e. to determine whether the current user will lose its current service, or to determine the data rate depending on the adaptive modulation applied to the system. Blocking probability and dropping probability are normally used to evaluate link-based wireless systems, e.g. speech-oriented wireless services. The Cumulative Distribution Function (CDF) is used to process the raw data and to describe the statistical behavior of the results.
1) Signal-to-Interference-plus-Noise Ratio (SINR): SINR [9], also known as Carrier-to-Interference-and-Noise Ratio (CINR), is one of the fundamental parameters used to measure the link quality of users in wireless communication. It is defined as the quotient of the average received signal power (S or C) and the average received co-channel interference power (I) plus the noise power from other sources (N). In the point-to-point architecture, the SINR of transmitter n on channel q is derived as:
$$\mathrm{SINR}_{n,q} = \frac{p_{n}\, g_{n,q}}{\sum_{m \ne n} p_{m}\, g_{m,q} + \sigma^{2}} \qquad (7)$$
where $p_{n}$ is the transmit power of the n-th transmitter, $g_{n,q}$ is the gain of the wireless link on channel q, and $\sigma^{2}$ is the noise power. A
frequency separation of backhaul and access is assumed so that the backhaul network and the access network do not
interfere with each other. Then for the backhaul network, SINR measured at ABS n (signal from HBS m in channel q
and sub-channel r) can be derived as:
$$\mathrm{SINR}^{q,r}_{m,n} = \frac{p_{m}\, g^{q,r}_{m,n}}{\sum_{m' \ne m} p_{m'}\, g^{q,r}_{m',n} + \sigma^{2}} \qquad (8)$$
where $g^{q,r}_{m,n}$ is the link gain between HBS m and ABS n, and the sum in the denominator is the interference received from the other HBSs using the same channel and sub-channel. Similarly, for the access network, the SINR measured at MS k (signal from ABS n in channel q and sub-channel r) can be derived as:
$$\mathrm{SINR}^{q,r}_{n,k} = \frac{p_{n}\, g^{q,r}_{n,k}}{\sum_{n' \in \mathcal{A}_{\mathrm{out}}} p_{n'}\, g^{q,r}_{n',k} + \sum_{n'' \in \mathcal{A}_{\mathrm{in}},\, n'' \ne n} p_{n''}\, g^{q,r}_{n'',k} + \sigma^{2}} \qquad (9)$$
where $g^{q,r}_{n,k}$ is the link gain between ABS n and MS k, and $\mathcal{A}_{\mathrm{out}}$ and $\mathcal{A}_{\mathrm{in}}$ denote the sets of ABSs outside and inside the serving cell, respectively. In the denominator, the first term is the interference from all the ABSs in other cells that are using the same frequency, the second term is the interference from the other ABSs in the same cell, and $\sigma^{2}$ is the noise power.
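For illustration, the SINR of equations (7) to (9) can be computed directly from the received powers; the small helper below works with linear-scale powers and returns the result in dB, and the variable names are assumptions.

```python
import math


def sinr_db(signal_power, interference_powers, noise_power):
    """SINR = wanted received power / (sum of co-channel interference + noise).

    All powers are on the same linear scale (e.g. mW); the result is in dB."""
    sinr_linear = signal_power / (sum(interference_powers) + noise_power)
    return 10.0 * math.log10(sinr_linear)


# Example: one wanted link, two co-channel interferers and thermal noise.
# sinr_db(1e-6, [1e-8, 2e-8], 1e-10)  ->  roughly 15.2 dB
```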
2) Cumulative Distribution Function (CDF): As we mentioned before, in order to obtain statistically accurate results
we need to apply Monte Carlo simulation. However, a very large amount of unprocessed data is produced by the Monte Carlo simulation, so appropriate mathematical analysis is required to show the statistical behavior of the results. The cumulative distribution function is the main statistical method applied in this paper. The CDF of x is defined as [10]:
$$F(x) = \int_{-\infty}^{x} f(t)\, dt \qquad (10)$$
where f(x) is the probability density function of x. The results of our simulation, such as blocking probability and dropping probability, are mainly measured at regular points in the service area.
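The empirical CDF of equation (10) is obtained directly from the raw Monte Carlo output, for example the blocking probability recorded at each measurement point in the service area; a minimal sketch follows, with illustrative names.

```python
def empirical_cdf(samples):
    """Return sorted values x_i and F(x_i) = P(X <= x_i), ready for plotting."""
    xs = sorted(samples)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


# Usage (illustrative): xs, F = empirical_cdf(blocking_prob_at_each_point)
```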
3) Blocking Probability and Dropping Probability: Blocking probability and dropping probability [11] are the
measurements we use to evaluate the grade of service. The blocking probability at time t can be defined as:
$$P_{B}(t) = \frac{N_{b}(t)}{N_{a}(t)} \qquad (11)$$
where $P_{B}(t)$ is the blocking probability at time t, $N_{b}(t)$ is the total number of blocked activations of the system by time t, and $N_{a}(t)$ is the total number of activations of the system by time t. Similarly, the dropping probability is defined as:
$$P_{D}(t) = \frac{N_{D}(t)}{N_{sa}(t)} \qquad (12)$$
where $P_{D}(t)$ is the dropping probability by time t, $N_{D}(t)$ is the total number of dropped transmissions by time t, and $N_{sa}(t)$ is the total number of accepted activations by time t.
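Equations (11) and (12) translate into simple running counters; the sketch below keeps the four counts named in the text (N_a, N_b, N_sa and N_D) and computes both probabilities from them, with class and method names chosen for illustration.

```python
class GradeOfServiceCounters:
    """Running counters for blocking and dropping probability, equations (11)-(12)."""

    def __init__(self):
        self.n_activations = 0   # N_a(t): all activation attempts so far
        self.n_blocked = 0       # N_b(t): attempts rejected at set-up
        self.n_accepted = 0      # N_sa(t): attempts that were admitted
        self.n_dropped = 0       # N_D(t): admitted transmissions later lost

    def blocking_probability(self):
        return self.n_blocked / self.n_activations if self.n_activations else 0.0

    def dropping_probability(self):
        return self.n_dropped / self.n_accepted if self.n_accepted else 0.0
```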
SIMULATION RESULTS |
In this paper we employed an event-based scenario: at each event a random subset of pairs is activated. The system parameters used in this paper are shown in Table II. The available spectrum is partitioned autonomously by individual reinforcement learning, and therefore CR users are able to avoid unsuitable spectrum. Figure 3 (a)-(b) shows how the channel partitioning emerges during the simulation. A small value of 10 is used in this simulation for both the number of available channels and the number of users.
|
|
At the beginning of the simulation (Figure 3 (a)), CR users use almost all resources equally. After a certain simulation time, at event 100 (Figure 3 (b)), a few channels already show their priority to certain users; for example, user 3 prefers channel 8 and user 2 prefers channel 3. The channel usage of user 1, however, is still fairly even at this stage. It can be seen that a spectrum sharing equilibrium is established and the channel usage converges to a few preferred channels. Consequently, the CR users are able to avoid collisions by utilizing the experience gained from learning.
|
|
Figures 4 and 5 illustrate the CDFs of blocking probability and dropping probability, respectively. Blocking probability is measured at regular points in the service area and a Cumulative Distribution Function (CDF) of the system blocking probability at these points is derived. In order to analyze the level of system interruption, a CDF of dropping probability is calculated at the same time. All CR users' parameters are exactly the same for each scheme evaluation, so any difference in system performance is caused only by the different weighting factor values.
CONCLUSION |
In this paper, we introduced a reinforcement learning model for cognitive radio and a few basic reinforcement learning-based spectrum sharing schemes. By utilizing the ability to learn, cognitive agents can remember their preferred communication resources and thereby enable an efficient approach to spectrum sensing and sharing. Simulation results show that reinforcement learning-based spectrum sharing algorithms achieve better system performance than non-learning algorithms.
|
References |
- J. Mitola and G. Maguire, "Cognitive radio: making software radios more personal," IEEE Personal Communications, vol. 6, pp. 13-18, Aug. 1999.
- J. Mitola, "Cognitive Radio: An Integrated Agent Architecture for Software Defined Radio," Ph.D. dissertation, Teleinformatics, Royal Institute of Technology (KTH), May 2000.
- ITU-R, WRC-12 Agenda Item 1.19: Software-Defined Radio (SDR) and Cognitive Radio Systems (CRS), 2010.
- R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press, 1998.
- L. P. Kaelbling, et al., "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, May 1996.
- M. Bublin, et al., "Distributed spectrum sharing by reinforcement and game theory," presented at the 5th Karlsruhe Workshop on Software Radio, Karlsruhe, Germany, March 2008.
- T. Jiang, et al., "Performance of Cognitive Radio Reinforcement Spectrum Sharing Using Different Weighting Factors," presented at the International Workshop on Cognitive Networks and Communications (COGCOM), in conjunction with CHINACOM'08, Hangzhou, China, August 2008.
- S. Kapetanakis and D. Kudenko, "Reinforcement learning of coordination in cooperative multi-agent systems," presented at the Eighteenth National Conference on Artificial Intelligence, Edmonton, Alberta, Canada, 2002.
- S. Saunders, Antennas and Propagation for Wireless Communication Systems. Wiley, 1999.
- N. Drakos, "Introduction to Monte Carlo Methods," Computer Based Learning Unit, University of Leeds, Aug. 1994.
- J. D. Gibson, The Mobile Communications Handbook, 1st ed. IEEE Press, 1996.
- T. Jiang, et al., "Two Stage Reinforcement Learning Based Cognitive Radio with Exploration Control," accepted by IET Communications, 2009.