Data Structures

Last updated on 2026-02-20 | Edit this page

Overview

Questions

  • How can I store information in Python?
  • How can I access the elements of an array or a dictionary?
  • How can I efficiently manage tabular data in Python?

Objectives

  • Understand fundamental data structures such as lists, arrays, dictionaries, and Pandas dataframes.
  • Efficient data manipulation, selection and filtering.
  • Tabular data management with Pandas.
  • Read data from a URL.

List


A list is an indexed, ordered, changeable sequence of elements that can be of different types and it allows duplicate values. They are defined using square brackets []:

PYTHON

thislist = ["bacteria", "archea", "fungus", 4]
print(thislist)

OUTPUT

['bacteria', 'archea', 'fungus', 4]

To access elements in a list, we can use their index.

PYTHON

first_element = thislist[0]
print(first_element)

OUTPUT

bacteria

We can also access elements in the list using negative indices, which count from the end of the list towards the beginning.

PYTHON

print(thislist[-1])

OUTPUT

4

You can add elements to a list using the append() method to add an element to the end of the list.

PYTHON

thislist.append('animal')
print(thislist)

OUTPUT

['bacteria', 'archea', 'fungus', 4, 'animal']

You can also extend a list by adding all the elements from another list using the extend() method.

PYTHON

thislist.extend([1,2,3])
print(thislist)

OUTPUT

['bacteria', 'archea', 'fungus', 4, 'animal', 1, 2, 3]

You can remove elements from a list using the word del followed by the index of the element you want to remove.

PYTHON

del thislist[3]
print(thislist)

OUTPUT

['bacteria', 'archea', 'fungus', 'animal', 1, 2, 3]

You can also use the remove() method to remove a specific element by its value.

PYTHON

thislist.remove('animal')
print(thislist)

OUTPUT

['bacteria', 'archea', 'fungus', 1, 2, 3]

You can concatenate two lists using the + operator or the extend() method.

PYTHON

mylist1 = ['bacteria', 'archea', 'fungus']
mylist2 = ['animal', 'algae', 'plant']

newlist = mylist1 + mylist2
print(newlist)

OUTPUT

['bacteria', 'archea', 'fungus', 'animal', 'algae', 'plant']

The sort() method sorts the elements of the list in ascending order by default. All elements need to be of the same type. If we try to sort the elements of the list thislist, we will get an error.

PYTHON

thislist.sort()

OUTPUT

TypeError: '<' not supported between instances of 'int' and 'str'

PYTHON

newlist.sort()
print(newlist)

OUTPUT

['bacteria', 'archea', 'fungus', 'animal', 'algae', 'plant']

To sort them in descending order:

PYTHON

newlist.sort(reverse = True)
print(newlist)

OUTPUT

['plant', 'algae', 'animal', 'fungus', 'archea', 'bacteria']

Consider that the sort() method modifies the original list and does not return any value. To sort a list without modifying the original, you can use the sorted() function, which returns a new sorted list.

Callout

Select elements of a list with slices

Another way to access list data in Python is by using the “slice” method: list[start:end]. This slice includes the elements whose indices are in the range from start to end - 1:

  • start: It is the index from which we start including elements in the slice.
  • end: It is the index up to which we include elements in the slice, but not including the element at this index.

For example, if we want to get a slice of newlist from index 1 to index 4, we would use the notation newlist[1:4] and we will get a slice that includes elements from index 1 (inclusive) to index 4 (exclusive).

PYTHON

newlist[1:4]

OUTPUT

['algae', 'animal', 'fungus']

Another syntax of “slices” in Python with the notation list[start🔚step] refers to the technique for selecting a subset of elements from a list using three parameters:

  • start: Indicates the index from which we start including elements in the slice.
  • end: Indicates the index up to which we include elements in the slice, excluding the element at this index.
  • step: Indicates the size of the step or increment between the selected elements. That is, how many elements we skip in each step.

If we run newlist[::2], we get a slice that includes elements of the newlist, but selecting every second element. It’s like starting from the beginning of the list, ending at the end, and selecting every second element.

PYTHON

newlist[::2]

OUTPUT

['plant', 'animal', 'archea']
Challenge

Exercise 1(Begginer): Manipulating lists

You are working with genetic data represented by the following list:

PYTHON

genes = ["AGCT", "TCGA", "ATCG", "CGTA"]

How can you remove the sequence “TCGA” and add a new sequence “CATG”? What is the sequence at index 2.

You can remove “TCGA” from the list using the remove() method:

PYTHON

genes.remove("TCGA")

To add “CATG” to the list, you can use the append() method:

PYTHON

genes.append("CATG")

To find the sequence at index 2, you can access the element at that index using indexing.

sequence_at_index_2 = genes[2]
print(sequence_at_index_2)

OUTPUT

CGTA

Dictionary


Dictionaries (dict) are collections of key-value pairs. Each element in the dictionary has a key and an associated value. They are defined using curly braces {} and separating the keys and values with colons :. Dictionaries in Python are mutable, meaning you can add, modify, and delete elements as needed.

PYTHON

thisdict = {
  "brand": "Ford",
  "model": "Mustang",
  "year": 1964
}
print(thisdict)

OUTPUT

{'brand': 'Ford', 'model': 'Mustang', 'year': 1964}

You can access the values of the dictionary using their keys.

PYTHON

print(thisdict['brand']) 

OUTPUT

Ford

PYTHON

print(thisdict['year'])

OUTPUT

1964

If you try to access a key that does not exist in the dictionary, Python will raise a KeyError.

PYTHON

print(thisdict['color'])

OUTPUT

KeyError: 'color'

You can add a new key-value pair to the dictionary by simply assigning a value to a new key.

PYTHON

thisdict['color'] = 'red'
print(thisdict)

OUTPUT

{'brand': 'Ford', 'model': 'Mustang', 'year': 1964, 'color': 'red'}

You can modify the value associated with an existing key in the dictionary.

PYTHON

thisdict['year'] = 2022
print(thisdict)

OUTPUT

{'brand': 'Ford', 'model': 'Mustang', 'year': 2022, 'color': 'red'}

You can delete an item from the dictionary using the del word.

PYTHON

del thisdict['color']
print(thisdict)

OUTPUT

{'brand': 'Ford', 'model': 'Mustang', 'year': 2022}

You can also use the pop() method to remove an item and return its value.

PYTHON

thisdict['color'] = 'red'
deleted_value = thisdict.pop('color')
print(deleted_value) 

OUTPUT

red

PYTHON

print(thisdict)

OUTPUT

{'brand': 'Ford', 'model': 'Mustang', 'year': 2022}

Reserved methods for dictionaries include keys(), values(), and items(). These methods provide convenient ways to access different aspects of a dictionary:

  • keys(): This method returns a view object that displays a list of all the keys in the dictionary. It allows you to iterate over the keys or convert them into a list if needed.

PYTHON

keys = thisdict.keys()
print(keys)  

OUTPUT

dict_keys(['brand', 'model', 'year'])
  • values(): This method returns a view object that displays a list of all the values in the dictionary. Similar to keys(), it allows iteration over the values or conversion into a list.

PYTHON

values = thisdict.values()
print(values)

OUTPUT

dict_values(['Ford', 'Mustang', 2022])
  • items(): This method returns a view object that displays a list of tuples, where each tuple contains a key-value pair from the dictionary. It’s useful for iterating over both keys and values simultaneously, or for converting the dictionary into a list of key-value pairs.

PYTHON

items = thisdict.items()
print(items) 

OUTPUT

dict_items([('brand', 'Ford'), ('model', 'Mustang'), ('year', 2022)])
Callout

Modify a dictionary

You can modify (update) a dictionary using the update() method. When you call the update() method on a dictionary, you pass another dictionary as an argument. The method then iterates over the key-value pairs in the second dictionary and adds them to the first dictionary. If any keys in the second dictionary already exist in the first dictionary, their corresponding values are updated to the new values.

PYTHON

otherdict = {'model': 'Focus', 'price': 30000}
thisdict.update(otherdict)
print(thisdict)

OUTPUT

{'brand': 'Ford', 'model': 'Focus', 'year': 2022, 'price': 30000}

To add new entries to our dictionary, we use the append() method.

PYTHON

thisdict = {
   'brand': ['Ford', 'Toyota'],
   'model': ['Mustang', 'Corolla'],
   'year': [1964, 2020],
   'color': ['red', 'blue'],
   'price': [15000, 20000]
}
# Add a new brand, model, year, color, and price
new_brand = 'Honda'
new_model = 'Civic'
new_year = 2019
new_color = 'green'
new_price = 18000

thisdict['brand'].append(new_brand)
thisdict['model'].append(new_model)
thisdict['year'].append(new_year)
thisdict['color'].append(new_color)
thisdict['price'].append(new_price)

# Print the updated dictionary
print(thisdict)

OUTPUT

{'brand': ['Ford', 'Toyota', 'Honda'], 'model': ['Mustang', 'Corolla', 'Civic'], 'year': [1964, 2020, 2019], 'color': ['red', 'blue', 'green'], 'price': [15000, 20000, 18000]}

One way to get the model and year values for each brand is as follows.

PYTHON

# Get the year and model of each car directly from the dictionary
ford_year = thisdict['year'][thisdict['brand'].index('Ford')]
ford_model = thisdict['model'][thisdict['brand'].index('Ford')]

toyota_year = thisdict['year'][thisdict['brand'].index('Toyota')]
toyota_model = thisdict['model'][thisdict['brand'].index('Toyota')]

# Print the year and model of each car
print("Ford's Year:", ford_year)
print("Ford's Model:", ford_model)
print("Toyota's Year:", toyota_year)
print("Toyota's Model:", toyota_model)

OUTPUT

Ford's Year: 1964
Ford's Model: Mustang
Toyota's Year: 2020
Toyota's Model: Corolla
Discussion

Exercise 2(Begginer): Manipulating Dictionaries

Suppose that you have the following dictionary:

genes_dict = {"gene1": { "name": "BRCA1", "start_position": 43044295, "end_position": 43125483},
              "gene2": { "name": "TP53", "start_position": 7571720, "end_position": 7588830},
              "gene3": { "name": "EGFR", "start_position": 55086715, "end_position": 55225454}}
~~~
{: .language-python}

Which command correctly adds a new gene "gene4" with its associated information to an existing genes dictionary in Python?

a) `genes_dict.add("gene4", {"name": "GENE4", "start_position": 123456, "end_position": 234567})`

b) `genes_dict["gene4"] = {"name": "GENE4", "start_position": 123456, "end_position": 234567}`

c) `genes_dict.append("gene4": {"name": "GENE4", "start_position": 123456, "end_position": 234567})`

d) `genes_dict.insert(4, {"name": "GENE4", "start_position": 123456, "end_position": 234567})`

> ## Solution
>
> The correct answer is option b).
> It uses the indexing syntax to add a new key-value pair to the genes_dict dictionary, where the key is "gene4" and the value is a dictionary representing the gene's characteristics.
>
{: .solution}

Array


An method for creating and manipulating multidimensional arrays is by using NumPy. NumPy serves as a foundational library for scientific computing in Python, offering robust support for multidimensional arrays and an extensive array of mathematical functions for efficient memory utilization and rapid numerical operations.

To import it, we use the following.

PYTHON

import numpy as np

To create a one-dimensional array, we use the following.

PYTHON

myarray = np.array([1,2,3,4,5,6])
print(myarray)

OUTPUT

[1 2 3 4 5 6]

In NumPy, there are functions to create arrays of zeros or ones. To create an array filled with zeros, you can use np.zeros(shape), where shape is the desired shape of the array:

PYTHON

zeros_array = np.zeros(10)
print(zeros_array)

OUTPUT

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

To create an array filled with ones, you can use np.ones(shape).

PYTHON

ones_array = np.ones(10)
print(ones_array)

OUTPUT

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Callout

Other ways to create arrays and multidimensional arrays

Another way to create arrays is by using sequences with arange(), for example:

PYTHON

np.arange(10)

OUTPUT

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Or using a notation similar to slices: arange(start, stop, step):

PYTHON

np.arange(1,10,1)

OUTPUT

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

We can also create them using random values:

PYTHON

np.random.randint(0, 10, 5)

OUTPUT

array([9, 6, 0, 2, 7])

Multidimensional arrays can be created in various ways by specifying the dimensions of each dimension. For example, to create a 2-dimensional array filled with zeros, ones, or random values:

PYTHON

array_zeros = np.zeros((3,3))
array_ones = np.ones((3,3))
array_rand = np.random.randint(0,10,(3,3))

print(array_zeros)
print(array_ones)
print(array_rand)

OUTPUT

[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]

[[1. 1. 1.]
[1. 1. 1.]
[1. 1. 1.]]

[[4 5 4]
[1 7 8]
[7 9 9]]

Another way to create a two dimensional array is:

PYTHON

arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr_2d)

OUTPUT

[[1 2 3]
[4 5 6]
[7 8 9]]

The expression len(array) returns the length of the array, which corresponds to the number of elements in the array. For a one-dimensional array, this is the number of elements it contains. For a two-dimensional array, it is the number of rows in the array.

PYTHON

length = len(ones_array)
print(length)

OUTPUT

10

The NumPy library also allows us to load databases using loadtxt. We will use a toy dataset to learn how to import a csv file into a numpy array. The request module is used in Python to make HTTP requests to web servers.

PYTHON

import requests

#  The url to the csv file
url = "https://raw.githubusercontent.com/carpentries-incubator/pangenomics/gh-pages/files/spiral_2d.csv"

# Get the content of the file
response = requests.get(url)
content = response.text

# Load the data into a NumPy array
data = np.loadtxt(content.splitlines(), delimiter=' ')

data

OUTPUT

array([[   2.728,    6.513],
       [   3.776,   26.114],
       [   5.595,   38.47 ],
       ...,
       [2003.23 , 1849.23 ],
       [2003.95 , 1928.41 ],
       [2004.09 , 1966.77 ]])
Challenge

Exercise 3(Begginer): Manipulating arrays

Suppose you have a the following array dna_array containing DNA sequences as strings.

PYTHON

dna_sequences = ["AGCT", "TCGA", "ATCG", "CGTA", "GATTACA"]
dna_array = np.array(dna_sequences)
print(dna_array)

You want to extract the sequences that meet a specific condition. Which NumPy function would you use to extract DNA sequences from dna_array that contain “AT”?

  1. np.extract(dna_array == 'AT', dna_array)

  2. np.where(dna_array == 'AT')

  3. np.extract(np.char.startswith(dna_array, 'AT'), dna_array)

  4. dna_array[np.char.count(dna_array, 'AT') > 0]

The correct is d).

DataFrame


As we see in the first episode, Pandas is an indispensable library specialized in working with tabular data or a DataFrame. We will import the Pandas with its usual alias pd.

PYTHON

import pandas as pd

One form to create a DataFrame is with a dictionary. We will use the dictionary thisdict and then we will convert it to a dataframe.

PYTHON

print(thisdict)

OUTPUT

{'brand': ['Ford', 'Toyota', 'Honda'], 'model': ['Mustang', 'Corolla', 'Civic'], 'year': [1964, 2020, 2019], 'color': ['red', 'blue', 'green'], 'price': [15000, 20000, 18000]}

We will add a row name with index.

PYTHON

df = pd.DataFrame(thisdict, index = ['C1', 'C2', 'C3'])
print(df)

OUTPUT

     brand    model  year  color  price
C1    Ford  Mustang  1964    red  15000
C2  Toyota  Corolla  2020   blue  20000
C3   Honda    Civic  2019  green  18000

We can explore the dataframe using different methods, such as head(), tail(), info(), describe(), etc. For example:

PYTHON

# Print the first two rows
print(df.head(2))

OUTPUT

     brand    model  year color  price
C1    Ford  Mustang  1964   red  15000
C2  Toyota  Corolla  2020  blue  20000

PYTHON

# Print the last two rows
print(df.tail(2))

OUTPUT

     brand    model  year  color  price
C2  Toyota  Corolla  2020   blue  20000
C3   Honda    Civic  2019  green  18000

PYTHON

# Display summary statistics
print(df.describe())

OUTPUT

              year         price
count     3.000000      3.000000
mean   2001.000000  17666.666667
std      32.046841   2516.611478
min    1964.000000  15000.000000
25%    1991.500000  16500.000000
50%    2019.000000  18000.000000
75%    2019.500000  19000.000000
max    2020.000000  20000.000000

PYTHON

# Display the information of the DataFrame
print(df.info())

OUTPUT

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, C1 to C3
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   brand   3 non-null      object
 1   model   3 non-null      object
 2   year    3 non-null      int64
 3   color   3 non-null      object
 4   price   3 non-null      int64
dtypes: int64(2), object(3)
memory usage: 144.0+ bytes
None

Pandas provides different methods for indexing and selecting data from DataFrames. You can use iloc[] for integer-based indexing and loc[] for label-based indexing. For example:

  • Select columns by name:

PYTHON

print(df['model'])

OUTPUT

C1    Mustang
C2    Corolla
C3      Civic
Name: model, dtype: object
  • Select rows by index using iloc[]

PYTHON

print(df.iloc[1])

OUTPUT

brand     Toyota
model    Corolla
year        2020
color       blue
price      20000
Name: C2, dtype: object
  • Select rows by index name using loc[]

PYTHON

print(df.loc['C1'])

OUTPUT

brand       Ford
model    Mustang
year        1964
color        red
price      15000
Name: C1, dtype: object
  • Select rows and columns with iloc[]

PYTHON

print(df.iloc[0:2, 0:2])

OUTPUT

     brand    model
C1    Ford  Mustang
C2  Toyota  Corolla
  • Select multiple columns.

PYTHON

print(df[['model', 'year']])

OUTPUT

      model  year
C1  Mustang  1964
C2  Corolla  2020
C3    Civic  2019

We can filter the data by some conditions. For example:

PYTHON

print(df[df['year']> 2000])

OUTPUT

     brand    model  year  color  price
C2  Toyota  Corolla  2020   blue  20000
C3   Honda    Civic  2019  green  18000

We can modify DataFrames by adding or removing rows and columns, updating values, and performing various data transformations. For example:

PYTHON

# Add a column
df['mileage'] = [150000, 5000, 30000]
print(df)

OUTPUT

     brand    model  year  color  price  mileage
C1    Ford  Mustang  1964    red  15000   150000
C2  Toyota  Corolla  2020   blue  20000     5000
C3   Honda    Civic  2019  green  18000    30000

PYTHON

# Remove a column from the DataFrame
df.drop(columns=['color'], inplace=True)

print(df)

OUTPUT

     brand    model  year  price  mileage
C1    Ford  Mustang  1964  15000   150000
C2  Toyota  Corolla  2020  20000     5000
C3   Honda    Civic  2019  18000    30000

PYTHON

# Update values in a DataFrame
df.loc[df['model'] == 'Corolla', 'mileage'] = 25000

print(df)

OUTPUT

     brand    model  year  price  mileage
C1    Ford  Mustang  1964  15000   150000
C2  Toyota  Corolla  2020  20000    25000
C3   Honda    Civic  2019  18000    30000

Just like with numpy, we can read CSV files that are stored on our computer or on the internet.

PYTHON

# Read the url database
url = "https://raw.githubusercontent.com/carpentries-incubator/pangenomics/gh-pages/files/familias_minis.csv"
df_genes = pd.read_csv(url, index_col=0)
df_genes.head(5)

OUTPUT

                                  g_A909               g_2603V               g_515               g_NEM316
A909|MGIDGNCP_01408  A909|MGIDGNCP_01408  2603V|GBPINHCM_01420   515|LHMFJANI_01310  NEM316|AOGPFIKH_01528
A909|MGIDGNCP_00096  A909|MGIDGNCP_00096  2603V|GBPINHCM_00097   515|LHMFJANI_00097  NEM316|AOGPFIKH_00098
A909|MGIDGNCP_01343  A909|MGIDGNCP_01343                   NaN                  NaN  NEM316|AOGPFIKH_01415
A909|MGIDGNCP_01221  A909|MGIDGNCP_01221                   NaN   515|LHMFJANI_01130                    NaN
A909|MGIDGNCP_01268  A909|MGIDGNCP_01268  2603V|GBPINHCM_01231   515|LHMFJANI_01178  NEM316|AOGPFIKH_01341  
Challenge

Exercise 4(Advanced): Manipulating dataframes

Use the dataframe df_genes and add a column that counts how many genes are in each row, for example, the first row has 4 genes but the third row has only two.

With the following command you can add a column. The method count counts how many elements there are per row, with axis = 1 we can fix the column.

PYTHON

df_genes['Number of genes'] = df_genes.count(axis=1)
Callout

Another data structures: Sets and Tuples

Tuple

Tuples are indexed and ordered sequences, they are immutable, meaning they cannot be modified after creation. The elements of a tuple can be numbers, strings, or combinations of both types. A tuple is defined using parentheses ():

PYTHON

thistuple = ("bacteria", "archea", "fungus", 3)
print(thistuple)

OUTPUT

('bacteria', 'archea', 'fungus', 3)

To access the elements of a tuple, we can use its index. In Python, indexing typically starts at 0. This means that the first element in a sequence (such as a list, tuple, or string) has an index of 0, the second element has an index of 1, and so on.

PYTHON

first_element = thistuple[0]
print(first_element)

OUTPUT

bacteria

We can also access tuple elements using negative indices, which count from the end of the tuple towards the beginning.

PYTHON

print(thistuple[-1])

OUTPUT

3

Set

Sets are unindexed, unordered collections of unique element, duplicates are not allowed. The sets can contain different data types. They are defined using curly braces {}:

PYTHON

myset = {"bacteria", "archea", "fungus", 4}
print(myset)

OUTPUT

{'archea', 3, 'fungus', 'bacteria'}

Sets in Python are mutable, meaning you can add and remove elements, but you cannot directly modify existing elements. To add elements to a set, you can use the add() method.

PYTHON

myset.add('animal')
print(myset)

OUTPUT

{3, 'fungus', 'bacteria', 'animal', 'archea'}

To remove an element from a set, you can use the remove().

PYTHON

myset.remove(3)
print(myset)

OUTPUT

{'fungus', 'bacteria', 'animal', 'archea'}

Sets are useful when you need to store a collection of unique elements and perform efficient set operations such as removing duplicates and comparing collections.

Key Points
  • Gain familiarity with fundamental data structures such as tuples, sets, lists, dictionaries, and arrays.
  • Develop skills in manipulating data by accessing, modifying, and filtering elements within data structures.
  • Learn to work with tabular data using Pandas DataFrames, including loading and exploring data.