7  Extract GSE data from NCBI database

The NCBI website maintains the Gene Expression Omnibus (GEO), a public functional genomics data repository, and provides a GEO accession display tool for viewing individual records.

For example, look at the GEO accession for GSE109816.

In this notebook we’ll explore how to access that record programmatically using Python.

7.1 GEOparse

The Python library GEOparse provides access to the GEO database.

You can install it using:

!pip install GEOparse
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: GEOparse in /home/anand/.local/lib/python3.10/site-packages (2.0.3)
Requirement already satisfied: numpy>=1.7 in /home/anand/.local/lib/python3.10/site-packages (from GEOparse) (1.24.1)
Requirement already satisfied: tqdm>=4.31.1 in /home/anand/.local/lib/python3.10/site-packages (from GEOparse) (4.65.0)
Requirement already satisfied: pandas>=0.17 in /home/anand/.local/lib/python3.10/site-packages (from GEOparse) (1.5.2)
Requirement already satisfied: requests>=2.21.0 in /home/anand/.local/lib/python3.10/site-packages (from GEOparse) (2.28.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /home/anand/.local/lib/python3.10/site-packages (from pandas>=0.17->GEOparse) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/lib/python3/dist-packages (from pandas>=0.17->GEOparse) (2022.1)
Requirement already satisfied: charset-normalizer<3,>=2 in /home/anand/.local/lib/python3.10/site-packages (from requests>=2.21.0->GEOparse) (2.1.1)
Requirement already satisfied: idna<4,>=2.5 in /usr/lib/python3/dist-packages (from requests>=2.21.0->GEOparse) (3.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/lib/python3/dist-packages (from requests>=2.21.0->GEOparse) (2020.6.20)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/lib/python3/dist-packages (from requests>=2.21.0->GEOparse) (1.26.5)
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.1->pandas>=0.17->GEOparse) (1.16.0)

Once it is installed, you can import the module and fetch a record by its accession id.

import GEOparse
geo_id = "GSE109816"
gse = GEOparse.get_GEO(geo=geo_id)
19-May-2023 21:40:54 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:40:54 INFO GEOparse - File already exist: using local version.
19-May-2023 21:40:54 INFO GEOparse - Parsing ./GSE109816_family.soft.gz: 
19-May-2023 21:40:54 DEBUG GEOparse - DATABASE: GeoMiame
19-May-2023 21:40:54 DEBUG GEOparse - SERIES: GSE109816
19-May-2023 21:40:54 DEBUG GEOparse - PLATFORM: GPL18573
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970358
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970359
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970360
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970361
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970362
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970363
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970364
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970365
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970366
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970367
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970368
19-May-2023 21:40:54 DEBUG GEOparse - SAMPLE: GSM2970369
gse
<SERIES: None - 12 SAMPLES, 0 d(s)>
gse.name
gse.geotype
'SERIES'
gse.gpls
{}
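
The log lines printed by get_GEO come from Python's standard logging module. The following is a minimal sketch for quieting them. It assumes that GEOparse logs through a logger named "GEOparse"; that name is an assumption and worth checking against the installed version.

```python
import logging

# Assumption: GEOparse emits its messages through a logger named "GEOparse".
geoparse_logger = logging.getLogger("GEOparse")

# Show only warnings and errors; DEBUG and INFO lines are suppressed.
geoparse_logger.setLevel(logging.WARNING)
```

Run this before calling GEOparse.get_GEO so that the setting is in place when the record is parsed.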

7.2 Extracting Information from the record

We are interested in extracting the following information from the record:

  • Title
  • Organism
  • Experiment type
  • Summary
  • Contact name
  • Contributor
  • Submitter
  • Overall Design
  • Platform (available by following the platform ID link)

The gse.metadata dictionary has all these fields.

title = gse.metadata['title'][0]
expression_type = gse.metadata['type'][0]
summary = gse.metadata['summary'][0]
contact_name = gse.metadata['contact_name'][0]
contributors = gse.metadata['contributor']
overall_design = gse.metadata['overall_design'][0]

print("Title:", title)
print("Contact Name:", contact_name)
print("Contributors:", contributors)
print()

print("Expression Type:", expression_type)
print("Overall Design:", overall_design)
print()

print("Summary:")
print(summary)
print()
Title: Dissecting cell composition and cell-cell interaction network of normal human heart tissue by single-cell sequencing
Contact Name: Li,,Wang
Contributors: ['Li,,Wang', 'Peng,,Yu', 'Zheng,,Li', 'Zongna,,Ren']

Expression Type: Expression profiling by high throughput sequencing
Overall Design: Extract cells from left ateria and ventricle of normal heart and conduct single-cell sequencing of 9994 cells.

Summary:
We studied the cell compositon of normal human heart by single-cell sequencing. Distint subgroups of cardiac muscle, fibroblast cell and endothelial cell were detected. We drawed a cell-cell interaction network using specific expressed ligands and receptors of cells. And we also observed the change of interaction and cell transformation with age.
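
The lookups above index each metadata value with [0], which raises an error when a record lacks that key. A small sketch of a more forgiving accessor follows. The helper name first_value is hypothetical, and a plain dictionary stands in for gse.metadata so that the example runs without a download.

```python
def first_value(metadata, key, default=None):
    """Return the first entry stored under key, or default if absent or empty."""
    values = metadata.get(key) or []
    return values[0] if values else default


# Stand-in for gse.metadata, trimmed to two fields from the output above.
sample_metadata = {
    "title": ["Dissecting cell composition and cell-cell interaction network "
              "of normal human heart tissue by single-cell sequencing"],
    "contributor": ["Li,,Wang", "Peng,,Yu", "Zheng,,Li", "Zongna,,Ren"],
}

print(first_value(sample_metadata, "title"))
# A missing key falls back to the default instead of raising KeyError.
print(first_value(sample_metadata, "overall_design", default="(not provided)"))
```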

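We got all the fields except the organism and the platform. The record stores both as references: the organism as an NCBI Taxonomy id, which we can resolve with the Entrez module from Biopython, and the platform as the accession of another GEO record, which we can fetch with GEOparse.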

organism_id = gse.metadata['sample_taxid'][0]
platform_id = gse.metadata['platform_id'][0]
organism_id
'9606'
platform_id
'GPL18573'
from Bio import Entrez
Entrez.email = "anand+pybfx@pipal.in"

def get_tax_data(taxid):
    handle = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml")
    records = Entrez.read(handle)
    if records:
        return records[0]

def get_scientific_name(tax_id):
    data = get_tax_data(tax_id)
    return data['ScientificName']
get_scientific_name("9606")
'Homo sapiens'
platform_id
'GPL18573'
geo_tech = GEOparse.get_GEO(platform_id)
19-May-2023 21:24:57 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:24:57 INFO GEOparse - File already exist: using local version.
19-May-2023 21:24:57 INFO GEOparse - Parsing ./GPL18573.txt: 
19-May-2023 21:24:57 DEBUG GEOparse - PLATFORM: GPL18573
platform = geo_tech.metadata['title'][0]
platform
'Illumina NextSeq 500 (Homo sapiens)'
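
GEO accessions carry their record type in the prefix: GSE for a series, GSM for a sample, GPL for a platform and GDS for a curated dataset. The sketch below checks which kind of id is at hand before fetching it. The helper name geo_kind is hypothetical and is not part of GEOparse.

```python
import re

GEO_KINDS = {"GSE": "series", "GSM": "sample", "GPL": "platform", "GDS": "dataset"}
_ACCESSION = re.compile(r"(GSE|GSM|GPL|GDS)\d+")


def geo_kind(accession):
    """Return the record type implied by a GEO accession prefix."""
    match = _ACCESSION.fullmatch(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a GEO accession: {accession!r}")
    return GEO_KINDS[match.group(1)]


print(geo_kind("GSE109816"))
print(geo_kind("GPL18573"))
```

A taxonomy id such as "9606" is rejected with a ValueError, which makes it harder to pass it to the GEO lookup by mistake.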

7.2.1 Putting all of this together

import GEOparse
from Bio import Entrez

# replace this with your email
Entrez.email = "anand+pybfx@pipal.in"


def get_tax_data(taxid):
    handle = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml")
    records = Entrez.read(handle)
    if records:
        return records[0]

def get_scientific_name(tax_id):
    data = get_tax_data(tax_id)
    return data['ScientificName']

def get_geo_title(geo_id):
    # look up the record for the id passed in and return its title
    return GEOparse.get_GEO(geo=geo_id).metadata['title'][0]

geo_id = "GSE109816"

gse = GEOparse.get_GEO(geo=geo_id)

title = gse.metadata['title'][0]
expression_type = gse.metadata['type'][0]

contact_name = gse.metadata['contact_name'][0]
contributors = gse.metadata['contributor']
overall_design = gse.metadata['overall_design'][0]

# "organization" holds the scientific name of the organism
organization_id = gse.metadata['sample_taxid'][0]
organization = get_scientific_name(organization_id)

# platform_id is stored as a list; take the first platform accession
platform_id = gse.metadata['platform_id'][0]
platform = get_geo_title(platform_id)

summary = gse.metadata['summary'][0]

print("Title:", title)
print("Contact Name:", contact_name)
print("Contributors:", contributors)
print()

print("Expression Type:", expression_type)
print("Overall Design:", overall_design)
print()

print("Organization:", organization)
print("Platform:", platform)
print()

print("Summary:")
print(summary)
print()
19-May-2023 21:25:26 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:25:26 INFO GEOparse - File already exist: using local version.
19-May-2023 21:25:26 INFO GEOparse - Parsing ./GSE109816_family.soft.gz: 
19-May-2023 21:25:26 DEBUG GEOparse - DATABASE: GeoMiame
19-May-2023 21:25:26 DEBUG GEOparse - SERIES: GSE109816
19-May-2023 21:25:26 DEBUG GEOparse - PLATFORM: GPL18573
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970358
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970359
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970360
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970361
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970362
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970363
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970364
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970365
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970366
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970367
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970368
19-May-2023 21:25:26 DEBUG GEOparse - SAMPLE: GSM2970369
19-May-2023 21:25:27 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:25:27 INFO GEOparse - File already exist: using local version.
19-May-2023 21:25:27 INFO GEOparse - Parsing ./GPL18573.txt: 
19-May-2023 21:25:27 DEBUG GEOparse - PLATFORM: GPL18573
Title: Dissecting cell composition and cell-cell interaction network of normal human heart tissue by single-cell sequencing
Contact Name: Li,,Wang
Contributors: ['Li,,Wang', 'Peng,,Yu', 'Zheng,,Li', 'Zongna,,Ren']

Expression Type: Expression profiling by high throughput sequencing
Overall Design: Extract cells from left ateria and ventricle of normal heart and conduct single-cell sequencing of 9994 cells.

Organization: Homo sapiens
Platform: Illumina NextSeq 500 (Homo sapiens)

Summary:
We studied the cell compositon of normal human heart by single-cell sequencing. Distint subgroups of cardiac muscle, fibroblast cell and endothelial cell were detected. We drawed a cell-cell interaction network using specific expressed ligands and receptors of cells. And we also observed the change of interaction and cell transformation with age.
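
The contact and contributor names are printed in a comma-separated form such as 'Li,,Wang'. Assuming the three parts are first name, middle name or initial, and last name, a small sketch for making them readable could look like this. The helper name format_geo_name is hypothetical.

```python
def format_geo_name(raw):
    """Turn 'First,Middle,Last' into 'First Middle Last', skipping empty parts."""
    parts = [part.strip() for part in raw.split(",")]
    return " ".join(part for part in parts if part)


contributors = ['Li,,Wang', 'Peng,,Yu', 'Zheng,,Li', 'Zongna,,Ren']
print([format_geo_name(name) for name in contributors])
```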

7.3 Extracting Multiple Records and saving as CSV

We can change the program to download multiple records together, collect the data in a pandas DataFrame, and then export it to a CSV file.

import GEOparse
from Bio import Entrez
import pandas as pd

# replace this with your email
Entrez.email = "anand+pybfx@pipal.in"
def get_tax_data(taxid):
    """Returns the record from the NCBI taxonomy database.
    """
    handle = Entrez.efetch(id=taxid, db="taxonomy", retmode="xml")
    records = Entrez.read(handle)
    if records:
        return records[0]

def get_scientific_name(tax_id):
    """Returns the scientific name given a taxonomy id.
    """
    data = get_tax_data(tax_id)
    return data['ScientificName']

def get_geo_title(geo_id):
    """Returns the title from the GEO record given the id.
    """
    return GEOparse.get_GEO(geo=geo_id).metadata['title'][0]

def get_geo_record(geo_id):
    """Returns the GEO record as a dictionary.
    """
    gse = GEOparse.get_GEO(geo=geo_id)

    title = gse.metadata['title'][0]
    expression_type = gse.metadata['type'][0]

    contact_name = gse.metadata['contact_name'][0]

    # we need contributors as a single field, separating them with |
    contributors = " | ".join(gse.metadata['contributor'])

    overall_design = gse.metadata['overall_design'][0]

    # "organization" holds the scientific name of the organism
    organization_id = gse.metadata['sample_taxid'][0]
    organization = get_scientific_name(organization_id)

    # platform_id is stored as a list; take the first platform accession
    platform_id = gse.metadata['platform_id'][0]
    platform = get_geo_title(platform_id)

    summary = gse.metadata['summary'][0]
    return {
        "id": geo_id,
        "title": title,
        "expression_type": expression_type,
        "contact_name": contact_name,
        "contributors": contributors,
        "overall_design": overall_design,
        "organization": organization,
        "platform": platform
    }

Let’s check that get_geo_record works.

get_geo_record("GSE109816")
19-May-2023 21:31:37 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:31:37 INFO GEOparse - File already exist: using local version.
19-May-2023 21:31:37 INFO GEOparse - Parsing ./GSE109816_family.soft.gz: 
19-May-2023 21:31:37 DEBUG GEOparse - DATABASE: GeoMiame
19-May-2023 21:31:37 DEBUG GEOparse - SERIES: GSE109816
19-May-2023 21:31:37 DEBUG GEOparse - PLATFORM: GPL18573
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970358
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970359
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970360
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970361
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970362
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970363
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970364
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970365
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970366
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970367
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970368
19-May-2023 21:31:37 DEBUG GEOparse - SAMPLE: GSM2970369
19-May-2023 21:31:39 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:31:39 INFO GEOparse - File already exist: using local version.
19-May-2023 21:31:39 INFO GEOparse - Parsing ./GPL18573.txt: 
19-May-2023 21:31:39 DEBUG GEOparse - PLATFORM: GPL18573
{'title': 'Dissecting cell composition and cell-cell interaction network of normal human heart tissue by single-cell sequencing',
 'expression_type': 'Expression profiling by high throughput sequencing',
 'contact_name': 'Li,,Wang',
 'contributors': 'Li,,Wang | Peng,,Yu | Zheng,,Li | Zongna,,Ren',
 'overall_design': 'Extract cells from left ateria and ventricle of normal heart and conduct single-cell sequencing of 9994 cells.',
 'organization': 'Homo sapiens',
 'platform': 'Illumina NextSeq 500 (Homo sapiens)'}

That seems to be working.
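
When several records are processed, the same taxonomy id or platform id tends to recur, and each repeat costs another lookup. The sketch below shows how functools.lru_cache can remember earlier answers. slow_lookup is a hypothetical stand-in for get_scientific_name with a hard-coded table, so the example runs without network access.

```python
from functools import lru_cache

calls = []  # records every uncached lookup, for demonstration only


def slow_lookup(tax_id):
    """Hypothetical stand-in for get_scientific_name; no network access."""
    calls.append(tax_id)
    known = {"9606": "Homo sapiens", "10090": "Mus musculus"}
    return known[tax_id]


@lru_cache(maxsize=None)
def cached_lookup(tax_id):
    """Same result as slow_lookup, but each id is resolved only once."""
    return slow_lookup(tax_id)


names = [cached_lookup(t) for t in ["9606", "10090", "9606", "9606"]]
print(names)
print(len(calls))  # number of lookups that actually ran
```

The same decorator could be applied to get_scientific_name and get_geo_title directly, since both take a single string argument.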

Let’s see how to process multiple records.

def get_geo_records(geo_ids):
    """Gets multiple geo records from NCBI geo database as a pandas dataframe.
    """
    records = [get_geo_record(geo_id) for geo_id in geo_ids]
    df = pd.DataFrame(records)
    df.set_index("id", inplace=True)
    return df
geo_ids = ["GSE109816", "GSE109817", "GSE109818", "GSE109819", "GSE109820"]
df = get_geo_records(geo_ids)
19-May-2023 21:50:57 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:50:57 INFO GEOparse - File already exist: using local version.
19-May-2023 21:50:57 INFO GEOparse - Parsing ./GSE109816_family.soft.gz: 
19-May-2023 21:50:57 DEBUG GEOparse - DATABASE: GeoMiame
19-May-2023 21:50:57 DEBUG GEOparse - SERIES: GSE109816
19-May-2023 21:50:57 DEBUG GEOparse - PLATFORM: GPL18573
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970358
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970359
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970360
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970361
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970362
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970363
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970364
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970365
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970366
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970367
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970368
19-May-2023 21:50:57 DEBUG GEOparse - SAMPLE: GSM2970369
19-May-2023 21:50:58 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:50:58 INFO GEOparse - File already exist: using local version.
19-May-2023 21:50:58 INFO GEOparse - Parsing ./GPL18573.txt: 
19-May-2023 21:50:58 DEBUG GEOparse - PLATFORM: GPL18573
19-May-2023 21:50:59 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:50:59 INFO GEOparse - File already exist: using local version.
19-May-2023 21:50:59 INFO GEOparse - Parsing ./GSE109817_family.soft.gz: 
19-May-2023 21:50:59 DEBUG GEOparse - DATABASE: GeoMiame
19-May-2023 21:50:59 DEBUG GEOparse - SERIES: GSE109817
19-May-2023 21:50:59 DEBUG GEOparse - PLATFORM: GPL19057
19-May-2023 21:50:59 DEBUG GEOparse - SAMPLE: GSM2970370
19-May-2023 21:50:59 DEBUG GEOparse - SAMPLE: GSM2970371
19-May-2023 21:50:59 DEBUG GEOparse - SAMPLE: GSM2970372
19-May-2023 21:50:59 DEBUG GEOparse - SAMPLE: GSM2970373
19-May-2023 21:50:59 DEBUG GEOparse - SAMPLE: GSM2970374
19-May-2023 21:50:59 DEBUG GEOparse - SAMPLE: GSM2970375
19-May-2023 21:50:59 DEBUG GEOparse - SAMPLE: GSM2970376
19-May-2023 21:50:59 DEBUG GEOparse - SAMPLE: GSM2970377
19-May-2023 21:50:59 DEBUG GEOparse - SAMPLE: GSM2970378
19-May-2023 21:50:59 DEBUG GEOparse - SAMPLE: GSM2970379
19-May-2023 21:50:59 DEBUG GEOparse - SAMPLE: GSM2970380
19-May-2023 21:50:59 DEBUG GEOparse - SAMPLE: GSM2970381
19-May-2023 21:51:00 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:51:00 INFO GEOparse - File already exist: using local version.
19-May-2023 21:51:00 INFO GEOparse - Parsing ./GPL18573.txt: 
19-May-2023 21:51:00 DEBUG GEOparse - PLATFORM: GPL18573
19-May-2023 21:51:01 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:51:01 INFO GEOparse - File already exist: using local version.
19-May-2023 21:51:01 INFO GEOparse - Parsing ./GSE109818_family.soft.gz: 
19-May-2023 21:51:01 DEBUG GEOparse - DATABASE: GeoMiame
19-May-2023 21:51:01 DEBUG GEOparse - SERIES: GSE109818
19-May-2023 21:51:01 DEBUG GEOparse - PLATFORM: GPL570
/home/anand/.local/lib/python3.10/site-packages/GEOparse/GEOparse.py:401: DtypeWarning: Columns (2) have mixed types. Specify dtype option on import or set low_memory=False.
  return read_csv(StringIO(data), index_col=None, sep="\t")
19-May-2023 21:51:03 DEBUG GEOparse - SAMPLE: GSM2970382
19-May-2023 21:51:03 DEBUG GEOparse - SAMPLE: GSM2970383
19-May-2023 21:51:03 DEBUG GEOparse - SAMPLE: GSM2970384
19-May-2023 21:51:03 DEBUG GEOparse - SAMPLE: GSM2970385
19-May-2023 21:51:03 DEBUG GEOparse - SAMPLE: GSM2970386
19-May-2023 21:51:03 DEBUG GEOparse - SAMPLE: GSM2970387
19-May-2023 21:51:03 DEBUG GEOparse - SAMPLE: GSM2970388
19-May-2023 21:51:03 DEBUG GEOparse - SAMPLE: GSM2970389
19-May-2023 21:51:03 DEBUG GEOparse - SAMPLE: GSM2970390
19-May-2023 21:51:05 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:51:05 INFO GEOparse - File already exist: using local version.
19-May-2023 21:51:05 INFO GEOparse - Parsing ./GPL18573.txt: 
19-May-2023 21:51:05 DEBUG GEOparse - PLATFORM: GPL18573
19-May-2023 21:51:06 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:51:06 INFO GEOparse - File already exist: using local version.
19-May-2023 21:51:06 INFO GEOparse - Parsing ./GSE109819_family.soft.gz: 
19-May-2023 21:51:06 DEBUG GEOparse - DATABASE: GeoMiame
19-May-2023 21:51:06 DEBUG GEOparse - SERIES: GSE109819
19-May-2023 21:51:06 DEBUG GEOparse - PLATFORM: GPL18133
19-May-2023 21:51:06 DEBUG GEOparse - SAMPLE: GSM2970396
19-May-2023 21:51:06 DEBUG GEOparse - SAMPLE: GSM2970397
19-May-2023 21:51:06 DEBUG GEOparse - SAMPLE: GSM2970398
19-May-2023 21:51:06 DEBUG GEOparse - SAMPLE: GSM2970399
19-May-2023 21:51:06 DEBUG GEOparse - SAMPLE: GSM2970400
19-May-2023 21:51:06 DEBUG GEOparse - SAMPLE: GSM2970401
19-May-2023 21:51:07 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:51:07 INFO GEOparse - File already exist: using local version.
19-May-2023 21:51:07 INFO GEOparse - Parsing ./GPL18573.txt: 
19-May-2023 21:51:07 DEBUG GEOparse - PLATFORM: GPL18573
19-May-2023 21:51:08 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:51:08 INFO GEOparse - File already exist: using local version.
19-May-2023 21:51:08 INFO GEOparse - Parsing ./GSE109820_family.soft.gz: 
19-May-2023 21:51:08 DEBUG GEOparse - DATABASE: GeoMiame
19-May-2023 21:51:08 DEBUG GEOparse - SERIES: GSE109820
19-May-2023 21:51:08 DEBUG GEOparse - PLATFORM: GPL11154
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970402
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970403
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970404
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970405
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970406
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970407
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970408
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970409
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970410
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970411
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970412
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970413
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970414
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970415
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970416
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970417
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970418
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970419
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970420
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970421
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970422
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970423
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970424
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970425
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970426
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970427
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970428
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970429
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970430
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970431
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970432
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970433
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970434
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970435
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970436
19-May-2023 21:51:08 DEBUG GEOparse - SAMPLE: GSM2970437
19-May-2023 21:51:09 DEBUG utils - Directory ./ already exists. Skipping.
19-May-2023 21:51:09 INFO GEOparse - File already exist: using local version.
19-May-2023 21:51:09 INFO GEOparse - Parsing ./GPL18573.txt: 
19-May-2023 21:51:09 DEBUG GEOparse - PLATFORM: GPL18573
df
title expression_type contact_name contributors overall_design organization platform
id
GSE109816 Dissecting cell composition and cell-cell inte... Expression profiling by high throughput sequen... Li,,Wang Li,,Wang | Peng,,Yu | Zheng,,Li | Zongna,,Ren Extract cells from left ateria and ventricle o... Homo sapiens Illumina NextSeq 500 (Homo sapiens)
GSE109817 RNA-sequencing of mouse adult hippocampal prog... Expression profiling by high throughput sequen... Michael,,Piper Lachlan,,Harris | Michael,,Piper Hippocampal nestin+ flox-reporter progenitor c... Mus musculus Illumina NextSeq 500 (Homo sapiens)
GSE109818 Changes in gene expression in human skeletal s... Expression profiling by array domenico,,raimondo Domenico,,Raimondo | Cristina,,Remoli | Letizi... Human bone marrow stromal cells (hBMSCs) (deri... Homo sapiens Illumina NextSeq 500 (Homo sapiens)
GSE109819 Transcriptome analysis of Escherichia coli str... Expression profiling by high throughput sequen... Pablo,Emiliano,Tomatis Pablo,E,Tomatis | Andreas,,Plueckthun 6 samples, 3 replicates Escherichia coli Illumina NextSeq 500 (Homo sapiens)
GSE109820 Early dynamics of ERa and GRHL2 binding on sti... Genome binding/occupancy profiling by high thr... Andrew,Nicholas,Holding Andrew,N,Holding | Florian,,Markowetz ChIP-seq data in MCF7 at three time-points for... Homo sapiens Illumina NextSeq 500 (Homo sapiens)
df.to_csv("gse.csv")
!cat gse.csv
id,title,expression_type,contact_name,contributors,overall_design,organization,platform
GSE109816,Dissecting cell composition and cell-cell interaction network of normal human heart tissue by single-cell sequencing,Expression profiling by high throughput sequencing,"Li,,Wang","Li,,Wang | Peng,,Yu | Zheng,,Li | Zongna,,Ren",Extract cells from left ateria and ventricle of normal heart and conduct single-cell sequencing of 9994 cells.,Homo sapiens,Illumina NextSeq 500 (Homo sapiens)
GSE109817,RNA-sequencing of mouse adult hippocampal progenitor cells in which Nfix was deleted.,Expression profiling by high throughput sequencing,"Michael,,Piper","Lachlan,,Harris | Michael,,Piper","Hippocampal nestin+ flox-reporter progenitor cells (3 wt, 3 kos - 60 days post tamoxifen adminstration), dcx+ flox-reporter progenitor cells (3 wt, 3 kos - 7 days post administration)",Mus musculus,Illumina NextSeq 500 (Homo sapiens)
GSE109818,Changes in gene expression in human skeletal stem cells transduced with constitutively active Gsα correlates with hallmark histopathological changes seen in fibrous dysplastic bone,Expression profiling by array,"domenico,,raimondo","Domenico,,Raimondo | Cristina,,Remoli | Letizia,,Astrologo | Romina,,Burla | Mattia,,La Torre | Fiammetta,,Verni’ | Enrico,,Tagliafico | Alessandro,,Corsi | Pamela,G,Robey | Mara,,Riminucci | Isabella,,Saggio | Agnese,,Persichetti | Letizia,,Astrologo | Simona,,Del Giudice | Giuseppe,,Giannicola","Human bone marrow stromal cells (hBMSCs) (derived from bone marrow aspirates) from three independent healthy donors were isolated. Lentiviral vectors (LV-GSαR201C and LV-ctr) were generated, produced and titrated.  The LV-vector integrated copy number was calculated by Q-PCR as described and was established as ~1 copy of integrated lentiviral sequence per transduced cell. hBMSCs were transduced with LV-GSαR201C and LV-ctr or mock treated.",Homo sapiens,Illumina NextSeq 500 (Homo sapiens)
GSE109819,Transcriptome analysis of Escherichia coli strains producing exogenous internal membrane proteins,Expression profiling by high throughput sequencing,"Pablo,Emiliano,Tomatis","Pablo,E,Tomatis | Andreas,,Plueckthun","6 samples, 3 replicates",Escherichia coli,Illumina NextSeq 500 (Homo sapiens)
GSE109820,Early dynamics of ERa and GRHL2 binding on stimulation with estradiol,Genome binding/occupancy profiling by high throughput sequencing,"Andrew,Nicholas,Holding","Andrew,N,Holding | Florian,,Markowetz",ChIP-seq data in MCF7 at three time-points for ERa and two time points for GRHL2 after stimulation with 100 nM E2,Homo sapiens,Illumina NextSeq 500 (Homo sapiens)
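
Several fields in the file contain commas, so they are written inside double quotes. The sketch below uses the standard csv module with an in-memory buffer in place of gse.csv to show that such fields survive a round trip, and that the contributors can be split back on the " | " separator. The single row is copied from the output above and trimmed to three columns.

```python
import csv
import io

rows = [
    {"id": "GSE109819",
     "contact_name": "Pablo,Emiliano,Tomatis",
     "contributors": "Pablo,E,Tomatis | Andreas,,Plueckthun"},
]

# Write the row to an in-memory buffer; fields with commas get quoted.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "contact_name", "contributors"])
writer.writeheader()
writer.writerows(rows)

# Read it back and split the contributors into a list again.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
contributors = loaded[0]["contributors"].split(" | ")
print(loaded[0]["contact_name"])
print(contributors)
```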

7.4 Summary

We’ve seen how to download the data for a GEO accession from the NCBI GEO database using Python. We’ve also extended the program to download multiple records, collect them in a pandas DataFrame, and export it as a CSV file.

Please note that the GEOparse library downloads the GEO data into the current directory. See the GEOparse documentation for how to store the downloaded files in a different directory.
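
As a rough sketch, the downloads can be kept out of the working directory by creating a dedicated folder and passing it to get_GEO through its destdir argument. The folder name geo_cache and both helper names are hypothetical. GEOparse is imported inside fetch_series so that the directory helper runs even where the library is not installed.

```python
from pathlib import Path


def prepare_cache_dir(path="geo_cache"):
    """Create the download folder if needed and return it as a Path."""
    cache_dir = Path(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


def fetch_series(geo_id, cache_dir):
    """Download (or reuse) the record, storing its files in cache_dir."""
    import GEOparse  # imported here so the helper above has no dependency

    return GEOparse.get_GEO(geo=geo_id, destdir=str(cache_dir))
```

A call such as fetch_series("GSE109816", prepare_cache_dir()) would then place GSE109816_family.soft.gz under geo_cache instead of the current directory.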
