alphaviz.preprocessing

This module provides functions for preprocessing the data.

Functions:

convert_diann_ap_mod(sequence)

Convert DIA-NN style modifications to AlphaPept style modifications.

convert_diann_mq_mod(sequence)

Convert DIA-NN style modifications to MaxQuant style modifications.

filter_df(df, pattern, column, software)

Filter the data frame based on the pattern (any value) in the specified column.

get_aa_seq(protein_id, fasta)

Extract the sequence of the leading razor protein of a protein group, given the group's list of protein IDs, from the pyteomics.fasta.IndexedUniProt object.

get_identified_ions(values, sequence, ion_type)

For the specified peptide sequence, collect all ions identified in the experiment and, based on the specified ion_type, return a list of booleans indicating, for b-ions, whether the peptide breaks after the aligned amino acid or, for y-ions, whether it breaks before the aligned amino acid.

get_mq_ms2_scan_data(msms, ...)

Extract MS2 data as a data frame for the specified MSMS scan number and precursor ID from the 'msms.txt' MQ output file and raw file.

get_mq_unique_proteins(filepath)

Extract unique "Protein names" from the specified MaxQuant output file.

get_protein_info(fasta, protein_ids)

Get the name and the length of the protein(s) with the specified protein ID(s) from the fasta file.

get_protein_info_from_fastaheader(string, ...)

Extract information about protein IDs, protein names and gene names from the "Fasta headers" column of the MQ output tables.

sort_naturally(line[, reverse])

Sort the string in natural (human) order, e.g. 4,1,6,11 is sorted as 1,4,6,11.

alphaviz.preprocessing.convert_diann_ap_mod(sequence: str) → str

Convert DIA-NN style modifications to AlphaPept style modifications.

Parameters

sequence (str) – A peptide sequence with a DIA-NN style modification.

Returns

A peptide sequence with AlphaPept style modification.

Return type

str
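
The conversion can be pictured as a token substitution. The following is only a minimal sketch, not the alphaviz implementation: it assumes that DIA-NN writes a modification as a "(UniMod:N)" suffix after the residue and that the target style uses a lowercase prefix before the residue. The function name and the two-entry mapping are hypothetical and only serve the illustration.

```python
import re

# Hypothetical, deliberately incomplete mapping (UniMod ID -> lowercase prefix).
# The table that alphaviz actually uses is not shown in this reference.
UNIMOD_TO_PREFIX = {"4": "c", "35": "ox"}


def diann_to_prefix_style(sequence: str, mapping=UNIMOD_TO_PREFIX) -> str:
    """Rewrite 'X(UniMod:N)' tokens as '<prefix>X' for every known N."""

    def _swap(match: re.Match) -> str:
        residue, unimod_id = match.group(1), match.group(2)
        prefix = mapping.get(unimod_id)
        # Unknown modifications are left untouched rather than guessed.
        return match.group(0) if prefix is None else prefix + residue

    return re.sub(r"([A-Z])\(UniMod:(\d+)\)", _swap, sequence)


print(diann_to_prefix_style("PEPC(UniMod:4)M(UniMod:35)K"))
```

Leaving unknown modifications unchanged keeps the sketch safe to run on sequences whose modifications are not in the mapping.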

alphaviz.preprocessing.convert_diann_mq_mod(sequence: str) → str

Convert DIA-NN style modifications to MaxQuant style modifications.

Parameters

sequence (str) – A peptide sequence with a DIA-NN style modification.

Returns

A peptide sequence with MaxQuant style modification.

Return type

str

alphaviz.preprocessing.filter_df(df: DataFrame, pattern: str, column: str, software: str) → DataFrame

Filter the data frame based on the pattern (any value) in the specified column.

Parameters
  • df (pd.DataFrame) – The original data frame.

  • pattern (str) – The string used to filter the data frame column.

  • column (str) – The column to be used to filter.

  • software (str) – The name of the software tool where the filtering is used.

Returns

The filtered data frame.

Return type

pd.DataFrame

alphaviz.preprocessing.get_aa_seq(protein_id: str, fasta) → str

Extract the sequence of the leading razor protein of a protein group, given the group's list of protein IDs, from the pyteomics.fasta.IndexedUniProt object.

Parameters
  • protein_id (str) – String containing the protein IDs of all protein isoforms.

  • fasta (pyteomics.fasta.IndexedUniProt object) – The pyteomics.fasta.IndexedUniProt object.

Returns

Protein sequence for the leading razor protein, e.g. from the list of proteinIDs ‘Q15149;Q15149-7;Q15149-9;Q15149-5’ the AA sequence for protein Q15149 will be returned.

Return type

str
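
The selection of the leading protein described above can be sketched as follows. This is an illustration under stated assumptions, not the library code: the helper names are hypothetical, the protein IDs are assumed to be separated by ';' as in the example string, and a plain dict with a placeholder sequence stands in for the pyteomics.fasta.IndexedUniProt object.

```python
def leading_protein_id(protein_ids: str) -> str:
    """Return the first ID of a ';'-separated protein group string."""
    return protein_ids.split(";")[0].strip()


def leading_sequence_sketch(protein_ids: str, sequences: dict) -> str:
    """Look up the sequence of the leading protein in an ID -> sequence mapping."""
    return sequences[leading_protein_id(protein_ids)]


# Stand-in for the indexed FASTA; "PEPTIDE" is a placeholder, not real data.
toy_sequences = {"Q15149": "PEPTIDE"}

group = "Q15149;Q15149-7;Q15149-9;Q15149-5"
print(leading_protein_id(group))
print(leading_sequence_sketch(group, toy_sequences))
```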

alphaviz.preprocessing.get_identified_ions(values: list, sequence: str, ion_type: str) → list

For the specified peptide sequence, collect all ions identified in the experiment and, based on the specified ion_type, return a list of booleans indicating, for b-ions, whether the peptide breaks after the aligned amino acid or, for y-ions, whether it breaks before the aligned amino acid.

E.g. for the peptide ‘NTINHN’ having the unique ion values [‘b2-H2O’, ‘b2’, ‘b3’, ‘b5-NH3’] it will return the following list of booleans: [False,True,True,False,True,False].

Parameters
  • values (list) – The list of all unique identified ions for the peptide in the experiment.

  • sequence (str) – Peptide sequence.

  • ion_type (str) – Ion type, e.g. ‘b’ or ‘y’. Other ion types are not implemented.

Returns

A list of booleans with one entry per amino acid of the peptide: True for an identified ion and False for a missing one.

Return type

list
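
The mapping from ion labels to positions can be sketched as below. This is a minimal reimplementation for illustration, not the alphaviz source. It reproduces the documented b-ion example; the y-ion indexing (ion yN marks the residue at position len(sequence) - N) is an assumption derived from the "breaks before the aligned amino acid" description, and the function name is hypothetical.

```python
import re


def identified_ions_sketch(values: list, sequence: str, ion_type: str) -> list:
    """Mark the residue positions at which an ion of the given type was identified."""
    flags = [False] * len(sequence)
    pattern = re.compile(rf"^{re.escape(ion_type)}(\d+)")
    for value in values:
        match = pattern.match(value)  # 'b2-H2O' and 'b2' both yield ordinal 2
        if match is None:
            continue  # ion of another type
        ordinal = int(match.group(1))
        if ion_type == "b":
            index = ordinal - 1  # break after the ordinal-th residue
        else:
            index = len(sequence) - ordinal  # assumed: break before this residue
        if 0 <= index < len(sequence):
            flags[index] = True
    return flags


print(identified_ions_sketch(["b2-H2O", "b2", "b3", "b5-NH3"], "NTINHN", "b"))
# → [False, True, True, False, True, False]
```

Neutral-loss variants such as ‘b2-H2O’ count toward the same position as the unmodified ion, which is what the documented example implies.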

alphaviz.preprocessing.get_mq_ms2_scan_data(msms: DataFrame, selected_msms_scan: int, raw_data, precursor_id: int) → DataFrame

Extract MS2 data as a data frame for the specified MSMS scan number and precursor ID from the ‘msms.txt’ MQ output file and raw file.

Parameters
  • msms (pd.DataFrame) – Pre-loaded ‘msms.txt’ MQ output file.

  • selected_msms_scan (int) – MSMS scan number.

  • raw_data (AlphaTims TimsTOF object) – AlphaTims TimsTOF object.

  • precursor_id (int) – The identifier of the precursor.

Returns

For the specified MSMS scan and precursor ID, the extracted data frame contains the following columns:
  • ’mz_values’

  • ’intensity_values’

  • ’ions’

  • ’wrong_dev_value’: whether the mass_deviation specified in the MQ table was incorrect.

Return type

pd.DataFrame

alphaviz.preprocessing.get_mq_unique_proteins(filepath: str) → list

Extract unique “Protein names” from the specified MaxQuant output file.

Parameters

filepath (str) – Full path to the file.

Returns

A list of unique protein names from the specified file.

Return type

list
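
Collecting unique values from a tab-separated MaxQuant table can be sketched with the standard library alone. This is an illustration only: the function name is hypothetical, the real function may rely on pandas, and whether it splits multi-protein cells or sorts the result is not stated in this reference, so the sketch does neither.

```python
import csv


def unique_column_values(lines, column: str = "Protein names") -> list:
    """Return the unique non-empty values of one column, in order of first appearance.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            seen.setdefault(value, None)
    return list(seen)


# Toy table standing in for a MaxQuant output file.
toy_table = [
    "Protein names\tScore\n",
    "Keratin\t1\n",
    "Plectin\t2\n",
    "Keratin\t3\n",
]
print(unique_column_values(toy_table))

# With a real file (path is a placeholder):
# with open("proteinGroups.txt", newline="") as handle:
#     names = unique_column_values(handle)
```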

alphaviz.preprocessing.get_protein_info(fasta: dict, protein_ids: str)

Get the name and the length of the protein(s) with the specified protein ID(s) from the fasta file.

Parameters
  • fasta (pyteomics.fasta.IndexedUniProt object) – The Pyteomics object containing information about all proteins from the .fasta file.

  • protein_ids (str) – A string of protein IDs separated by commas.

Returns

The name and the length of the specified protein(s).

Return type

tuple of strings

alphaviz.preprocessing.get_protein_info_from_fastaheader(string: str, **kwargs)

Extract information about protein IDs, protein names and gene names from the “Fasta headers” column of the MQ output tables.

Parameters

string (str) –

A ‘Fasta header’ string from the MQ output table for one protein group (e.g. from the proteinGroups.txt file).

E.g. a complex one: ‘sp|Q3SY84|K2C71_HUMAN Keratin, type II cytoskeletal 71 OS=Homo sapiens OX=9606 GN=KRT71 PE=1 SV=3;;sp|Q14CN4|K2C72_HUMAN Keratin, type II cytoskeletal 72 OS=Homo sapiens OX=9606 GN=KRT72 PE=1 SV=2;;;sp|Q7RTS7|K2C74_HUMAN Keratin, type II cytoskeletal 74 OS’

Returns

The function returns a tuple of three strings containing information about the protein names, protein IDs and gene names.

Return type

a tuple of strings
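
How such a header decomposes can be sketched with a few string operations. This is a hedged illustration, not the alphaviz implementation: the function name is hypothetical, it assumes UniProt-style entries of the form "db|ID|ENTRY_NAME description OS=... GN=...", and it returns one record per protein instead of the three strings described above, because the joining and ordering used by the real function are not specified here.

```python
import re


def parse_fasta_headers_sketch(header: str) -> list:
    """Split a 'Fasta headers' cell into per-protein records (ID, name, gene)."""
    records = []
    for entry in header.split(";"):
        parts = entry.strip().split("|", 2)
        if len(parts) < 3:
            continue  # empty piece between ';;' or a non-UniProt entry
        protein_id, rest = parts[1], parts[2]
        # Drop the entry name (e.g. 'K2C71_HUMAN'); the description follows it.
        _, _, description = rest.partition(" ")
        # The protein name ends where 'OS=' starts (or at a truncated 'OS').
        name = re.split(r"\s+OS(?:=|$)", description, maxsplit=1)[0].strip()
        gene_match = re.search(r"\bGN=(\S+)", description)
        records.append({
            "protein_id": protein_id,
            "protein_name": name,
            "gene_name": gene_match.group(1) if gene_match else "",
        })
    return records


example = (
    "sp|Q3SY84|K2C71_HUMAN Keratin, type II cytoskeletal 71 "
    "OS=Homo sapiens OX=9606 GN=KRT71 PE=1 SV=3;;"
    "sp|Q7RTS7|K2C74_HUMAN Keratin, type II cytoskeletal 74 OS"
)
for record in parse_fasta_headers_sketch(example):
    print(record)
```

The truncated last entry in the example has no GN field, so the sketch reports an empty gene name for it rather than guessing one.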

alphaviz.preprocessing.sort_naturally(line: str, reverse: bool = False) → str

Sort the string in natural (human) order, e.g. 4,1,6,11 will be sorted as 1,4,6,11 and not as 1,11,4,6.

Parameters
  • line (str) – The string to be sorted.

  • reverse (bool) – Whether to sort in reverse order. Defaults to False.

Returns

A naturally sorted string.

Return type

str
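
Natural ordering is commonly implemented by splitting each token into digit and non-digit runs and comparing the digit runs as integers. The sketch below shows that technique; it is not the alphaviz source. The function name is hypothetical and the separator is an assumption: a comma is used here to match the example above, while the actual function may split on a different character.

```python
import re


def _natural_key(token: str) -> list:
    """Split 'a10' into ['a', 10, ''] so that digit runs compare numerically."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", token)]


def natural_sort_sketch(line: str, reverse: bool = False, separator: str = ",") -> str:
    """Sort the separator-delimited tokens of `line` in natural order."""
    tokens = [token.strip() for token in line.split(separator)]
    return separator.join(sorted(tokens, key=_natural_key, reverse=reverse))


print(natural_sort_sketch("4,1,6,11"))                # → 1,4,6,11
print(natural_sort_sketch("4,1,6,11", reverse=True))  # → 11,6,4,1
```

A plain lexicographic sort of the same tokens would give 1,11,4,6, which is the ordering the function is meant to avoid.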