alphaviz.preprocessing

This module provides functions that help preprocess the data.

Functions:

`convert_diann_ap_mod`(sequence)	Convert DIA-NN style modifications to AlphaPept style modifications.
`convert_diann_mq_mod`(sequence)	Convert DIA-NN style modifications to MaxQuant style modifications.
`filter_df`(df, pattern, column, software)	Filter the data frame based on the pattern (any value) in the specified column.
`get_aa_seq`(protein_id, fasta)	Extract the leading razor protein sequence for the list of proteinIDs of the protein group from the pyteomics.fasta.IndexedUniProt object.
`get_identified_ions`(values, sequence, ion_type)	For the specified peptide sequence, collect all ions identified in the experiment and return a list of booleans for the given ion_type: for b-ions, whether the peptide breaks after the aligned amino acid; for y-ions, whether it breaks before the aligned amino acid.
`get_mq_ms2_scan_data`(msms, ...)	Extract MS2 data as a data frame for the specified MSMS scan number and precursor ID from the 'msms.txt' MQ output file and raw file.
`get_mq_unique_proteins`(filepath)	Extract unique "Protein names" from the specified MaxQuant output file.
`get_protein_info`(fasta, protein_ids)	Get the name and the length of the protein(s) from the fasta file for the specified protein ID(s).
`get_protein_info_from_fastaheader`(string, ...)	Extract information about protein IDs, protein names and gene names from the "Fasta headers" column of the MQ output tables.
`sort_naturally`(line[, reverse])	Sort the string in natural (human) order, e.g. 4,1,6,11 is sorted as 1,4,6,11 and not as 1,11,4,6.

alphaviz.preprocessing.convert_diann_ap_mod(sequence: str) → str

Convert DIA-NN style modifications to AlphaPept style modifications.

Parameters: sequence (str) – A peptide sequence with a DIA-NN style modification.
Returns: A peptide sequence with AlphaPept style modification.
Return type: str

alphaviz.preprocessing.convert_diann_mq_mod(sequence: str) → str

Convert DIA-NN style modifications to MaxQuant style modifications.

Parameters: sequence (str) – A peptide sequence with a DIA-NN style modification.
Returns: A peptide sequence with MaxQuant style modification.
Return type: str

alphaviz.preprocessing.filter_df(df: DataFrame, pattern: str, column: str, software: str) → DataFrame

Filter the data frame based on the pattern (any value) in the specified column.

Parameters

df (pd.DataFrame) – The original data frame.
pattern (str) – The string used to filter the data frame column.
column (str) – The column to be used to filter.
software (str) – The name of the software tool where the filtering is used.

Returns

The filtered data frame.

Return type

pd.DataFrame

alphaviz.preprocessing.get_aa_seq(protein_id: str, fasta) → str

Extract the leading razor protein sequence for the list of proteinIDs of the protein group from the pyteomics.fasta.IndexedUniProt object.

Parameters

protein_id (str) – String containing the protein IDs of all protein isoforms.
fasta (pyteomics.fasta.IndexedUniProt object) – The pyteomics.fasta.IndexedUniProt object.

Returns

Protein sequence for the leading razor protein, e.g. from the list of proteinIDs ‘Q15149;Q15149-7;Q15149-9;Q15149-5’ the AA sequence for protein Q15149 will be returned.

Return type

str
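The selection of the leading protein described above can be illustrated with a minimal sketch. It is not the AlphaViz implementation: it assumes that the leading razor protein is the first entry of a semicolon-separated ID string (as the documented example suggests), and it uses a plain dict with made-up sequences as a stand-in for the pyteomics.fasta.IndexedUniProt object. The function name is hypothetical.

```python
def leading_protein_sequence_sketch(protein_id: str, fasta: dict) -> str:
    """Return the sequence stored for the first ID in a ';'-separated string.

    Assumption: the first ID is the leading razor protein.
    `fasta` is a plain dict here, not a pyteomics object.
    """
    leading_id = protein_id.split(";")[0].strip()
    return fasta[leading_id]


# Toy stand-in data: the sequences are invented placeholders, not real proteins.
fasta_stub = {
    "Q15149": "MKTAYIAK",
    "Q15149-7": "MKTA",
}

sequence = leading_protein_sequence_sketch(
    "Q15149;Q15149-7;Q15149-9;Q15149-5", fasta_stub
)
print(sequence)
```

Only the entry for Q15149 is looked up; the isoform IDs after the first semicolon are ignored.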

alphaviz.preprocessing.get_identified_ions(values: list, sequence: str, ion_type: str) → list

For the specified peptide sequence, collect all ions identified in the experiment and return a list of booleans for the given ion_type: for b-ions, whether the peptide breaks after the aligned amino acid; for y-ions, whether it breaks before the aligned amino acid.

E.g. for the peptide ‘NTINHN’ having the unique ion values [‘b2-H2O’, ‘b2’, ‘b3’, ‘b5-NH3’] it will return the following list of booleans: [False,True,True,False,True,False].

Parameters

values (list) – The list of all unique identified ions for the peptide in the experiment.
sequence (str) – Peptide sequence.
ion_type (str) – Ion type, either ‘b’ or ‘y’. Other ion types are not implemented.

Returns

A list of booleans, one per amino acid of the peptide, with True for an identified ion and False for a missing one.

Return type

list
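The mapping from ion labels to per-residue booleans can be sketched as follows. This is an illustration under stated assumptions, not the AlphaViz source: the function name is hypothetical, ion labels are assumed to start with the ion letter followed by the fragment number (neutral-loss suffixes such as ‘-H2O’ are ignored), and the y-ion indexing (y-ion n marks the residue at position length − n) is an interpretation of "breaking before the aligned amino acid".

```python
import re


def identified_ions_sketch(values: list, sequence: str, ion_type: str) -> list:
    """Mark which residues of `sequence` have an identified ion of `ion_type`."""
    if ion_type not in ("b", "y"):
        raise ValueError("Only 'b' and 'y' ions are handled in this sketch.")
    flags = [False] * len(sequence)
    for value in values:
        # Read the leading ion letter and fragment number, e.g. 'b5' in 'b5-NH3'.
        match = re.match(r"([by])(\d+)", value)
        if match is None or match.group(1) != ion_type:
            continue
        number = int(match.group(2))
        if not 1 <= number <= len(sequence):
            continue
        # b-ion n: break after residue n (index n - 1).
        # y-ion n: break before the n-th residue from the C-terminus (assumed).
        index = number - 1 if ion_type == "b" else len(sequence) - number
        flags[index] = True
    return flags


b_flags = identified_ions_sketch(["b2-H2O", "b2", "b3", "b5-NH3"], "NTINHN", "b")
print(b_flags)  # [False, True, True, False, True, False]
```

For the b-ion case this reproduces the documented example for ‘NTINHN’.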

alphaviz.preprocessing.get_mq_ms2_scan_data(msms: DataFrame, selected_msms_scan: int, raw_data, precursor_id: int) → DataFrame

Extract MS2 data as a data frame for the specified MSMS scan number and precursor ID from the ‘msms.txt’ MQ output file and raw file.

Parameters

msms (pd.DataFrame) – Pre-loaded ‘msms.txt’ MQ output file.
selected_msms_scan (int) – MSMS scan number.
raw_data (AlphaTims TimsTOF object) – AlphaTims TimsTOF object.
precursor_id (int) – The identifier of the precursor.

Returns

For the specified MSMS scan and precursor ID, the extracted data frame contains the following columns:

‘mz_values’
‘intensity_values’
‘ions’
‘wrong_dev_value’: whether the mass_deviation specified in the MQ table was incorrect.

Return type

pd.DataFrame

alphaviz.preprocessing.get_mq_unique_proteins(filepath: str) → list

Extract unique “Protein names” from the specified MaxQuant output file.

Parameters: filepath (str) – Full path to the file.
Returns: A list of unique protein names from the specified file.
Return type: list

alphaviz.preprocessing.get_protein_info(fasta: dict, protein_ids: str)

Get the name and the length of the protein(s) from the fasta file for the specified protein ID(s).

Parameters

fasta (pyteomics.fasta.IndexedUniProt object) – The Pyteomics object containing information about all proteins from the .fasta file.
protein_ids (str) – The protein IDs, separated by commas.

Returns

The name and the length of the specified protein(s).

Return type

tuple of strings

alphaviz.preprocessing.get_protein_info_from_fastaheader(string: str, **kwargs)

Extract information about protein IDs, protein names and gene names from the “Fasta headers” column of the MQ output tables.

Parameters

string (str) – A ‘Fasta header’ string from the MQ output table for one protein group (e.g. from the proteinGroups.txt file).

Returns

A tuple of three strings containing the protein names, protein IDs and gene names.

Return type

a tuple of strings
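As an illustration of this kind of header parsing, the sketch below extracts the three fields from UniProt-style FASTA headers. It is not the AlphaViz implementation: the function name, the regular expressions, the ‘;’ separator between headers, and the joined-string output format are all assumptions, and headers that do not follow the UniProt ‘sp|ID|ENTRY Name OS=…’ layout are skipped.

```python
import re

# UniProt-style header: db|ID|ENTRY_NAME Protein name OS=... GN=...
HEADER_PATTERN = re.compile(r"^(?:sp|tr)\|(?P<id>[^|]+)\|\S+\s+(?P<name>.+?)\s+OS=")
GENE_PATTERN = re.compile(r"\bGN=(\S+)")


def parse_fasta_headers_sketch(string: str, sep: str = ";") -> tuple:
    """Return (protein names, protein IDs, gene names) as joined strings."""
    names, ids, genes = [], [], []
    for header in string.split(sep):
        header = header.strip()
        match = HEADER_PATTERN.match(header)
        if match is None:
            continue  # not a UniProt-style header; skipped in this sketch
        names.append(match.group("name"))
        ids.append(match.group("id"))
        gene = GENE_PATTERN.search(header)
        if gene is not None:
            genes.append(gene.group(1))
    return sep.join(names), sep.join(ids), sep.join(genes)


header = "sp|Q15149|PLEC_HUMAN Plectin OS=Homo sapiens OX=9606 GN=PLEC PE=1 SV=3"
print(parse_fasta_headers_sketch(header))  # ('Plectin', 'Q15149', 'PLEC')
```

Splitting on ‘;’ is a simplification: a protein name that itself contains a semicolon would be split incorrectly.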

alphaviz.preprocessing.sort_naturally(line: str, reverse: bool = False) → str

Sort the string in natural (human) order, e.g. 4,1,6,11 is sorted as 1,4,6,11 and not as 1,11,4,6.

Parameters

line (str) – The string to be sorted.
reverse (bool) – Whether to sort in reverse order. Defaults to False.

Returns

A naturally sorted string.

Return type

str
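Natural ordering is commonly implemented by comparing runs of digits as integers rather than as text. The sketch below shows that technique; it is not taken from the AlphaViz source, and the function names and the comma separator (chosen to match the example above) are assumptions.

```python
import re


def natural_key(token: str) -> list:
    """Split a token into digit and text runs; digit runs compare as integers."""
    return [
        (0, int(part), "") if part.isdigit() else (1, 0, part.lower())
        for part in re.split(r"(\d+)", token)
        if part
    ]


def sort_naturally_sketch(line: str, reverse: bool = False, sep: str = ",") -> str:
    """Sort the `sep`-separated items of `line` in natural order."""
    tokens = [token.strip() for token in line.split(sep)]
    return sep.join(sorted(tokens, key=natural_key, reverse=reverse))


print(sort_naturally_sketch("4,1,6,11"))                # 1,4,6,11
print(sort_naturally_sketch("4,1,6,11", reverse=True))  # 11,6,4,1
```

A plain string sort would instead place 11 before 4, because it compares the characters ‘1’ and ‘4’ first.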