alphaviz.preprocessing
This module provides functions that help preprocess the data.
Functions:
| convert_diann_ap_mod(sequence) | Convert DIA-NN style modifications to AlphaPept style modifications. |
| convert_diann_mq_mod(sequence) | Convert DIA-NN style modifications to MaxQuant style modifications. |
| filter_df(df, pattern, column, software) | Filter the data frame based on the pattern (any value) in the specified column. |
| get_aa_seq(protein_id, fasta) | Extract the sequence of the leading razor protein of a protein group from the pyteomics.fasta.IndexedUniProt object. |
| get_identified_ions(values, sequence, ion_type) | For the specified peptide sequence and ion_type, return a list of booleans marking where the peptide breaks: after the aligned amino acid for b-ions, before it for y-ions. |
| get_mq_ms2_scan_data(msms, selected_msms_scan, raw_data, precursor_id) | Extract MS2 data as a data frame for the specified MSMS scan number and precursor ID from the 'msms.txt' MQ output file and raw file. |
| get_mq_unique_proteins(filepath) | Extract unique "Protein names" from the specified MaxQuant output file. |
| get_protein_info(fasta, protein_ids) | Get the name and the length of the protein(s) from the fasta file, given the protein ID(s). |
| get_protein_info_from_fastaheader(string, **kwargs) | Extract information about protein IDs, protein names and gene names from the "Fasta headers" column of the MQ output tables. |
| sort_naturally(line[, reverse]) | Sort the string in natural (human) order. |
- alphaviz.preprocessing.convert_diann_ap_mod(sequence: str) -> str
Convert DIA-NN style modifications to AlphaPept style modifications.
- Parameters
sequence (str) – A peptide sequence with a DIA-NN style modification.
- Returns
A peptide sequence with AlphaPept style modification.
- Return type
str
- alphaviz.preprocessing.convert_diann_mq_mod(sequence: str) -> str
Convert DIA-NN style modifications to MaxQuant style modifications.
- Parameters
sequence (str) – A peptide sequence with a DIA-NN style modification.
- Returns
A peptide sequence with MaxQuant style modification.
- Return type
str
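Both converters can be thought of as a table-driven rewrite of the modification tokens. The following is a minimal sketch of that idea, not the library's implementation: the function name `convert_mods_sketch` and the lookup table are hypothetical, and the MaxQuant-style labels on the right-hand side are assumptions chosen for illustration. It also assumes that DIA-NN writes modifications as `(UniMod:N)` tokens after the modified residue.

```python
import re

# Hypothetical lookup table: the UniMod accessions are real, but the
# MaxQuant-style labels on the right are assumptions for illustration.
UNIMOD_TO_LABEL = {
    "UniMod:4": "Carbamidomethyl (C)",
    "UniMod:21": "Phospho (STY)",
    "UniMod:35": "Oxidation (M)",
}


def convert_mods_sketch(sequence: str) -> str:
    """Replace each '(UniMod:N)' token with a label from the table.

    Tokens that are missing from the table are left unchanged.
    """
    def _swap(match: re.Match) -> str:
        label = UNIMOD_TO_LABEL.get(match.group(1))
        return f"({label})" if label is not None else match.group(0)

    return re.sub(r"\((UniMod:\d+)\)", _swap, sequence)


print(convert_mods_sketch("AAM(UniMod:35)C(UniMod:4)K"))
```

Leaving unknown tokens untouched is a design choice of this sketch; a stricter variant could raise an error instead.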
- alphaviz.preprocessing.filter_df(df: DataFrame, pattern: str, column: str, software: str) -> DataFrame
Filter the data frame based on the pattern (any value) in the specified column.
- Parameters
df (pd.DataFrame) – The original data frame.
pattern (str) – The string used to filter the data frame column.
column (str) – The column to be used to filter.
software (str) – The name of the software tool where the filtering is used.
- Returns
The filtered data frame.
- Return type
pd.DataFrame
- alphaviz.preprocessing.get_aa_seq(protein_id: str, fasta) -> str
Extract the sequence of the leading razor protein of a protein group, given the group's list of protein IDs, from the pyteomics.fasta.IndexedUniProt object.
- Parameters
protein_id (str) – String containing the protein IDs of all protein isoforms.
fasta (pyteomics.fasta.IndexedUniProt object) – The pyteomics.fasta.IndexedUniProt object.
- Returns
Protein sequence for the leading razor protein, e.g. from the list of proteinIDs ‘Q15149;Q15149-7;Q15149-9;Q15149-5’ the AA sequence for protein Q15149 will be returned.
- Return type
str
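The selection step described above can be sketched as follows. This is a simplified illustration under two assumptions: the leading razor protein is the first ID in the semicolon-separated string (as the example in the Returns section suggests), and a plain dict stands in for the indexed FASTA object. The name `leading_protein_sequence` is hypothetical.

```python
def leading_protein_sequence(protein_ids: str, sequences: dict) -> str:
    """Return the sequence of the first protein ID in a ';'-separated string.

    `sequences` is a plain dict standing in for the indexed FASTA object
    (an assumption made to keep the sketch self-contained).
    """
    leading_id = protein_ids.split(";")[0].strip()
    return sequences[leading_id]


# Toy sequences for illustration only, not real UniProt entries.
toy_fasta = {"Q15149": "MVAGMLMPRD", "Q15149-7": "MVAGML"}
print(leading_protein_sequence("Q15149;Q15149-7;Q15149-9;Q15149-5", toy_fasta))
```

Only the first ID is looked up, so isoforms missing from the lookup table do not cause an error in this sketch.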
- alphaviz.preprocessing.get_identified_ions(values: list, sequence: str, ion_type: str) -> list
For the specified peptide sequence, collect all ions identified in the experiment and, based on the specified ion_type, return a list of booleans. For b-ions, each value indicates whether the peptide breaks after the aligned amino acid; for y-ions, whether it breaks before the aligned amino acid.
E.g. for the peptide ‘NTINHN’ with the unique ion values [‘b2-H2O’, ‘b2’, ‘b3’, ‘b5-NH3’] and ion_type ‘b’, it returns the following list of booleans: [False,True,True,False,True,False].
- Parameters
values (list) – The list of all unique identified ions for the peptide in the experiment.
sequence (str) – Peptide sequence.
ion_type (str) – Ion type, e.g. ‘b’ or ‘y’. Other ion types are not implemented.
- Returns
A list of booleans, one per amino acid of the peptide, with True for an identified ion and False for a missing one.
- Return type
list
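The mapping from ion labels to positions can be sketched as below. This is a minimal re-implementation of the described behaviour for illustration, not the library's code. It assumes that ion labels start with the ion type followed by the ion number (e.g. ‘b2’ or ‘b5-NH3’), and the name `identified_ions_sketch` is hypothetical.

```python
import re


def identified_ions_sketch(values: list, sequence: str, ion_type: str) -> list:
    """Mark the residues at which an ion of the given type was identified."""
    if ion_type not in ("b", "y"):
        raise ValueError("only 'b' and 'y' ions are handled in this sketch")
    flags = [False] * len(sequence)
    # Match e.g. 'b2', 'b2-H2O' or 'y5-NH3' and capture the ion number.
    pattern = re.compile(rf"^{ion_type}(\d+)")
    for value in values:
        match = pattern.match(value)
        if match is None:
            continue  # different ion type or unexpected label
        number = int(match.group(1))
        if not 1 <= number <= len(sequence):
            continue  # ion number does not fit this peptide
        # b-ions count from the N-terminus, y-ions from the C-terminus.
        index = number - 1 if ion_type == "b" else len(sequence) - number
        flags[index] = True
    return flags


print(identified_ions_sketch(["b2-H2O", "b2", "b3", "b5-NH3"], "NTINHN", "b"))
# [False, True, True, False, True, False]
```

Neutral losses such as ‘-H2O’ are ignored here, so ‘b2-H2O’ and ‘b2’ mark the same position, which is consistent with the example in the description.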
- alphaviz.preprocessing.get_mq_ms2_scan_data(msms: DataFrame, selected_msms_scan: int, raw_data, precursor_id: int) -> DataFrame
Extract MS2 data as a data frame for the specified MSMS scan number and precursor ID from the ‘msms.txt’ MQ output file and raw file.
- Parameters
msms (pd.DataFrame) – Pre-loaded ‘msms.txt’ MQ output file.
selected_msms_scan (int) – MSMS scan number.
raw_data (AlphaTims TimsTOF object) – AlphaTims TimsTOF object.
precursor_id (int) – The identifier of the precursor.
- Returns
For the specified MSMS scan and precursor ID, the extracted data frame contains the following columns:
‘mz_values’
‘intensity_values’
‘ions’
‘wrong_dev_value’: whether the mass_deviation specified in the MQ table was incorrect.
- Return type
pd.DataFrame
- alphaviz.preprocessing.get_mq_unique_proteins(filepath: str) -> list
Extract unique “Protein names” from the specified MaxQuant output file.
- Parameters
filepath (str) – Full path to the file.
- Returns
A list of unique protein names from the specified file.
- Return type
list
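As a rough illustration, collecting the unique values of one column could look like the sketch below. It uses only the standard library and assumes a tab-delimited file with a “Protein names” header, which is how MaxQuant writes its .txt tables. Whether the library also splits multi-protein entries or sorts the result is not stated here, so the sorting and the name `unique_protein_names_sketch` are assumptions of this sketch.

```python
import csv


def unique_protein_names_sketch(filepath: str) -> list:
    """Collect the distinct non-empty values of the 'Protein names' column."""
    names = set()
    with open(filepath, newline="", encoding="utf-8") as handle:
        # MaxQuant .txt output tables are tab-delimited.
        for row in csv.DictReader(handle, delimiter="\t"):
            value = row.get("Protein names") or ""
            if value:
                names.add(value)
    # Sorting is a choice of this sketch, made for a stable result.
    return sorted(names)
```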
- alphaviz.preprocessing.get_protein_info(fasta: dict, protein_ids: str)
Get the name and the length of the protein(s) from the fasta file specifying the protein id(s).
- Parameters
fasta (pyteomics.fasta.IndexedUniProt object) – The Pyteomics object containing information about all proteins from the .fasta file.
protein_ids (str) – The protein IDs, separated by commas.
- Returns
The name and the length of the specified protein(s).
- Return type
tuple of strings
- alphaviz.preprocessing.get_protein_info_from_fastaheader(string: str, **kwargs)
Extract information about protein IDs, protein names and gene names from the “Fasta headers” column of the MQ output tables.
- Parameters
string (str) –
A ‘Fasta header’ string from the MQ output table for one protein group (e.g. from the proteinGroups.txt file).
E.g. a complex one: ‘sp|Q3SY84|K2C71_HUMAN Keratin, type II cytoskeletal 71 OS=Homo sapiens OX=9606 GN=KRT71 PE=1 SV=3;;sp|Q14CN4|K2C72_HUMAN Keratin, type II cytoskeletal 72 OS=Homo sapiens OX=9606 GN=KRT72 PE=1 SV=2;;;sp|Q7RTS7|K2C74_HUMAN Keratin, type II cytoskeletal 74 OS’
- Returns
The function returns a tuple of three strings containing information about the protein names, protein IDs and gene names.
- Return type
a tuple of strings
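The parsing described above can be sketched as follows. This is an illustrative sketch only, with several assumptions: headers follow the UniProt layout ‘db|accession|entry_name description OS=… GN=…’, the “protein name” is taken to be the free-text description before the OS tag, and each group of values is joined with ‘;’. The name `parse_fasta_headers_sketch` is hypothetical, and the actual function may extract or join these fields differently.

```python
import re


def parse_fasta_headers_sketch(string: str) -> tuple:
    """Return (protein names, protein IDs, gene names) as ';'-joined strings."""
    names, ids, genes = [], [], []
    for header in string.split(";"):
        header = header.strip()
        if not header:
            continue  # consecutive ';' produce empty entries
        fields = header.split("|", 2)
        if len(fields) < 3:
            continue  # not in 'db|accession|entry description' form
        ids.append(fields[1])
        # The description sits between the entry name and the ' OS' tag.
        description = fields[2].partition(" ")[2]
        names.append(re.split(r"\s+OS(?:=|$)", description)[0])
        # Truncated headers may lack a gene name, so it is optional.
        gene = re.search(r"\bGN=(\S+)", header)
        if gene:
            genes.append(gene.group(1))
    return ";".join(names), ";".join(ids), ";".join(genes)


example = (
    "sp|Q3SY84|K2C71_HUMAN Keratin, type II cytoskeletal 71 OS=Homo sapiens "
    "OX=9606 GN=KRT71 PE=1 SV=3;;sp|Q14CN4|K2C72_HUMAN Keratin, type II "
    "cytoskeletal 72 OS=Homo sapiens OX=9606 GN=KRT72 PE=1 SV=2"
)
print(parse_fasta_headers_sketch(example))
```

Empty entries between consecutive semicolons are skipped, and a header cut off before its GN field still contributes its ID and name.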
- alphaviz.preprocessing.sort_naturally(line: str, reverse: bool = False) -> str
Sort the string in natural (human) order, e.g. 4,1,6,11 is sorted as 1,4,6,11 rather than 1,11,4,6.
- Parameters
line (str) – The string to be sorted.
reverse (bool) – Whether to sort in reverse order. Defaults to False.
- Returns
A naturally sorted string.
- Return type
str
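Natural ordering is commonly implemented by splitting each item into digit and non-digit runs and comparing the digit runs as integers. The sketch below shows that technique under the assumption, suggested by the example above, that the input is a comma-separated string. The name `sort_naturally_sketch` is hypothetical and the library's own implementation may differ.

```python
import re


def sort_naturally_sketch(line: str, reverse: bool = False) -> str:
    """Sort the comma-separated items of `line` in natural order."""
    def natural_key(item: str) -> list:
        # Split into digit and non-digit runs; compare digit runs as integers.
        parts = re.split(r"(\d+)", item.strip())
        return [int(p) if p.isdecimal() else p.lower() for p in parts]

    items = line.split(",")
    return ",".join(sorted(items, key=natural_key, reverse=reverse))


print(sort_naturally_sketch("4,1,6,11"))                # 1,4,6,11
print(sort_naturally_sketch("4,1,6,11", reverse=True))  # 11,6,4,1
```

Because the split always yields alternating text and number runs, the keys stay comparable even for mixed items such as ‘b2’ and ‘b10’.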